@asb

asb/README.md Secret

Last active January 7, 2026 19:59
Data for per-query energy consumption of LLMs article

Calculating Watt-hours per query from InferenceMAX results

This Gist contains the scripts, input JSON, and script output generated while working to calculate a Watt-hours per query figure using the InferenceMAX LLM serving benchmarks. See the blog post on Muxup.com for full details and explanation.

The JSON files are artifacts downloaded from a December 15th 2025 InferenceMAX run (the JSON is in the .zip available from the "collect-dsr1-results" and "collect-gptoss-results" jobs).

There are also some additional results collected from a December 21st 2025 InferenceMAX run, as these capture the performance improvement from a set of newly added GB200 NVL72 configurations. The .json was taken from https://github.com/InferenceMAX/InferenceMAX/actions/runs/20413316138/job/58699979731.

File overview:

  • agg_*.json: Benchmark results JSON. agg_bmk_additional.json holds the newer GB200 results, with figures for fp4 and fp8 DeepSeek R1 0528 on both the 8k/1k and 1k/1k workloads.
  • _agg_*_result.md: Output from running the matching .json file through process_results.py. For DeepSeek R1, which has both fp4 and fp8 results, we split these into separate output files (except for _agg_bmk_additional_result.md). The files are prefixed with _ so they show up higher in the Gist view.
  • process_results.py: Parses the given results .json and calculates a Wh per query figure. Emits this in a sorted Markdown table.
  • check.py: Parses the given results .json and checks for basic anomalies. It primarily checks that the ratio of input to output token throughput matches the ratio of input sequence length to output sequence length, as expected.
  • README.md: You are here!
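
The throughput-ratio check described for check.py can be illustrated with a minimal sketch. This is not the actual check.py: the function name, its parameters, and the 5% tolerance are assumptions chosen for illustration, and the real script may read different fields from the JSON.

```python
import math


def throughput_ratio_ok(input_tput, output_tput, isl, osl, rel_tol=0.05):
    """Return True if input/output token throughput tracks the ISL/OSL ratio.

    All names and the default tolerance are hypothetical, not taken from check.py.
    """
    if output_tput <= 0 or osl <= 0:
        # A non-positive denominator is itself an anomaly.
        return False
    return math.isclose(input_tput / output_tput, isl / osl, rel_tol=rel_tol)


# Hypothetical 8k/1k results: the first has a throughput ratio of 8 (as expected),
# the second a ratio of 4 (which would be flagged).
print(throughput_ratio_ok(16000.0, 2000.0, 8192, 1024))
print(throughput_ratio_ok(16000.0, 4000.0, 8192, 1024))
```

A result that fails this kind of check suggests the benchmark did not process the intended mix of input and output tokens, so its per-query figure should be treated with suspicion.
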
Workload Intvty (tok/s) E2EL (s) Details Wh/Q PS5 min
fp4 DS R1 0528 1k/1k 21.8 76.2 gb200 dynamo-sglang (48 GPUs disagg, conc: 15000, pfill_dp_attn, dec_dp_attn) 0.17 0.05
fp4 DS R1 0528 1k/1k 22.0 61.4 gb200 dynamo-sglang (48 GPUs disagg, conc: 12000, pfill_dp_attn, dec_dp_attn) 0.17 0.05
fp4 DS R1 0528 1k/1k 26.1 45.8 gb200 dynamo-sglang (48 GPUs disagg, conc: 8192, pfill_dp_attn, dec_dp_attn) 0.19 0.06
fp4 DS R1 0528 1k/1k 39.5 42.0 gb200 dynamo-sglang (64 GPUs disagg, conc: 8192, pfill_dp_attn, dec_dp_attn) 0.23 0.07
fp4 DS R1 0528 1k/1k 45.0 24.2 gb200 dynamo-sglang (64 GPUs disagg, conc: 4096, pfill_dp_attn, dec_dp_attn) 0.26 0.08
fp8 DS R1 0528 1k/1k 30.5 34.3 gb200 dynamo-sglang (48 GPUs disagg, conc: 4096, pfill_dp_attn, dec_dp_attn) 0.28 0.08
fp8 DS R1 0528 1k/1k 42.5 24.9 gb200 dynamo-sglang (72 GPUs disagg, conc: 4096, pfill_dp_attn, dec_dp_attn) 0.31 0.09
fp4 DS R1 0528 1k/1k 56.2 18.5 gb200 dynamo-sglang (64 GPUs disagg, conc: 2048, pfill_dp_attn, dec_dp_attn) 0.40 0.12
fp8 DS R1 0528 1k/1k 48.0 21.4 gb200 dynamo-sglang (72 GPUs disagg, conc: 2048, pfill_dp_attn, dec_dp_attn) 0.53 0.16
fp4 DS R1 0528 1k/1k 63.6 16.0 gb200 dynamo-sglang (64 GPUs disagg, conc: 1024, pfill_dp_attn, dec_dp_attn) 0.70 0.21
fp4 DS R1 0528 8k/1k 47.3 123.6 gb200 dynamo-sglang (72 GPUs disagg, conc: 8192, pfill_dp_attn, dec_dp_attn) 0.79 0.24
fp8 DS R1 0528 8k/1k 39.1 98.5 gb200 dynamo-sglang (72 GPUs disagg, conc: 6144, pfill_dp_attn, dec_dp_attn) 0.86 0.26
fp4 DS R1 0528 8k/1k 47.4 33.8 gb200 dynamo-sglang (72 GPUs disagg, conc: 2048, pfill_dp_attn, dec_dp_attn) 0.87 0.26
fp8 DS R1 0528 8k/1k 39.5 36.5 gb200 dynamo-sglang (72 GPUs disagg, conc: 2048, pfill_dp_attn, dec_dp_attn) 0.96 0.29
fp4 DS R1 0528 8k/1k 49.9 21.9 gb200 dynamo-sglang (72 GPUs disagg, conc: 1024, pfill_dp_attn, dec_dp_attn) 1.12 0.33
fp4 DS R1 0528 1k/1k 67.7 14.5 gb200 dynamo-sglang (64 GPUs disagg, conc: 512, pfill_dp_attn, dec_dp_attn) 1.26 0.38
fp4 DS R1 0528 8k/1k 60.6 102.5 gb200 dynamo-sglang (72 GPUs disagg, conc: 4096, dec_dp_attn) 1.32 0.39
fp8 DS R1 0528 8k/1k 42.7 26.5 gb200 dynamo-sglang (72 GPUs disagg, conc: 1024, pfill_dp_attn, dec_dp_attn) 1.32 0.40
fp4 DS R1 0528 8k/1k 60.9 52.8 gb200 dynamo-sglang (72 GPUs disagg, conc: 2048, dec_dp_attn) 1.34 0.40
fp4 DS R1 0528 8k/1k 61.4 27.9 gb200 dynamo-sglang (72 GPUs disagg, conc: 1024, dec_dp_attn) 1.43 0.43
fp4 DS R1 0528 8k/1k 63.7 17.4 gb200 dynamo-sglang (72 GPUs disagg, conc: 512, dec_dp_attn) 1.73 0.52
fp4 DS R1 0528 1k/1k 70.1 13.8 gb200 dynamo-sglang (12 GPUs disagg, conc: 64, pfill_dp_attn, dec_dp_attn) 1.75 0.53
fp8 DS R1 0528 1k/1k 50.4 42.6 gb200 dynamo-sglang (72 GPUs disagg, conc: 1024, pfill_dp_attn, dec_dp_attn) 2.07 0.62
fp8 DS R1 0528 1k/1k 50.3 19.7 gb200 dynamo-sglang (20 GPUs disagg, conc: 128, pfill_dp_attn, dec_dp_attn) 2.12 0.64
fp8 DS R1 0528 8k/1k 46.5 23.2 gb200 dynamo-sglang (72 GPUs disagg, conc: 512, pfill_dp_attn, dec_dp_attn) 2.28 0.68
fp4 DS R1 0528 1k/1k 93.5 10.5 gb200 dynamo-sglang (12 GPUs disagg, conc: 32, pfill_dp_attn, dec_dp_attn) 2.67 0.80
fp4 DS R1 0528 8k/1k 86.5 13.3 gb200 dynamo-sglang (20 GPUs disagg, conc: 64, dec_dp_attn) 2.88 0.86
fp8 DS R1 0528 1k/1k 65.6 14.9 gb200 dynamo-sglang (20 GPUs disagg, conc: 64, pfill_dp_attn, dec_dp_attn) 3.18 0.95
fp8 DS R1 0528 8k/1k 47.0 22.3 gb200 dynamo-sglang (8 GPUs disagg, conc: 32, pfill_dp_attn, dec_dp_attn) 3.86 1.16
fp4 DS R1 0528 8k/1k 108.0 9.5 gb200 dynamo-sglang (20 GPUs disagg, conc: 32, dec_dp_attn) 4.16 1.25
fp8 DS R1 0528 8k/1k 63.3 16.3 gb200 dynamo-sglang (8 GPUs disagg, conc: 16, pfill_dp_attn, dec_dp_attn) 5.57 1.67
fp4 DS R1 0528 1k/1k 133.8 7.1 gb200 dynamo-sglang (12 GPUs disagg, conc: 8, pfill_dp_attn, dec_dp_attn) 7.22 2.17
fp8 DS R1 0528 1k/1k 98.8 9.9 gb200 dynamo-sglang (20 GPUs disagg, conc: 16, pfill_dp_attn, dec_dp_attn) 8.35 2.51
fp8 DS R1 0528 8k/1k 80.6 12.8 gb200 dynamo-sglang (8 GPUs disagg, conc: 8, pfill_dp_attn, dec_dp_attn) 8.71 2.61
fp4 DS R1 0528 8k/1k 127.7 7.7 gb200 dynamo-sglang (20 GPUs disagg, conc: 8, dec_dp_attn) 13.0 3.91
fp8 DS R1 0528 8k/1k 95.8 10.8 gb200 dynamo-sglang (8 GPUs disagg, conc: 4, pfill_dp_attn, dec_dp_attn) 14.4 4.32
fp4 DS R1 0528 1k/1k 133.5 7.2 gb200 dynamo-sglang (12 GPUs disagg, conc: 4, pfill_dp_attn, dec_dp_attn) 14.5 4.34
fp8 DS R1 0528 1k/1k 99.8 9.5 gb200 dynamo-sglang (20 GPUs disagg, conc: 8, pfill_dp_attn, dec_dp_attn) 16.2 4.86
fp4 DS R1 0528 8k/1k 132.0 7.4 gb200 dynamo-sglang (20 GPUs disagg, conc: 4, dec_dp_attn) 24.7 7.41
fp8 DS R1 0528 1k/1k 104.4 9.2 gb200 dynamo-sglang (20 GPUs disagg, conc: 4, pfill_dp_attn, dec_dp_attn) 31.2 9.37
fp8 DS R1 0528 1k/1k 109.0 8.7 gb200 dynamo-sglang (20 GPUs disagg, conc: 2, pfill_dp_attn, dec_dp_attn) 58.2 17.47
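
The PS5 min column in these tables is consistently about 0.3 times the Wh/Q column, which is what you would get by asking how many minutes a 200 W device takes to use the same energy. The sketch below shows that conversion. The 200 W figure is inferred from the table ratios rather than stated in this README, and the function name is hypothetical.

```python
PS5_WATTS = 200.0  # assumption: inferred from the Wh/Q to PS5 min ratio in the tables


def wh_to_ps5_minutes(wh, ps5_watts=PS5_WATTS):
    """Minutes a device drawing ps5_watts would need to consume wh watt-hours."""
    hours = wh / ps5_watts
    return hours * 60.0


# Wh/Q values copied from rows of the table above.
for wh in (0.17, 1.26, 58.2):
    print(f"{wh} Wh/Q is about {wh_to_ps5_minutes(wh):.2f} PS5 minutes")
```

Small differences from the tabulated values are expected, since the tables show Wh/Q rounded to a few significant figures.
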
Workload Intvty (tok/s) E2EL (s) Details Wh/Q PS5 min
fp4 DS R1 0528 1k/1k 19.0 62.1 gb200 dynamo-trt (12 GPUs disagg, conc: 4300, pfill_dp_attn, dec_dp_attn) 0.12 0.04
fp4 DS R1 0528 1k/1k 31.7 34.5 gb200 dynamo-trt (12 GPUs disagg, conc: 2252, mtp, pfill_dp_attn, dec_dp_attn) 0.12 0.04
fp4 DS R1 0528 1k/1k 29.7 35.4 gb200 dynamo-trt (24 GPUs disagg, conc: 4300, pfill_dp_attn, dec_dp_attn) 0.14 0.04
fp4 DS R1 0528 1k/1k 51.6 20.1 gb200 dynamo-trt (24 GPUs disagg, conc: 2150, mtp, pfill_dp_attn, dec_dp_attn) 0.15 0.05
fp4 DS R1 0528 1k/1k 40.9 24.1 gb200 dynamo-trt (24 GPUs disagg, conc: 2048, pfill_dp_attn, dec_dp_attn) 0.19 0.06
fp4 DS R1 0528 1k/1k 66.7 16.7 gb200 dynamo-trt (20 GPUs disagg, conc: 1075, mtp, pfill_dp_attn, dec_dp_attn) 0.21 0.06
fp4 DS R1 0528 1k/1k 48.7 20.9 gb200 dynamo-trt (20 GPUs disagg, conc: 1075, pfill_dp_attn, dec_dp_attn) 0.27 0.08
fp4 DS R1 0528 1k/1k 86.9 11.7 gb200 dynamo-trt (20 GPUs disagg, conc: 512, mtp, pfill_dp_attn, dec_dp_attn) 0.31 0.09
fp4 DS R1 0528 1k/1k 29.0 32.9 b200-trt trt (4 GPUs, conc: 256, dp_attn) 0.36 0.11
fp4 DS R1 0528 1k/1k 55.4 18.5 gb200 dynamo-trt (36 GPUs disagg, conc: 1075, pfill_dp_attn, dec_dp_attn) 0.43 0.13
fp4 DS R1 0528 1k/1k 103.4 10.1 gb200 dynamo-trt (36 GPUs disagg, conc: 512, mtp, pfill_dp_attn, dec_dp_attn) 0.50 0.15
fp4 DS R1 0528 1k/1k 36.5 26.5 b200-trt trt (8 GPUs, conc: 256, dp_attn) 0.58 0.17
fp4 DS R1 0528 1k/1k 34.6 27.2 b200-trt trt (4 GPUs, conc: 128) 0.58 0.18
fp4 DS R1 0528 1k/1k 37.1 32.7 b200 sglang (4 GPUs, conc: 128) 0.70 0.21
fp4 DS R1 0528 1k/1k 60.2 17.7 gb200 dynamo-trt (36 GPUs disagg, conc: 564) 0.77 0.23
fp4 DS R1 0528 1k/1k 44.8 21.1 b200-trt trt (4 GPUs, conc: 64) 0.90 0.27
fp4 DS R1 0528 1k/1k 37.0 25.6 b200-trt trt (8 GPUs, conc: 128) 1.10 0.33
fp4 DS R1 0528 1k/1k 48.1 26.3 b200 sglang (4 GPUs, conc: 64) 1.13 0.34
fp4 DS R1 0528 1k/1k 46.1 28.0 b200 sglang (8 GPUs, conc: 128) 1.20 0.36
fp4 DS R1 0528 1k/1k 68.7 14.0 gb200 dynamo-trt (36 GPUs disagg, conc: 256) 1.34 0.40
fp4 DS R1 0528 1k/1k 58.2 16.1 b200-trt trt (4 GPUs, conc: 32) 1.38 0.41
fp4 DS R1 0528 1k/1k 130.2 8.3 gb200 dynamo-trt (36 GPUs disagg, conc: 144, mtp) 1.42 0.43
fp4 DS R1 0528 1k/1k 56.0 16.8 b200-trt trt (8 GPUs, conc: 64) 1.45 0.43
fp4 DS R1 0528 1k/1k 25.9 35.8 mi355x sglang (4 GPUs, conc: 64) 1.84 0.55
fp4 DS R1 0528 1k/1k 60.6 22.6 b200 sglang (4 GPUs, conc: 32) 1.93 0.58
fp4 DS R1 0528 1k/1k 59.3 22.8 b200 sglang (8 GPUs, conc: 64) 1.95 0.58
fp4 DS R1 0528 1k/1k 77.8 12.2 b200 sglang (4 GPUs, conc: 16) 2.09 0.63
fp4 DS R1 0528 1k/1k 83.4 11.5 gb200 dynamo-trt (36 GPUs disagg, conc: 128) 2.18 0.65
fp4 DS R1 0528 1k/1k 69.7 13.3 b200-trt trt (4 GPUs, conc: 16) 2.28 0.68
fp4 DS R1 0528 1k/1k 68.6 13.7 b200-trt trt (8 GPUs, conc: 32) 2.35 0.71
fp4 DS R1 0528 1k/1k 39.4 23.6 mi355x sglang (8 GPUs, conc: 64) 2.43 0.73
fp4 DS R1 0528 1k/1k 150.1 6.5 gb200 dynamo-trt (36 GPUs disagg, conc: 64, mtp) 2.49 0.75
fp4 DS R1 0528 1k/1k 31.6 29.4 mi355x sglang (4 GPUs, conc: 32) 3.02 0.91
fp4 DS R1 0528 1k/1k 97.9 9.6 b200-trt trt (4 GPUs, conc: 8) 3.26 0.98
fp4 DS R1 0528 1k/1k 75.0 19.3 b200 sglang (8 GPUs, conc: 32) 3.28 0.98
fp4 DS R1 0528 1k/1k 95.7 9.9 b200 sglang (4 GPUs, conc: 8) 3.38 1.01
fp4 DS R1 0528 1k/1k 91.5 10.4 b200 sglang (8 GPUs, conc: 16) 3.58 1.07
fp4 DS R1 0528 1k/1k 99.7 9.6 gb200 dynamo-trt (36 GPUs disagg, conc: 64) 3.60 1.08
fp4 DS R1 0528 1k/1k 51.2 18.1 mi355x sglang (8 GPUs, conc: 32) 3.73 1.12
fp4 DS R1 0528 1k/1k 81.6 11.4 b200-trt trt (8 GPUs, conc: 16) 3.90 1.17
fp4 DS R1 0528 1k/1k 181.5 5.3 gb200 dynamo-trt (36 GPUs disagg, conc: 32, mtp) 4.05 1.22
fp4 DS R1 0528 1k/1k 38.7 23.9 mi355x sglang (4 GPUs, conc: 16) 4.93 1.48
fp4 DS R1 0528 1k/1k 119.7 7.8 b200-trt trt (4 GPUs, conc: 4) 5.37 1.61
fp4 DS R1 0528 1k/1k 116.1 8.1 b200 sglang (4 GPUs, conc: 4) 5.58 1.67
fp4 DS R1 0528 1k/1k 114.8 8.2 b200-trt trt (8 GPUs, conc: 8) 5.59 1.68
fp4 DS R1 0528 1k/1k 109.1 8.8 b200 sglang (8 GPUs, conc: 8) 5.99 1.80
fp4 DS R1 0528 1k/1k 62.3 14.9 mi355x sglang (8 GPUs, conc: 16) 6.16 1.85
fp4 DS R1 0528 1k/1k 116.2 8.1 gb200 dynamo-trt (36 GPUs disagg, conc: 32) 6.16 1.85
fp4 DS R1 0528 1k/1k 213.1 4.5 gb200 dynamo-trt (36 GPUs disagg, conc: 16, mtp) 6.81 2.04
fp4 DS R1 0528 1k/1k 52.0 18.0 mi355x sglang (4 GPUs, conc: 8) 7.40 2.22
fp4 DS R1 0528 1k/1k 135.7 6.9 b200-trt trt (8 GPUs, conc: 4) 9.47 2.84
fp4 DS R1 0528 1k/1k 127.1 7.4 b200 sglang (8 GPUs, conc: 4) 10.3 3.08
fp4 DS R1 0528 1k/1k 133.0 7.1 gb200 dynamo-trt (36 GPUs disagg, conc: 16) 10.6 3.17
fp4 DS R1 0528 1k/1k 71.7 13.1 mi355x sglang (8 GPUs, conc: 8) 10.8 3.23
fp4 DS R1 0528 1k/1k 64.9 14.4 mi355x sglang (4 GPUs, conc: 4) 11.9 3.56
fp4 DS R1 0528 1k/1k 247.8 3.9 gb200 dynamo-trt (36 GPUs disagg, conc: 8, mtp) 12.0 3.60
fp4 DS R1 0528 1k/1k 148.6 6.4 gb200 dynamo-trt (36 GPUs disagg, conc: 8) 19.5 5.85
fp4 DS R1 0528 1k/1k 77.2 12.1 mi355x sglang (8 GPUs, conc: 4) 20.0 5.99
fp4 DS R1 0528 1k/1k 275.0 4.0 gb200 dynamo-trt (36 GPUs disagg, conc: 4, mtp) 24.4 7.31
fp4 DS R1 0528 1k/1k 156.6 6.1 gb200 dynamo-trt (36 GPUs disagg, conc: 4) 37.0 11.09
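
The Wh/Q column is produced by process_results.py, and the blog post explains the method in full. As a rough illustration of the kind of arithmetic involved, the sketch below divides an assumed total power draw by the query rate implied by output token throughput. The function name, its parameters, and the example numbers are all hypothetical; in particular the per-GPU power figure is a placeholder, not the value used to generate these tables.

```python
def wh_per_query(num_gpus, watts_per_gpu, output_tput_tok_s, osl):
    """Estimate watt-hours per query for a serving configuration.

    Hypothetical helper, not the logic of process_results.py.
    watts_per_gpu is whatever all-in power figure the caller chooses to assume.
    """
    if output_tput_tok_s <= 0 or osl <= 0:
        raise ValueError("throughput and output sequence length must be positive")
    queries_per_second = output_tput_tok_s / osl   # each query emits osl tokens
    total_watts = num_gpus * watts_per_gpu         # joules consumed per second
    joules_per_query = total_watts / queries_per_second
    return joules_per_query / 3600.0               # 1 Wh = 3600 J


# Made-up example: 8 GPUs at an assumed 1000 W each, producing 2000 output
# tokens/s in aggregate with 1000 output tokens per query.
print(f"{wh_per_query(8, 1000.0, 2000.0, 1000):.2f} Wh per query")
```

This makes the trend in the tables easier to read: at low concurrency the same hardware power is spread over far fewer queries per second, so the per-query figure rises sharply.
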

[INFO] Filtered out 45 result(s) based on args

Workload Intvty (tok/s) E2EL (s) Details Wh/Q PS5 min
fp8 DS R1 0528 1k/1k 30.0 34.6 gb200 dynamo-sglang (48 GPUs disagg, conc: 4096, pfill_dp_attn, dec_dp_attn) 0.28 0.08
fp8 DS R1 0528 1k/1k 42.0 25.4 gb200 dynamo-sglang (72 GPUs disagg, conc: 4096, pfill_dp_attn, dec_dp_attn) 0.31 0.09
fp8 DS R1 0528 1k/1k 47.5 21.6 gb200 dynamo-sglang (72 GPUs disagg, conc: 2048, pfill_dp_attn, dec_dp_attn) 0.53 0.16
fp8 DS R1 0528 1k/1k 31.8 29.3 h200 sglang (8 GPUs, conc: 64) 2.00 0.60
fp8 DS R1 0528 1k/1k 50.3 41.1 gb200 dynamo-sglang (72 GPUs disagg, conc: 1024, pfill_dp_attn, dec_dp_attn) 2.00 0.60
fp8 DS R1 0528 1k/1k 40.1 24.4 b200 sglang (8 GPUs, conc: 64) 2.05 0.62
fp8 DS R1 0528 1k/1k 30.2 31.5 h200 trt (8 GPUs, conc: 64) 2.15 0.65
fp8 DS R1 0528 1k/1k 49.7 20.0 gb200 dynamo-sglang (20 GPUs disagg, conc: 128, pfill_dp_attn, dec_dp_attn) 2.18 0.65
fp8 DS R1 0528 1k/1k 35.7 27.4 b200-trt trt (8 GPUs, conc: 64, dp_attn) 2.39 0.72
fp8 DS R1 0528 1k/1k 24.2 38.6 mi300x sglang (8 GPUs, conc: 64) 2.74 0.82
fp8 DS R1 0528 1k/1k 56.6 16.7 b200 sglang (8 GPUs, conc: 32) 2.81 0.84
fp8 DS R1 0528 1k/1k 32.1 28.9 mi355x sglang (8 GPUs, conc: 64) 2.98 0.89
fp8 DS R1 0528 1k/1k 26.9 34.7 mi325x sglang (8 GPUs, conc: 64) 3.01 0.90
fp8 DS R1 0528 1k/1k 64.4 15.1 gb200 dynamo-sglang (20 GPUs disagg, conc: 64, pfill_dp_attn, dec_dp_attn) 3.24 0.97
fp8 DS R1 0528 1k/1k 38.8 24.1 h200 sglang (8 GPUs, conc: 32) 3.28 0.98
fp8 DS R1 0528 1k/1k 38.6 24.4 h200 trt (8 GPUs, conc: 32) 3.33 1.00
fp8 DS R1 0528 1k/1k 44.7 21.0 b200-trt trt (8 GPUs, conc: 32) 3.60 1.08
fp8 DS R1 0528 1k/1k 31.3 29.9 mi300x sglang (8 GPUs, conc: 32) 4.24 1.27
fp8 DS R1 0528 1k/1k 73.0 12.9 b200 sglang (8 GPUs, conc: 16) 4.36 1.31
fp8 DS R1 0528 1k/1k 56.2 16.5 h200 sglang (8 GPUs, conc: 16) 4.49 1.35
fp8 DS R1 0528 1k/1k 34.7 27.0 mi325x sglang (8 GPUs, conc: 32) 4.66 1.40
fp8 DS R1 0528 1k/1k 37.5 24.7 mi355x sglang (8 GPUs, conc: 32) 5.08 1.52
fp8 DS R1 0528 1k/1k 49.5 19.1 h200 trt (8 GPUs, conc: 16) 5.22 1.57
fp8 DS R1 0528 1k/1k 53.0 17.6 b200-trt trt (8 GPUs, conc: 16) 6.02 1.81
fp8 DS R1 0528 1k/1k 93.4 10.1 b200 sglang (8 GPUs, conc: 8) 6.78 2.03
fp8 DS R1 0528 1k/1k 74.9 12.6 h200 sglang (8 GPUs, conc: 8) 6.81 2.04
fp8 DS R1 0528 1k/1k 37.6 24.7 mi300x sglang (8 GPUs, conc: 16) 6.98 2.09
fp8 DS R1 0528 1k/1k 39.6 23.4 mi325x sglang (8 GPUs, conc: 16) 8.06 2.42
fp8 DS R1 0528 1k/1k 98.6 9.9 gb200 dynamo-sglang (20 GPUs disagg, conc: 16, pfill_dp_attn, dec_dp_attn) 8.39 2.52
fp8 DS R1 0528 1k/1k 61.2 15.5 h200 trt (8 GPUs, conc: 8) 8.39 2.52
fp8 DS R1 0528 1k/1k 44.5 20.8 mi355x sglang (8 GPUs, conc: 16) 8.57 2.57
fp8 DS R1 0528 1k/1k 64.5 14.6 b200-trt trt (8 GPUs, conc: 8) 9.96 2.99
fp8 DS R1 0528 1k/1k 114.1 8.2 b200 sglang (8 GPUs, conc: 4) 11.1 3.33
fp8 DS R1 0528 1k/1k 85.7 10.9 h200 sglang (8 GPUs, conc: 4) 12.0 3.59
fp8 DS R1 0528 1k/1k 43.3 21.7 mi300x sglang (8 GPUs, conc: 8) 12.2 3.67
fp8 DS R1 0528 1k/1k 72.4 13.0 h200 trt (8 GPUs, conc: 4) 14.3 4.28
fp8 DS R1 0528 1k/1k 43.1 21.7 mi325x sglang (8 GPUs, conc: 8) 14.9 4.47
fp8 DS R1 0528 1k/1k 50.1 18.6 mi355x sglang (8 GPUs, conc: 8) 15.3 4.58
fp8 DS R1 0528 1k/1k 99.6 9.6 gb200 dynamo-sglang (20 GPUs disagg, conc: 8, pfill_dp_attn, dec_dp_attn) 16.3 4.88
fp8 DS R1 0528 1k/1k 68.6 13.6 b200-trt trt (8 GPUs, conc: 4) 18.7 5.62
fp8 DS R1 0528 1k/1k 46.7 19.9 mi300x sglang (8 GPUs, conc: 4) 22.7 6.81
fp8 DS R1 0528 1k/1k 45.8 20.2 mi325x sglang (8 GPUs, conc: 4) 28.1 8.43
fp8 DS R1 0528 1k/1k 54.1 17.1 mi355x sglang (8 GPUs, conc: 4) 28.2 8.45
fp8 DS R1 0528 1k/1k 100.5 9.6 gb200 dynamo-sglang (20 GPUs disagg, conc: 4, pfill_dp_attn, dec_dp_attn) 32.2 9.67
fp8 DS R1 0528 1k/1k 100.1 9.4 gb200 dynamo-sglang (20 GPUs disagg, conc: 2, pfill_dp_attn, dec_dp_attn) 62.3 18.70

[INFO] Filtered out 60 result(s) based on args

Workload Intvty (tok/s) E2EL (s) Details Wh/Q PS5 min
fp4 DS R1 0528 1k/8k 29.7 251.5 b200-trt trt (4 GPUs, conc: 256, dp_attn) 2.73 0.82
fp4 DS R1 0528 1k/8k 37.7 197.5 b200-trt trt (8 GPUs, conc: 256, dp_attn) 4.31 1.29
fp4 DS R1 0528 1k/8k 34.2 221.2 b200 sglang (4 GPUs, conc: 128) 4.75 1.43
fp4 DS R1 0528 1k/8k 33.1 223.1 b200-trt trt (4 GPUs, conc: 128) 4.79 1.44
fp4 DS R1 0528 1k/8k 45.2 163.7 b200-trt trt (4 GPUs, conc: 64) 7.04 2.11
fp4 DS R1 0528 1k/8k 43.0 172.1 b200-trt trt (8 GPUs, conc: 128) 7.39 2.22
fp4 DS R1 0528 1k/8k 44.5 173.8 b200 sglang (4 GPUs, conc: 64) 7.45 2.23
fp4 DS R1 0528 1k/8k 43.8 175.8 b200 sglang (8 GPUs, conc: 128) 7.55 2.26
fp4 DS R1 0528 1k/8k 58.2 127.2 b200-trt trt (8 GPUs, conc: 64) 10.9 3.28
fp4 DS R1 0528 1k/8k 56.3 132.0 b200-trt trt (4 GPUs, conc: 32) 11.2 3.37
fp4 DS R1 0528 1k/8k 59.4 132.2 b200 sglang (4 GPUs, conc: 32) 11.3 3.38
fp4 DS R1 0528 1k/8k 58.0 134.7 b200 sglang (8 GPUs, conc: 64) 11.5 3.46
fp4 DS R1 0528 1k/8k 76.9 96.3 b200 sglang (4 GPUs, conc: 16) 16.5 4.94
fp4 DS R1 0528 1k/8k 72.5 102.6 b200-trt trt (8 GPUs, conc: 32) 17.5 5.25
fp4 DS R1 0528 1k/8k 76.2 104.6 b200 sglang (8 GPUs, conc: 32) 17.8 5.34
fp4 DS R1 0528 1k/8k 70.3 105.1 b200-trt trt (4 GPUs, conc: 16) 18.0 5.39
fp4 DS R1 0528 1k/8k 41.4 178.1 mi355x sglang (8 GPUs, conc: 64) 18.6 5.57
fp4 DS R1 0528 1k/8k 97.8 75.5 b200 sglang (4 GPUs, conc: 8) 25.9 7.78
fp4 DS R1 0528 1k/8k 97.5 75.6 b200-trt trt (4 GPUs, conc: 8) 26.0 7.79
fp4 DS R1 0528 1k/8k 94.3 78.6 b200 sglang (8 GPUs, conc: 16) 26.9 8.08
fp4 DS R1 0528 1k/8k 52.9 139.9 mi355x sglang (8 GPUs, conc: 32) 29.2 8.75
fp4 DS R1 0528 1k/8k 86.3 85.7 b200-trt trt (8 GPUs, conc: 16) 29.3 8.80
fp4 DS R1 0528 1k/8k 116.5 62.6 b200 sglang (4 GPUs, conc: 4) 43.1 12.92
fp4 DS R1 0528 1k/8k 115.9 62.8 b200-trt trt (4 GPUs, conc: 4) 43.2 12.97
fp4 DS R1 0528 1k/8k 114.3 64.7 b200 sglang (8 GPUs, conc: 8) 44.5 13.34
fp4 DS R1 0528 1k/8k 113.0 65.3 b200-trt trt (8 GPUs, conc: 8) 44.9 13.48
fp4 DS R1 0528 1k/8k 64.7 114.8 mi355x sglang (8 GPUs, conc: 16) 47.4 14.21
fp4 DS R1 0528 1k/8k 131.1 55.5 b200-trt trt (8 GPUs, conc: 4) 76.5 22.94
fp4 DS R1 0528 1k/8k 129.3 56.4 b200 sglang (8 GPUs, conc: 4) 77.7 23.31
fp4 DS R1 0528 1k/8k 72.8 101.6 mi355x sglang (8 GPUs, conc: 8) 84.5 25.34
fp4 DS R1 0528 1k/8k 76.7 96.2 mi355x sglang (8 GPUs, conc: 4) 160.6 48.18

[INFO] Filtered out 35 result(s) based on args

Workload Intvty (tok/s) E2EL (s) Details Wh/Q PS5 min
fp8 DS R1 0528 1k/8k 42.5 176.3 b200 sglang (8 GPUs, conc: 64) 15.0 4.49
fp8 DS R1 0528 1k/8k 31.9 232.2 h200 sglang (8 GPUs, conc: 64) 15.9 4.76
fp8 DS R1 0528 1k/8k 31.2 237.9 h200 trt (8 GPUs, conc: 64) 16.3 4.88
fp8 DS R1 0528 1k/8k 39.1 189.5 b200-trt trt (8 GPUs, conc: 64) 16.3 4.89
fp8 DS R1 0528 1k/8k 24.9 297.6 mi300x sglang (8 GPUs, conc: 64) 21.1 6.34
fp8 DS R1 0528 1k/8k 58.0 127.8 b200 sglang (8 GPUs, conc: 32) 21.7 6.52
fp8 DS R1 0528 1k/8k 33.1 223.1 mi355x sglang (8 GPUs, conc: 64) 23.3 6.98
fp8 DS R1 0528 1k/8k 27.4 270.0 mi325x sglang (8 GPUs, conc: 64) 23.4 7.01
fp8 DS R1 0528 1k/8k 42.0 177.0 h200 sglang (8 GPUs, conc: 32) 24.0 7.20
fp8 DS R1 0528 1k/8k 39.6 187.8 h200 trt (8 GPUs, conc: 32) 25.5 7.66
fp8 DS R1 0528 1k/8k 48.2 154.3 b200-trt trt (8 GPUs, conc: 32) 26.4 7.91
fp8 DS R1 0528 1k/8k 32.1 231.4 mi300x sglang (8 GPUs, conc: 32) 32.6 9.78
fp8 DS R1 0528 1k/8k 75.7 98.3 b200 sglang (8 GPUs, conc: 16) 33.1 9.94
fp8 DS R1 0528 1k/8k 59.5 124.1 h200 sglang (8 GPUs, conc: 16) 33.9 10.17
fp8 DS R1 0528 1k/8k 34.1 217.9 mi325x sglang (8 GPUs, conc: 32) 37.4 11.23
fp8 DS R1 0528 1k/8k 51.2 144.3 h200 trt (8 GPUs, conc: 16) 39.4 11.83
fp8 DS R1 0528 1k/8k 38.4 192.3 mi355x sglang (8 GPUs, conc: 32) 40.1 12.04
fp8 DS R1 0528 1k/8k 57.2 129.2 b200-trt trt (8 GPUs, conc: 16) 44.4 13.31
fp8 DS R1 0528 1k/8k 95.8 77.1 b200 sglang (8 GPUs, conc: 8) 52.4 15.73
fp8 DS R1 0528 1k/8k 75.1 98.0 h200 sglang (8 GPUs, conc: 8) 53.8 16.13
fp8 DS R1 0528 1k/8k 38.0 194.3 mi300x sglang (8 GPUs, conc: 16) 55.1 16.54
fp8 DS R1 0528 1k/8k 40.2 183.8 mi325x sglang (8 GPUs, conc: 16) 63.6 19.08
fp8 DS R1 0528 1k/8k 62.8 117.5 h200 trt (8 GPUs, conc: 8) 64.4 19.31
fp8 DS R1 0528 1k/8k 45.2 164.1 mi355x sglang (8 GPUs, conc: 16) 67.7 20.31
fp8 DS R1 0528 1k/8k 66.3 111.2 b200-trt trt (8 GPUs, conc: 8) 76.7 23.00
fp8 DS R1 0528 1k/8k 115.4 64.0 b200 sglang (8 GPUs, conc: 4) 87.3 26.20
fp8 DS R1 0528 1k/8k 85.4 85.2 h200 sglang (8 GPUs, conc: 4) 93.6 28.07
fp8 DS R1 0528 1k/8k 43.7 168.7 mi300x sglang (8 GPUs, conc: 8) 96.0 28.80
fp8 DS R1 0528 1k/8k 73.1 99.8 h200 trt (8 GPUs, conc: 4) 109.6 32.87
fp8 DS R1 0528 1k/8k 43.0 171.3 mi325x sglang (8 GPUs, conc: 8) 118.7 35.61
fp8 DS R1 0528 1k/8k 50.0 147.7 mi355x sglang (8 GPUs, conc: 8) 122.8 36.84
fp8 DS R1 0528 1k/8k 68.2 106.6 b200-trt trt (8 GPUs, conc: 4) 146.9 44.06
fp8 DS R1 0528 1k/8k 46.4 156.7 mi300x sglang (8 GPUs, conc: 4) 178.3 53.49
fp8 DS R1 0528 1k/8k 45.1 161.0 mi325x sglang (8 GPUs, conc: 4) 223.1 66.94
fp8 DS R1 0528 1k/8k 53.4 137.9 mi355x sglang (8 GPUs, conc: 4) 230.2 69.05

[INFO] Filtered out 31 result(s) based on args

Workload Intvty (tok/s) E2EL (s) Details Wh/Q PS5 min
fp4 DS R1 0528 8k/1k 22.7 46.1 gb200 dynamo-trt (32 GPUs disagg, conc: 2150, mtp, pfill_dp_attn, dec_dp_attn) 0.46 0.14
fp4 DS R1 0528 8k/1k 19.6 53.2 gb200 dynamo-trt (28 GPUs disagg, conc: 2150, pfill_dp_attn, dec_dp_attn) 0.47 0.14
fp4 DS R1 0528 8k/1k 30.3 34.3 gb200 dynamo-trt (48 GPUs disagg, conc: 2150, pfill_dp_attn, dec_dp_attn) 0.52 0.16
fp4 DS R1 0528 8k/1k 55.4 18.7 gb200 dynamo-trt (48 GPUs disagg, conc: 1075, mtp, pfill_dp_attn, dec_dp_attn) 0.58 0.17
fp4 DS R1 0528 8k/1k 41.6 24.6 gb200 dynamo-trt (40 GPUs disagg, conc: 1075, pfill_dp_attn, dec_dp_attn) 0.63 0.19
fp4 DS R1 0528 8k/1k 15.8 62.2 b200-trt trt (4 GPUs, conc: 256, dp_attn) 0.66 0.20
fp4 DS R1 0528 8k/1k 98.4 10.7 gb200 dynamo-trt (64 GPUs disagg, conc: 538, mtp, pfill_dp_attn, dec_dp_attn) 0.88 0.27
fp4 DS R1 0528 8k/1k 22.8 43.2 b200-trt trt (4 GPUs, conc: 128, dp_attn) 0.93 0.28
fp4 DS R1 0528 8k/1k 22.4 44.7 b200-trt trt (8 GPUs, conc: 256, dp_attn) 0.96 0.29
fp4 DS R1 0528 8k/1k 58.8 17.5 gb200 dynamo-trt (48 GPUs disagg, conc: 538, pfill_dp_attn, dec_dp_attn) 1.08 0.32
fp4 DS R1 0528 8k/1k 113.0 9.3 gb200 dynamo-trt (52 GPUs disagg, conc: 269, mtp, pfill_dp_attn, dec_dp_attn) 1.25 0.37
fp4 DS R1 0528 8k/1k 18.7 59.3 b200 sglang (4 GPUs, conc: 128) 1.25 0.38
fp4 DS R1 0528 8k/1k 30.6 32.0 b200-trt trt (4 GPUs, conc: 64, dp_attn) 1.38 0.41
fp4 DS R1 0528 8k/1k 31.1 32.2 b200-trt trt (8 GPUs, conc: 128, dp_attn) 1.39 0.42
fp4 DS R1 0528 8k/1k 30.3 39.4 b200 sglang (4 GPUs, conc: 64) 1.67 0.50
fp4 DS R1 0528 8k/1k 64.8 15.3 gb200 dynamo-trt (48 GPUs disagg, conc: 256, pfill_dp_attn, dec_dp_attn) 1.95 0.59
fp4 DS R1 0528 8k/1k 41.3 23.1 b200-trt trt (4 GPUs, conc: 32) 1.98 0.59
fp4 DS R1 0528 8k/1k 38.9 25.8 b200-trt trt (8 GPUs, conc: 64, dp_attn) 2.22 0.67
fp4 DS R1 0528 8k/1k 128.5 8.0 gb200 dynamo-trt (52 GPUs disagg, conc: 128, mtp, pfill_dp_attn, dec_dp_attn) 2.25 0.67
fp4 DS R1 0528 8k/1k 45.3 28.3 b200 sglang (4 GPUs, conc: 32) 2.41 0.72
fp4 DS R1 0528 8k/1k 55.2 17.1 b200-trt trt (4 GPUs, conc: 16) 2.95 0.88
fp4 DS R1 0528 8k/1k 52.6 18.1 b200-trt trt (8 GPUs, conc: 32) 3.11 0.93
fp4 DS R1 0528 8k/1k 81.6 18.3 gb200 dynamo-trt (28 GPUs disagg, conc: 102) 3.41 1.02
fp4 DS R1 0528 8k/1k 146.9 10.7 gb200 dynamo-trt (28 GPUs disagg, conc: 54, mtp) 3.77 1.13
fp4 DS R1 0528 8k/1k 62.3 22.0 b200 sglang (4 GPUs, conc: 16) 3.78 1.13
fp4 DS R1 0528 8k/1k 94.0 10.8 gb200 dynamo-trt (28 GPUs disagg, conc: 48) 4.27 1.28
fp4 DS R1 0528 8k/1k 72.0 13.2 b200-trt trt (4 GPUs, conc: 8) 4.54 1.36
fp4 DS R1 0528 8k/1k 70.7 13.3 b200-trt trt (8 GPUs, conc: 16) 4.60 1.38
fp4 DS R1 0528 8k/1k 20.7 45.1 mi355x sglang (8 GPUs, conc: 64) 4.63 1.39
fp4 DS R1 0528 8k/1k 178.4 5.9 gb200 dynamo-trt (28 GPUs disagg, conc: 24, mtp) 4.64 1.39
fp4 DS R1 0528 8k/1k 31.3 30.0 mi355x sglang (8 GPUs, conc: 32) 6.15 1.85
fp4 DS R1 0528 8k/1k 80.3 18.1 b200 sglang (4 GPUs, conc: 8) 6.16 1.85
fp4 DS R1 0528 8k/1k 76.5 19.4 b200 sglang (8 GPUs, conc: 16) 6.63 1.99
fp4 DS R1 0528 8k/1k 98.6 9.7 b200-trt trt (8 GPUs, conc: 8) 6.64 1.99
fp4 DS R1 0528 8k/1k 92.7 10.2 b200-trt trt (4 GPUs, conc: 4) 7.01 2.10
fp4 DS R1 0528 8k/1k 109.3 9.0 gb200 dynamo-trt (28 GPUs disagg, conc: 24) 7.06 2.12
fp4 DS R1 0528 8k/1k 211.4 4.8 gb200 dynamo-trt (28 GPUs disagg, conc: 12, mtp) 7.64 2.29
fp4 DS R1 0528 8k/1k 45.8 20.6 mi355x sglang (8 GPUs, conc: 16) 8.45 2.53
fp4 DS R1 0528 8k/1k 121.8 7.8 b200-trt trt (8 GPUs, conc: 4) 10.7 3.21
fp4 DS R1 0528 8k/1k 87.7 16.1 b200 sglang (4 GPUs, conc: 4) 10.9 3.28
fp4 DS R1 0528 8k/1k 78.9 16.9 b200 sglang (8 GPUs, conc: 8) 11.5 3.45
fp4 DS R1 0528 8k/1k 124.5 7.7 gb200 dynamo-trt (28 GPUs disagg, conc: 12) 12.2 3.66
fp4 DS R1 0528 8k/1k 249.9 4.0 gb200 dynamo-trt (28 GPUs disagg, conc: 6, mtp) 12.6 3.77
fp4 DS R1 0528 8k/1k 59.3 16.1 mi355x sglang (8 GPUs, conc: 8) 13.2 3.97
fp4 DS R1 0528 8k/1k 94.5 15.5 b200 sglang (8 GPUs, conc: 4) 21.1 6.32
fp4 DS R1 0528 8k/1k 137.1 7.1 gb200 dynamo-trt (28 GPUs disagg, conc: 6) 22.2 6.65
fp4 DS R1 0528 8k/1k 278.5 3.6 gb200 dynamo-trt (28 GPUs disagg, conc: 3, mtp) 22.7 6.82
fp4 DS R1 0528 8k/1k 68.6 13.9 mi355x sglang (8 GPUs, conc: 4) 23.0 6.89
fp4 DS R1 0528 8k/1k 150.9 6.4 gb200 dynamo-trt (28 GPUs disagg, conc: 3) 39.4 11.82

[INFO] Filtered out 44 result(s) based on args

Workload Intvty (tok/s) E2EL (s) Details Wh/Q PS5 min
fp8 DS R1 0528 8k/1k 31.2 178.9 gb200 dynamo-sglang (72 GPUs disagg, conc: 4096, pfill_dp_attn, dec_dp_attn) 2.48 0.74
fp8 DS R1 0528 8k/1k 31.2 93.6 gb200 dynamo-sglang (72 GPUs disagg, conc: 2048, pfill_dp_attn, dec_dp_attn) 2.71 0.81
fp8 DS R1 0528 8k/1k 31.3 55.2 gb200 dynamo-sglang (72 GPUs disagg, conc: 1024, pfill_dp_attn, dec_dp_attn) 3.13 0.94
fp8 DS R1 0528 8k/1k 20.9 48.8 h200 trt (8 GPUs, conc: 64, dp_attn) 3.32 1.00
fp8 DS R1 0528 8k/1k 19.5 49.6 h200 sglang (8 GPUs, conc: 64) 3.39 1.02
fp8 DS R1 0528 8k/1k 23.9 39.9 b200-trt trt (8 GPUs, conc: 64) 3.39 1.02
fp8 DS R1 0528 8k/1k 22.3 44.5 b200 sglang (8 GPUs, conc: 64) 3.74 1.12
fp8 DS R1 0528 8k/1k 31.3 39.2 gb200 dynamo-sglang (72 GPUs disagg, conc: 576, pfill_dp_attn, dec_dp_attn) 3.78 1.13
fp8 DS R1 0528 8k/1k 31.5 38.4 gb200 dynamo-sglang (72 GPUs disagg, conc: 512, pfill_dp_attn, dec_dp_attn) 4.03 1.21
fp8 DS R1 0528 8k/1k 29.2 32.7 h200 sglang (8 GPUs, conc: 32) 4.47 1.34
fp8 DS R1 0528 8k/1k 31.5 36.3 gb200 dynamo-sglang (72 GPUs disagg, conc: 448, pfill_dp_attn, dec_dp_attn) 4.57 1.37
fp8 DS R1 0528 8k/1k 31.6 34.3 gb200 dynamo-sglang (72 GPUs disagg, conc: 384, pfill_dp_attn, dec_dp_attn) 4.62 1.39
fp8 DS R1 0528 8k/1k 27.9 34.2 h200 trt (8 GPUs, conc: 32) 4.68 1.40
fp8 DS R1 0528 8k/1k 36.0 28.1 b200 sglang (8 GPUs, conc: 32) 4.72 1.42
fp8 DS R1 0528 8k/1k 34.5 27.7 b200-trt trt (8 GPUs, conc: 32) 4.76 1.43
fp8 DS R1 0528 8k/1k 13.5 71.1 mi300x sglang (8 GPUs, conc: 64) 4.98 1.49
fp8 DS R1 0528 8k/1k 18.6 50.3 mi355x sglang (8 GPUs, conc: 64) 5.17 1.55
fp8 DS R1 0528 8k/1k 14.8 65.0 mi325x sglang (8 GPUs, conc: 64) 5.54 1.66
fp8 DS R1 0528 8k/1k 45.5 20.8 h200 sglang (8 GPUs, conc: 16) 5.73 1.72
fp8 DS R1 0528 8k/1k 31.3 31.0 gb200 dynamo-sglang (72 GPUs disagg, conc: 256, pfill_dp_attn, dec_dp_attn) 6.20 1.86
fp8 DS R1 0528 8k/1k 54.1 19.1 b200 sglang (8 GPUs, conc: 16) 6.42 1.93
fp8 DS R1 0528 8k/1k 39.7 23.8 h200 trt (8 GPUs, conc: 16) 6.54 1.96
fp8 DS R1 0528 8k/1k 20.5 47.1 mi300x sglang (8 GPUs, conc: 32) 6.67 2.00
fp8 DS R1 0528 8k/1k 46.1 20.5 b200-trt trt (8 GPUs, conc: 16) 7.10 2.13
fp8 DS R1 0528 8k/1k 26.7 34.9 mi355x sglang (8 GPUs, conc: 32) 7.18 2.15
fp8 DS R1 0528 8k/1k 22.7 42.4 mi325x sglang (8 GPUs, conc: 32) 7.33 2.20
fp8 DS R1 0528 8k/1k 64.5 14.9 h200 sglang (8 GPUs, conc: 8) 8.14 2.44
fp8 DS R1 0528 8k/1k 29.5 32.4 mi300x sglang (8 GPUs, conc: 16) 9.27 2.78
fp8 DS R1 0528 8k/1k 75.2 14.1 b200 sglang (8 GPUs, conc: 8) 9.52 2.85
fp8 DS R1 0528 8k/1k 53.8 17.8 h200 trt (8 GPUs, conc: 8) 9.71 2.91
fp8 DS R1 0528 8k/1k 31.4 30.3 mi325x sglang (8 GPUs, conc: 16) 10.6 3.18
fp8 DS R1 0528 8k/1k 35.9 26.1 mi355x sglang (8 GPUs, conc: 16) 10.7 3.21
fp8 DS R1 0528 8k/1k 57.8 16.5 b200-trt trt (8 GPUs, conc: 8) 11.3 3.40
fp8 DS R1 0528 8k/1k 30.9 32.2 gb200 dynamo-sglang (72 GPUs disagg, conc: 128, pfill_dp_attn, dec_dp_attn) 12.9 3.88
fp8 DS R1 0528 8k/1k 101.3 9.6 b200 sglang (8 GPUs, conc: 4) 12.9 3.88
fp8 DS R1 0528 8k/1k 77.7 12.2 h200 sglang (8 GPUs, conc: 4) 13.4 4.03
fp8 DS R1 0528 8k/1k 37.7 25.5 mi300x sglang (8 GPUs, conc: 8) 14.4 4.33
fp8 DS R1 0528 8k/1k 66.9 14.2 h200 trt (8 GPUs, conc: 4) 15.6 4.67
fp8 DS R1 0528 8k/1k 37.9 25.2 mi325x sglang (8 GPUs, conc: 8) 17.4 5.22
fp8 DS R1 0528 8k/1k 43.7 21.6 mi355x sglang (8 GPUs, conc: 8) 17.7 5.32
fp8 DS R1 0528 8k/1k 64.3 14.7 b200-trt trt (8 GPUs, conc: 4) 20.2 6.06
fp8 DS R1 0528 8k/1k 43.4 21.9 mi300x sglang (8 GPUs, conc: 4) 24.9 7.47
fp8 DS R1 0528 8k/1k 42.5 22.3 mi325x sglang (8 GPUs, conc: 4) 30.8 9.25
fp8 DS R1 0528 8k/1k 50.0 18.7 mi355x sglang (8 GPUs, conc: 4) 30.9 9.28

[INFO] Filtered out 49 result(s) based on args

Workload Intvty (tok/s) E2EL (s) Details Wh/Q PS5 min
fp4 GPT-OSS 120B 1k/1k 78.5 11.8 b200-trt trt (1 GPUs, conc: 128) 0.06 0.02
fp4 GPT-OSS 120B 1k/1k 152.9 6.2 b200-trt trt (2 GPUs, conc: 128, dp_attn) 0.07 0.02
fp4 GPT-OSS 120B 1k/1k 69.5 13.4 b200 vllm (1 GPUs, conc: 128) 0.07 0.02
fp4 GPT-OSS 120B 1k/1k 105.7 8.8 b200 vllm (2 GPUs, conc: 128) 0.09 0.03
fp4 GPT-OSS 120B 1k/1k 62.7 14.9 mi355x vllm (1 GPUs, conc: 128) 0.10 0.03
fp4 GPT-OSS 120B 1k/1k 101.5 9.2 b200-trt trt (1 GPUs, conc: 64) 0.10 0.03
fp4 GPT-OSS 120B 1k/1k 208.3 4.5 b200-trt trt (2 GPUs, conc: 64, dp_attn) 0.10 0.03
fp4 GPT-OSS 120B 1k/1k 94.1 9.9 b200 vllm (1 GPUs, conc: 64) 0.11 0.03
fp4 GPT-OSS 120B 1k/1k 85.0 11.0 mi355x vllm (1 GPUs, conc: 64) 0.14 0.04
fp4 GPT-OSS 120B 1k/1k 139.6 6.7 b200 vllm (2 GPUs, conc: 64) 0.14 0.04
fp4 GPT-OSS 120B 1k/1k 255.7 3.7 b200-trt trt (4 GPUs, conc: 64, dp_attn) 0.16 0.05
fp4 GPT-OSS 120B 1k/1k 48.9 19.1 h200 trt (1 GPUs, conc: 64) 0.16 0.05
fp4 GPT-OSS 120B 1k/1k 122.8 7.6 b200 vllm (1 GPUs, conc: 32) 0.16 0.05
fp4 GPT-OSS 120B 1k/1k 241.3 3.9 b200-trt trt (2 GPUs, conc: 32, dp_attn) 0.17 0.05
fp4 GPT-OSS 120B 1k/1k 189.9 4.9 b200-trt trt (2 GPUs, conc: 32) 0.21 0.06
fp4 GPT-OSS 120B 1k/1k 188.5 5.0 b200-trt trt (4 GPUs, conc: 64) 0.21 0.06
fp4 GPT-OSS 120B 1k/1k 184.9 5.0 b200 vllm (4 GPUs, conc: 64) 0.22 0.07
fp4 GPT-OSS 120B 1k/1k 110.2 8.5 mi355x vllm (1 GPUs, conc: 32) 0.22 0.07
fp4 GPT-OSS 120B 1k/1k 70.8 13.2 h200 trt (2 GPUs, conc: 64) 0.23 0.07
fp4 GPT-OSS 120B 1k/1k 173.5 5.4 b200 vllm (2 GPUs, conc: 32) 0.23 0.07
fp4 GPT-OSS 120B 1k/1k 43.3 21.5 mi325x vllm (1 GPUs, conc: 64) 0.23 0.07
fp4 GPT-OSS 120B 1k/1k 67.4 13.8 h200 vllm (2 GPUs, conc: 64) 0.24 0.07
fp4 GPT-OSS 120B 1k/1k 34.6 26.9 mi300x vllm (1 GPUs, conc: 64) 0.24 0.07
fp4 GPT-OSS 120B 1k/1k 166.1 5.6 b200 vllm (1 GPUs, conc: 16) 0.24 0.07
fp4 GPT-OSS 120B 1k/1k 65.5 14.2 h200 trt (1 GPUs, conc: 32) 0.24 0.07
fp4 GPT-OSS 120B 1k/1k 65.2 14.3 h100 vllm (2 GPUs, conc: 64) 0.24 0.07
fp4 GPT-OSS 120B 1k/1k 289.4 3.3 b200-trt trt (4 GPUs, conc: 32, dp_attn) 0.29 0.09
fp4 GPT-OSS 120B 1k/1k 57.6 16.2 mi300x vllm (2 GPUs, conc: 64) 0.29 0.09
fp4 GPT-OSS 120B 1k/1k 103.5 9.0 h200 vllm (4 GPUs, conc: 64) 0.31 0.09
fp4 GPT-OSS 120B 1k/1k 100.7 9.3 h100 vllm (4 GPUs, conc: 64) 0.32 0.09
fp4 GPT-OSS 120B 1k/1k 241.3 3.8 b200-trt trt (2 GPUs, conc: 16) 0.33 0.10
fp4 GPT-OSS 120B 1k/1k 60.6 15.5 mi325x vllm (2 GPUs, conc: 64) 0.33 0.10
fp4 GPT-OSS 120B 1k/1k 145.9 6.4 mi355x vllm (1 GPUs, conc: 16) 0.33 0.10
fp4 GPT-OSS 120B 1k/1k 239.9 3.9 b200-trt trt (4 GPUs, conc: 32) 0.33 0.10
fp4 GPT-OSS 120B 1k/1k 92.1 10.1 h200 vllm (2 GPUs, conc: 32) 0.35 0.10
fp4 GPT-OSS 120B 1k/1k 90.3 10.4 h200 trt (2 GPUs, conc: 32) 0.35 0.11
fp4 GPT-OSS 120B 1k/1k 89.6 10.4 h100 vllm (2 GPUs, conc: 32) 0.36 0.11
fp4 GPT-OSS 120B 1k/1k 87.4 10.6 h200 trt (1 GPUs, conc: 16) 0.36 0.11
fp4 GPT-OSS 120B 1k/1k 55.8 16.8 mi325x vllm (1 GPUs, conc: 32) 0.36 0.11
fp4 GPT-OSS 120B 1k/1k 221.3 4.2 b200 vllm (4 GPUs, conc: 32) 0.36 0.11
fp4 GPT-OSS 120B 1k/1k 210.8 4.4 b200 vllm (2 GPUs, conc: 16) 0.38 0.11
fp4 GPT-OSS 120B 1k/1k 212.1 4.4 b200 vllm (1 GPUs, conc: 8) 0.38 0.11
fp4 GPT-OSS 120B 1k/1k 75.6 12.6 mi300x vllm (2 GPUs, conc: 32) 0.44 0.13
fp4 GPT-OSS 120B 1k/1k 89.9 10.4 mi325x vllm (2 GPUs, conc: 32) 0.45 0.13
fp4 GPT-OSS 120B 1k/1k 72.6 12.9 mi300x vllm (4 GPUs, conc: 64) 0.46 0.14
fp4 GPT-OSS 120B 1k/1k 140.0 6.7 h200 vllm (8 GPUs, conc: 64) 0.46 0.14
fp4 GPT-OSS 120B 1k/1k 137.5 6.8 h200 vllm (4 GPUs, conc: 32) 0.46 0.14
fp4 GPT-OSS 120B 1k/1k 87.5 10.8 mi325x vllm (4 GPUs, conc: 64) 0.47 0.14
fp4 GPT-OSS 120B 1k/1k 135.3 6.9 h100 vllm (8 GPUs, conc: 64) 0.47 0.14
fp4 GPT-OSS 120B 1k/1k 131.6 7.1 h100 vllm (4 GPUs, conc: 32) 0.48 0.15
fp4 GPT-OSS 120B 1k/1k 82.1 11.4 mi325x vllm (1 GPUs, conc: 16) 0.49 0.15
fp4 GPT-OSS 120B 1k/1k 314.6 2.9 b200-trt trt (4 GPUs, conc: 16) 0.50 0.15
fp4 GPT-OSS 120B 1k/1k 124.8 7.4 h200 vllm (2 GPUs, conc: 16) 0.51 0.15
fp4 GPT-OSS 120B 1k/1k 310.0 3.0 b200-trt trt (2 GPUs, conc: 8) 0.52 0.15
fp4 GPT-OSS 120B 1k/1k 189.3 5.0 mi355x vllm (1 GPUs, conc: 8) 0.52 0.16
fp4 GPT-OSS 120B 1k/1k 119.7 7.8 h100 vllm (2 GPUs, conc: 16) 0.53 0.16
fp4 GPT-OSS 120B 1k/1k 118.1 7.9 h200 trt (1 GPUs, conc: 8) 0.54 0.16
fp4 GPT-OSS 120B 1k/1k 114.9 8.1 h200 trt (2 GPUs, conc: 16) 0.55 0.16
fp4 GPT-OSS 120B 1k/1k 108.7 8.6 h200 trt (4 GPUs, conc: 32) 0.59 0.18
fp4 GPT-OSS 120B 1k/1k 265.8 3.5 b200 vllm (4 GPUs, conc: 16) 0.60 0.18
fp4 GPT-OSS 120B 1k/1k 267.2 3.5 b200 vllm (1 GPUs, conc: 4) 0.60 0.18
fp4 GPT-OSS 120B 1k/1k 258.7 3.6 b200 vllm (2 GPUs, conc: 8) 0.62 0.19
fp4 GPT-OSS 120B 1k/1k 102.0 9.1 mi300x vllm (2 GPUs, conc: 16) 0.64 0.19
fp4 GPT-OSS 120B 1k/1k 104.0 9.1 mi300x vllm (4 GPUs, conc: 32) 0.64 0.19
fp4 GPT-OSS 120B 1k/1k 117.8 7.9 mi325x vllm (2 GPUs, conc: 16) 0.68 0.20
fp4 GPT-OSS 120B 1k/1k 176.1 5.3 h200 vllm (4 GPUs, conc: 16) 0.72 0.22
fp4 GPT-OSS 120B 1k/1k 110.5 8.5 mi325x vllm (1 GPUs, conc: 8) 0.73 0.22
fp4 GPT-OSS 120B 1k/1k 173.6 5.4 h200 vllm (8 GPUs, conc: 32) 0.74 0.22
fp4 GPT-OSS 120B 1k/1k 170.1 5.4 h100 vllm (4 GPUs, conc: 16) 0.74 0.22
fp4 GPT-OSS 120B 1k/1k 108.0 8.6 mi325x vllm (4 GPUs, conc: 32) 0.74 0.22
fp4 GPT-OSS 120B 1k/1k 170.8 5.5 h100 vllm (8 GPUs, conc: 32) 0.75 0.22
fp4 GPT-OSS 120B 1k/1k 105.6 8.9 mi325x vllm (8 GPUs, conc: 64) 0.77 0.23
fp4 GPT-OSS 120B 1k/1k 161.9 5.8 h200 vllm (2 GPUs, conc: 8) 0.79 0.24
fp4 GPT-OSS 120B 1k/1k 158.6 5.9 h100 vllm (2 GPUs, conc: 8) 0.80 0.24
fp4 GPT-OSS 120B 1k/1k 156.1 6.0 h200 vllm (1 GPUs, conc: 4) 0.82 0.24
fp4 GPT-OSS 120B 1k/1k 238.5 3.9 mi355x vllm (1 GPUs, conc: 4) 0.83 0.25
fp4 GPT-OSS 120B 1k/1k 373.4 2.5 b200-trt trt (2 GPUs, conc: 4) 0.86 0.26
fp4 GPT-OSS 120B 1k/1k 146.3 6.4 h200 trt (2 GPUs, conc: 8) 0.87 0.26
fp4 GPT-OSS 120B 1k/1k 145.3 6.4 h200 trt (1 GPUs, conc: 4) 0.88 0.26
fp4 GPT-OSS 120B 1k/1k 366.9 2.6 b200-trt trt (4 GPUs, conc: 8) 0.88 0.26
fp4 GPT-OSS 120B 1k/1k 138.5 6.7 h200 trt (4 GPUs, conc: 16) 0.91 0.27
fp4 GPT-OSS 120B 1k/1k 133.7 7.0 mi300x vllm (2 GPUs, conc: 8) 0.99 0.30
fp4 GPT-OSS 120B 1k/1k 302.0 3.1 b200 vllm (4 GPUs, conc: 8) 1.06 0.32
fp4 GPT-OSS 120B 1k/1k 151.6 6.4 mi325x vllm (2 GPUs, conc: 8) 1.09 0.33
fp4 GPT-OSS 120B 1k/1k 289.6 3.2 b200 vllm (2 GPUs, conc: 4) 1.10 0.33
fp4 GPT-OSS 120B 1k/1k 115.3 8.0 mi300x vllm (4 GPUs, conc: 16) 1.13 0.34
fp4 GPT-OSS 120B 1k/1k 143.6 6.6 mi325x vllm (8 GPUs, conc: 32) 1.15 0.34
fp4 GPT-OSS 120B 1k/1k 138.7 6.7 mi325x vllm (1 GPUs, conc: 4) 1.16 0.35
fp4 GPT-OSS 120B 1k/1k 215.7 4.4 h200 vllm (4 GPUs, conc: 8) 1.18 0.35
fp4 GPT-OSS 120B 1k/1k 212.8 4.3 h200 vllm (8 GPUs, conc: 16) 1.19 0.36
fp4 GPT-OSS 120B 1k/1k 208.7 4.4 h100 vllm (8 GPUs, conc: 16) 1.21 0.36
fp4 GPT-OSS 120B 1k/1k 208.8 4.5 h100 vllm (4 GPUs, conc: 8) 1.22 0.37
fp4 GPT-OSS 120B 1k/1k 202.0 4.6 h200 vllm (2 GPUs, conc: 4) 1.26 0.38
fp4 GPT-OSS 120B 1k/1k 125.9 7.5 mi325x vllm (4 GPUs, conc: 16) 1.29 0.39
fp4 GPT-OSS 120B 1k/1k 191.3 4.9 h100 vllm (2 GPUs, conc: 4) 1.33 0.40
fp4 GPT-OSS 120B 1k/1k 169.5 5.5 h200 trt (2 GPUs, conc: 4) 1.51 0.45
fp4 GPT-OSS 120B 1k/1k 424.3 2.2 b200-trt trt (4 GPUs, conc: 4) 1.51 0.45
fp4 GPT-OSS 120B 1k/1k 165.5 5.7 h200 trt (4 GPUs, conc: 8) 1.55 0.46
fp4 GPT-OSS 120B 1k/1k 170.6 5.7 mi300x vllm (4 GPUs, conc: 8) 1.59 0.48
fp4 GPT-OSS 120B 1k/1k 243.2 3.9 mi355x vllm (8 GPUs, conc: 16) 1.62 0.48
fp4 GPT-OSS 120B 1k/1k 162.8 5.7 mi300x vllm (2 GPUs, conc: 4) 1.62 0.49
fp4 GPT-OSS 120B 1k/1k 241.1 3.9 mi355x vllm (4 GPUs, conc: 8) 1.64 0.49
fp4 GPT-OSS 120B 1k/1k 388.6 2.4 b200-trt trt (8 GPUs, conc: 8) 1.66 0.50
fp4 GPT-OSS 120B 1k/1k 182.3 5.1 mi325x vllm (8 GPUs, conc: 16) 1.75 0.52
fp4 GPT-OSS 120B 1k/1k 153.6 6.2 mi300x vllm (8 GPUs, conc: 16) 1.75 0.53
fp4 GPT-OSS 120B 1k/1k 178.6 5.2 mi325x vllm (2 GPUs, conc: 4) 1.80 0.54
fp4 GPT-OSS 120B 1k/1k 333.3 2.8 b200 vllm (4 GPUs, conc: 4) 1.92 0.58
fp4 GPT-OSS 120B 1k/1k 161.9 5.8 mi325x vllm (4 GPUs, conc: 8) 1.98 0.60
fp4 GPT-OSS 120B 1k/1k 252.3 3.7 h100 vllm (8 GPUs, conc: 8) 2.02 0.61
fp4 GPT-OSS 120B 1k/1k 250.6 3.7 h200 vllm (8 GPUs, conc: 8) 2.03 0.61
fp4 GPT-OSS 120B 1k/1k 249.0 3.7 h200 vllm (4 GPUs, conc: 4) 2.05 0.61
fp4 GPT-OSS 120B 1k/1k 246.1 3.8 h100 vllm (4 GPUs, conc: 4) 2.07 0.62
fp4 GPT-OSS 120B 1k/1k 307.7 3.0 b200 vllm (8 GPUs, conc: 8) 2.08 0.62
fp4 GPT-OSS 120B 1k/1k 202.3 4.6 h200 trt (4 GPUs, conc: 4) 2.53 0.76
fp4 GPT-OSS 120B 1k/1k 196.4 4.8 mi300x vllm (8 GPUs, conc: 8) 2.69 0.81
fp4 GPT-OSS 120B 1k/1k 461.7 2.0 b200-trt trt (8 GPUs, conc: 4) 2.79 0.84
fp4 GPT-OSS 120B 1k/1k 181.6 5.2 h200 trt (8 GPUs, conc: 8) 2.83 0.85
fp4 GPT-OSS 120B 1k/1k 268.4 3.5 mi355x vllm (8 GPUs, conc: 8) 2.95 0.88
fp4 GPT-OSS 120B 1k/1k 257.9 3.6 mi355x vllm (4 GPUs, conc: 4) 3.06 0.92
fp4 GPT-OSS 120B 1k/1k 169.7 5.5 mi300x vllm (4 GPUs, conc: 4) 3.11 0.93
fp4 GPT-OSS 120B 1k/1k 204.9 4.6 mi325x vllm (8 GPUs, conc: 8) 3.18 0.95
fp4 GPT-OSS 120B 1k/1k 177.6 5.2 mi325x vllm (4 GPUs, conc: 4) 3.62 1.09
fp4 GPT-OSS 120B 1k/1k 280.4 3.3 h200 vllm (8 GPUs, conc: 4) 3.63 1.09
fp4 GPT-OSS 120B 1k/1k 277.7 3.3 h100 vllm (8 GPUs, conc: 4) 3.67 1.10
fp4 GPT-OSS 120B 1k/1k 340.1 2.7 b200 vllm (8 GPUs, conc: 4) 3.77 1.13
fp4 GPT-OSS 120B 1k/1k 232.3 4.0 h200 trt (8 GPUs, conc: 4) 4.44 1.33
fp4 GPT-OSS 120B 1k/1k 205.5 4.6 mi300x vllm (8 GPUs, conc: 4) 5.28 1.58
fp4 GPT-OSS 120B 1k/1k 225.2 4.1 mi325x vllm (8 GPUs, conc: 4) 5.71 1.71
fp4 GPT-OSS 120B 1k/1k 274.9 3.4 mi355x vllm (8 GPUs, conc: 4) 5.74 1.72
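
A quick way to read the last two columns together: across these rows `PS5 min` comes out at roughly 0.3 × `Wh/Q` (for example 0.50 Wh against 0.15 min), which is what a steady 200 W draw would give. The sketch below assumes that 200 W figure, which is inferred from the table rather than stated in it. `PS5_WATTS` and `wh_to_ps5_minutes` are illustrative names, not taken from `process_results.py`.

```python
# Assumption: 200 W is inferred from the Wh/Q : PS5 min ratio in the table,
# not a figure quoted in this Gist.
PS5_WATTS = 200.0


def wh_to_ps5_minutes(wh: float, watts: float = PS5_WATTS) -> float:
    """Return how many minutes a device drawing `watts` runs on `wh` watt-hours."""
    hours = wh / watts
    return hours * 60.0


if __name__ == "__main__":
    # 0.50 Wh/Q is one of the values in the table above.
    print(f"{wh_to_ps5_minutes(0.50):.2f}")  # 0.15
```

Because the table's `Wh/Q` values are already rounded to two or three significant figures, recomputing from them can differ from the printed `PS5 min` in the last digit.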
Workload Intvty (tok/s) E2EL (s) Details Wh/Q PS5 min
fp4 GPT-OSS 120B 1k/8k 80.5 91.6 b200-trt trt (1 GPUs, conc: 128) 0.49 0.15
fp4 GPT-OSS 120B 1k/8k 72.3 102.0 b200 vllm (1 GPUs, conc: 128) 0.55 0.16
fp4 GPT-OSS 120B 1k/8k 144.9 51.1 b200-trt trt (2 GPUs, conc: 128, dp_attn) 0.55 0.17
fp4 GPT-OSS 120B 1k/8k 129.4 57.0 b200-trt trt (2 GPUs, conc: 128) 0.61 0.18
fp4 GPT-OSS 120B 1k/8k 111.1 66.6 b200-trt trt (1 GPUs, conc: 64) 0.71 0.21
fp4 GPT-OSS 120B 1k/8k 109.9 67.1 b200 vllm (2 GPUs, conc: 128) 0.72 0.22
fp4 GPT-OSS 120B 1k/8k 102.6 72.1 b200 vllm (1 GPUs, conc: 64) 0.77 0.23
fp4 GPT-OSS 120B 1k/8k 205.4 36.1 b200-trt trt (2 GPUs, conc: 64, dp_attn) 0.78 0.23
fp4 GPT-OSS 120B 1k/8k 60.5 122.0 mi355x vllm (1 GPUs, conc: 128) 0.80 0.24
fp4 GPT-OSS 120B 1k/8k 178.2 41.4 b200-trt trt (4 GPUs, conc: 128) 0.89 0.27
fp4 GPT-OSS 120B 1k/8k 165.0 44.8 b200-trt trt (2 GPUs, conc: 64) 0.96 0.29
fp4 GPT-OSS 120B 1k/8k 146.8 50.4 b200 vllm (2 GPUs, conc: 64) 1.08 0.33
fp4 GPT-OSS 120B 1k/8k 86.4 85.6 mi355x vllm (1 GPUs, conc: 64) 1.12 0.34
fp4 GPT-OSS 120B 1k/8k 132.9 55.9 b200 vllm (1 GPUs, conc: 32) 1.19 0.36
fp4 GPT-OSS 120B 1k/8k 54.7 160.9 h200 trt (1 GPUs, conc: 64) 1.38 0.41
fp4 GPT-OSS 120B 1k/8k 230.5 32.1 b200-trt trt (4 GPUs, conc: 64) 1.38 0.41
fp4 GPT-OSS 120B 1k/8k 214.1 34.7 b200-trt trt (2 GPUs, conc: 32) 1.48 0.44
fp4 GPT-OSS 120B 1k/8k 79.8 92.6 h200 trt (2 GPUs, conc: 64) 1.58 0.47
fp4 GPT-OSS 120B 1k/8k 197.7 37.4 b200 vllm (4 GPUs, conc: 64) 1.61 0.48
fp4 GPT-OSS 120B 1k/8k 114.1 65.1 mi355x vllm (1 GPUs, conc: 32) 1.69 0.51
fp4 GPT-OSS 120B 1k/8k 73.7 100.7 h200 trt (1 GPUs, conc: 32) 1.71 0.51
fp4 GPT-OSS 120B 1k/8k 183.7 40.4 b200 vllm (2 GPUs, conc: 32) 1.73 0.52
fp4 GPT-OSS 120B 1k/8k 45.5 162.6 mi325x vllm (1 GPUs, conc: 64) 1.74 0.52
fp4 GPT-OSS 120B 1k/8k 37.1 199.6 mi300x vllm (1 GPUs, conc: 64) 1.75 0.53
fp4 GPT-OSS 120B 1k/8k 71.6 103.3 h200 vllm (2 GPUs, conc: 64) 1.76 0.53
fp4 GPT-OSS 120B 1k/8k 179.1 41.2 b200 vllm (1 GPUs, conc: 16) 1.76 0.53
fp4 GPT-OSS 120B 1k/8k 67.3 109.9 h100 vllm (2 GPUs, conc: 64) 1.87 0.56
fp4 GPT-OSS 120B 1k/8k 61.7 120.1 mi300x vllm (2 GPUs, conc: 64) 2.12 0.64
fp4 GPT-OSS 120B 1k/8k 71.7 103.1 mi325x vllm (2 GPUs, conc: 64) 2.22 0.66
fp4 GPT-OSS 120B 1k/8k 280.8 26.5 b200-trt trt (4 GPUs, conc: 32) 2.26 0.68
fp4 GPT-OSS 120B 1k/8k 107.6 68.8 h200 vllm (4 GPUs, conc: 64) 2.35 0.70
fp4 GPT-OSS 120B 1k/8k 107.1 69.1 h200 trt (4 GPUs, conc: 64) 2.37 0.71
fp4 GPT-OSS 120B 1k/8k 263.1 28.1 b200-trt trt (2 GPUs, conc: 16) 2.41 0.72
fp4 GPT-OSS 120B 1k/8k 103.4 71.5 h100 vllm (4 GPUs, conc: 64) 2.44 0.73
fp4 GPT-OSS 120B 1k/8k 101.6 73.2 h200 trt (2 GPUs, conc: 32) 2.49 0.75
fp4 GPT-OSS 120B 1k/8k 153.4 48.2 mi355x vllm (1 GPUs, conc: 16) 2.51 0.75
fp4 GPT-OSS 120B 1k/8k 98.0 75.8 h200 vllm (2 GPUs, conc: 32) 2.58 0.77
fp4 GPT-OSS 120B 1k/8k 94.4 78.7 h100 vllm (2 GPUs, conc: 32) 2.67 0.80
fp4 GPT-OSS 120B 1k/8k 91.9 80.4 h200 trt (1 GPUs, conc: 16) 2.74 0.82
fp4 GPT-OSS 120B 1k/8k 231.6 32.1 b200 vllm (4 GPUs, conc: 32) 2.74 0.82
fp4 GPT-OSS 120B 1k/8k 228.3 32.4 b200 vllm (2 GPUs, conc: 16) 2.78 0.83
fp4 GPT-OSS 120B 1k/8k 219.0 33.7 b200 vllm (1 GPUs, conc: 8) 2.89 0.87
fp4 GPT-OSS 120B 1k/8k 85.5 86.9 mi300x vllm (2 GPUs, conc: 32) 3.05 0.91
fp4 GPT-OSS 120B 1k/8k 80.1 92.3 mi300x vllm (4 GPUs, conc: 64) 3.27 0.98
fp4 GPT-OSS 120B 1k/8k 95.1 78.1 mi325x vllm (2 GPUs, conc: 32) 3.35 1.00
fp4 GPT-OSS 120B 1k/8k 94.9 77.9 mi325x vllm (4 GPUs, conc: 64) 3.37 1.01
fp4 GPT-OSS 120B 1k/8k 146.7 50.4 h200 vllm (8 GPUs, conc: 64) 3.45 1.04
fp4 GPT-OSS 120B 1k/8k 144.5 51.4 h200 vllm (4 GPUs, conc: 32) 3.50 1.05
fp4 GPT-OSS 120B 1k/8k 140.5 52.6 h100 vllm (8 GPUs, conc: 64) 3.60 1.08
fp4 GPT-OSS 120B 1k/8k 138.1 53.8 h100 vllm (4 GPUs, conc: 32) 3.66 1.10
fp4 GPT-OSS 120B 1k/8k 335.7 22.0 b200-trt trt (4 GPUs, conc: 16) 3.78 1.13
fp4 GPT-OSS 120B 1k/8k 130.7 56.9 h200 trt (4 GPUs, conc: 32) 3.87 1.16
fp4 GPT-OSS 120B 1k/8k 127.9 57.8 h200 vllm (2 GPUs, conc: 16) 3.94 1.18
fp4 GPT-OSS 120B 1k/8k 321.7 22.9 b200-trt trt (2 GPUs, conc: 8) 3.94 1.18
fp4 GPT-OSS 120B 1k/8k 189.7 38.9 mi355x vllm (1 GPUs, conc: 8) 4.07 1.22
fp4 GPT-OSS 120B 1k/8k 123.7 59.7 h100 vllm (2 GPUs, conc: 16) 4.07 1.22
fp4 GPT-OSS 120B 1k/8k 122.7 60.2 h200 trt (2 GPUs, conc: 16) 4.11 1.23
fp4 GPT-OSS 120B 1k/8k 121.5 60.7 h200 trt (1 GPUs, conc: 8) 4.14 1.24
fp4 GPT-OSS 120B 1k/8k 275.5 26.8 b200 vllm (4 GPUs, conc: 16) 4.61 1.38
fp4 GPT-OSS 120B 1k/8k 267.9 27.2 b200 vllm (1 GPUs, conc: 4) 4.67 1.40
fp4 GPT-OSS 120B 1k/8k 109.7 67.4 mi300x vllm (2 GPUs, conc: 16) 4.75 1.42
fp4 GPT-OSS 120B 1k/8k 263.8 27.9 b200 vllm (2 GPUs, conc: 8) 4.81 1.44
fp4 GPT-OSS 120B 1k/8k 130.4 56.9 mi325x vllm (8 GPUs, conc: 64) 4.92 1.48
fp4 GPT-OSS 120B 1k/8k 123.1 60.0 mi325x vllm (2 GPUs, conc: 16) 5.15 1.55
fp4 GPT-OSS 120B 1k/8k 100.6 73.8 mi300x vllm (4 GPUs, conc: 32) 5.20 1.56
fp4 GPT-OSS 120B 1k/8k 187.2 39.7 h200 vllm (8 GPUs, conc: 32) 5.41 1.62
fp4 GPT-OSS 120B 1k/8k 184.3 40.1 h200 vllm (4 GPUs, conc: 16) 5.48 1.64
fp4 GPT-OSS 120B 1k/8k 180.0 41.3 h100 vllm (8 GPUs, conc: 32) 5.62 1.69
fp4 GPT-OSS 120B 1k/8k 177.1 41.7 h100 vllm (4 GPUs, conc: 16) 5.70 1.71
fp4 GPT-OSS 120B 1k/8k 168.8 43.7 h200 vllm (2 GPUs, conc: 8) 5.97 1.79
fp4 GPT-OSS 120B 1k/8k 161.1 45.8 h100 vllm (2 GPUs, conc: 8) 6.26 1.88
fp4 GPT-OSS 120B 1k/8k 158.8 45.9 h200 vllm (1 GPUs, conc: 4) 6.29 1.89
fp4 GPT-OSS 120B 1k/8k 238.7 30.5 mi355x vllm (1 GPUs, conc: 4) 6.42 1.92
fp4 GPT-OSS 120B 1k/8k 389.1 18.9 b200-trt trt (4 GPUs, conc: 8) 6.52 1.96
fp4 GPT-OSS 120B 1k/8k 379.4 19.2 b200-trt trt (2 GPUs, conc: 4) 6.61 1.98
fp4 GPT-OSS 120B 1k/8k 152.3 48.4 h200 trt (2 GPUs, conc: 8) 6.62 1.99
fp4 GPT-OSS 120B 1k/8k 147.4 49.4 h200 trt (1 GPUs, conc: 4) 6.77 2.03
fp4 GPT-OSS 120B 1k/8k 147.8 50.0 h200 trt (4 GPUs, conc: 16) 6.83 2.05
fp4 GPT-OSS 120B 1k/8k 363.4 20.4 b200-trt trt (8 GPUs, conc: 16) 6.99 2.10
fp4 GPT-OSS 120B 1k/8k 148.4 49.9 mi300x vllm (4 GPUs, conc: 16) 7.06 2.12
fp4 GPT-OSS 120B 1k/8k 140.4 52.8 mi300x vllm (2 GPUs, conc: 8) 7.48 2.24
fp4 GPT-OSS 120B 1k/8k 162.0 45.9 mi325x vllm (8 GPUs, conc: 32) 7.87 2.36
fp4 GPT-OSS 120B 1k/8k 157.0 47.0 mi325x vllm (2 GPUs, conc: 8) 8.11 2.43
fp4 GPT-OSS 120B 1k/8k 302.0 24.4 b200 vllm (4 GPUs, conc: 8) 8.40 2.52
fp4 GPT-OSS 120B 1k/8k 286.9 25.3 b200 vllm (2 GPUs, conc: 4) 8.73 2.62
fp4 GPT-OSS 120B 1k/8k 223.0 33.0 h200 vllm (4 GPUs, conc: 8) 9.05 2.72
fp4 GPT-OSS 120B 1k/8k 222.2 33.2 h200 vllm (8 GPUs, conc: 16) 9.09 2.73
fp4 GPT-OSS 120B 1k/8k 216.8 34.1 h100 vllm (8 GPUs, conc: 16) 9.32 2.80
fp4 GPT-OSS 120B 1k/8k 214.2 34.4 h100 vllm (4 GPUs, conc: 8) 9.43 2.83
fp4 GPT-OSS 120B 1k/8k 209.6 34.7 h200 vllm (2 GPUs, conc: 4) 9.53 2.86
fp4 GPT-OSS 120B 1k/8k 198.7 36.6 h100 vllm (2 GPUs, conc: 4) 10.0 3.01
fp4 GPT-OSS 120B 1k/8k 174.9 41.6 h200 trt (2 GPUs, conc: 4) 11.4 3.42
fp4 GPT-OSS 120B 1k/8k 176.0 41.9 h200 trt (4 GPUs, conc: 8) 11.5 3.44
fp4 GPT-OSS 120B 1k/8k 426.7 17.1 b200-trt trt (4 GPUs, conc: 4) 11.7 3.52
fp4 GPT-OSS 120B 1k/8k 172.8 42.1 mi300x vllm (2 GPUs, conc: 4) 12.0 3.59
fp4 GPT-OSS 120B 1k/8k 172.7 43.1 mi300x vllm (8 GPUs, conc: 16) 12.2 3.66
fp4 GPT-OSS 120B 1k/8k 404.8 18.2 b200-trt trt (8 GPUs, conc: 8) 12.5 3.76
fp4 GPT-OSS 120B 1k/8k 245.6 30.1 mi355x vllm (8 GPUs, conc: 16) 12.6 3.79
fp4 GPT-OSS 120B 1k/8k 245.3 30.1 mi355x vllm (4 GPUs, conc: 8) 12.7 3.80
fp4 GPT-OSS 120B 1k/8k 190.0 38.9 mi325x vllm (8 GPUs, conc: 16) 13.4 4.03
fp4 GPT-OSS 120B 1k/8k 153.3 48.1 mi300x vllm (4 GPUs, conc: 8) 13.6 4.09
fp4 GPT-OSS 120B 1k/8k 182.3 40.0 mi325x vllm (2 GPUs, conc: 4) 13.8 4.15
fp4 GPT-OSS 120B 1k/8k 259.1 28.4 h200 vllm (8 GPUs, conc: 8) 15.6 4.68
fp4 GPT-OSS 120B 1k/8k 256.5 28.7 h100 vllm (8 GPUs, conc: 8) 15.8 4.73
fp4 GPT-OSS 120B 1k/8k 252.7 28.8 h100 vllm (4 GPUs, conc: 4) 15.8 4.74
fp4 GPT-OSS 120B 1k/8k 252.6 28.8 h200 vllm (4 GPUs, conc: 4) 15.8 4.74
fp4 GPT-OSS 120B 1k/8k 314.7 23.1 b200 vllm (4 GPUs, conc: 4) 15.9 4.78
fp4 GPT-OSS 120B 1k/8k 306.2 24.1 b200 vllm (8 GPUs, conc: 8) 16.6 4.98
fp4 GPT-OSS 120B 1k/8k 208.2 35.0 h200 trt (4 GPUs, conc: 4) 19.2 5.76
fp4 GPT-OSS 120B 1k/8k 201.8 36.8 mi300x vllm (8 GPUs, conc: 8) 20.9 6.27
fp4 GPT-OSS 120B 1k/8k 191.1 38.6 h200 trt (8 GPUs, conc: 8) 21.1 6.34
fp4 GPT-OSS 120B 1k/8k 459.7 15.8 b200-trt trt (8 GPUs, conc: 4) 21.8 6.55
fp4 GPT-OSS 120B 1k/8k 274.0 26.9 mi355x vllm (8 GPUs, conc: 8) 22.7 6.80
fp4 GPT-OSS 120B 1k/8k 262.3 27.8 mi355x vllm (4 GPUs, conc: 4) 23.4 7.01
fp4 GPT-OSS 120B 1k/8k 172.6 42.1 mi300x vllm (4 GPUs, conc: 4) 24.0 7.19
fp4 GPT-OSS 120B 1k/8k 212.7 34.6 mi325x vllm (8 GPUs, conc: 8) 24.0 7.20
fp4 GPT-OSS 120B 1k/8k 283.1 25.7 h200 vllm (8 GPUs, conc: 4) 28.2 8.46
fp4 GPT-OSS 120B 1k/8k 281.5 25.8 h100 vllm (8 GPUs, conc: 4) 28.4 8.52
fp4 GPT-OSS 120B 1k/8k 332.5 21.9 b200 vllm (8 GPUs, conc: 4) 30.2 9.05
fp4 GPT-OSS 120B 1k/8k 240.0 30.3 h200 trt (8 GPUs, conc: 4) 33.3 10.00
fp4 GPT-OSS 120B 1k/8k 216.8 33.6 mi300x vllm (8 GPUs, conc: 4) 38.2 11.45
fp4 GPT-OSS 120B 1k/8k 284.7 25.6 mi355x vllm (8 GPUs, conc: 4) 43.1 12.92
fp4 GPT-OSS 120B 1k/8k 225.9 32.2 mi325x vllm (8 GPUs, conc: 4) 44.6 13.38
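
For orientation, a Wh-per-query figure is ultimately power multiplied by the share of machine time one query occupies. The sketch below is one simple way to express that, assuming a steady state in which `conc` queries are in flight and each takes `e2el_s` seconds. It is not necessarily how `process_results.py` derives the `Wh/Q` column, and the per-GPU power passed in is a hypothetical input rather than a value from this Gist.

```python
def wh_per_query(watts_per_gpu: float, num_gpus: int,
                 e2el_s: float, conc: int) -> float:
    """Estimate energy per query in watt-hours.

    Assumes `conc` queries share the machine for `e2el_s` seconds each,
    so one query accounts for 1/conc of the total power over that time.
    """
    total_watts = watts_per_gpu * num_gpus
    joules_per_query = total_watts * e2el_s / conc
    return joules_per_query / 3600.0  # 1 Wh = 3600 J


if __name__ == "__main__":
    # Hypothetical round numbers, chosen only to make the arithmetic easy:
    # 8 GPUs at 1000 W each, 36 s per query, 4 queries in flight.
    print(wh_per_query(1000.0, 8, 36.0, 4))  # 20.0
```

This form makes the trend in the table easy to see: at a fixed GPU count, lower concurrency buys higher interactivity but spreads the same power over fewer queries.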
Workload Intvty (tok/s) E2EL (s) Details Wh/Q PS5 min
fp4 GPT-OSS 120B 8k/1k 45.8 20.8 b200-trt trt (1 GPUs, conc: 128) 0.11 0.03
fp4 GPT-OSS 120B 8k/1k 93.1 10.5 b200-trt trt (2 GPUs, conc: 128, dp_attn) 0.11 0.03
fp4 GPT-OSS 120B 8k/1k 44.3 21.4 b200 vllm (1 GPUs, conc: 128) 0.11 0.03
fp4 GPT-OSS 120B 8k/1k 74.8 12.7 b200-trt trt (2 GPUs, conc: 128) 0.14 0.04
fp4 GPT-OSS 120B 8k/1k 72.4 13.1 b200-trt trt (1 GPUs, conc: 64) 0.14 0.04
fp4 GPT-OSS 120B 8k/1k 145.7 6.7 b200-trt trt (2 GPUs, conc: 64, dp_attn) 0.14 0.04
fp4 GPT-OSS 120B 8k/1k 69.8 13.6 b200 vllm (1 GPUs, conc: 64) 0.14 0.04
fp4 GPT-OSS 120B 8k/1k 67.6 14.1 b200 vllm (2 GPUs, conc: 128) 0.15 0.04
fp4 GPT-OSS 120B 8k/1k 112.8 8.4 b200-trt trt (2 GPUs, conc: 64) 0.18 0.05
fp4 GPT-OSS 120B 8k/1k 103.8 9.2 b200 vllm (2 GPUs, conc: 64) 0.20 0.06
fp4 GPT-OSS 120B 8k/1k 103.1 9.2 b200 vllm (1 GPUs, conc: 32) 0.20 0.06
fp4 GPT-OSS 120B 8k/1k 30.4 31.4 mi355x vllm (1 GPUs, conc: 128) 0.20 0.06
fp4 GPT-OSS 120B 8k/1k 51.0 18.7 mi355x vllm (1 GPUs, conc: 64) 0.24 0.07
fp4 GPT-OSS 120B 8k/1k 162.9 5.8 b200-trt trt (2 GPUs, conc: 32) 0.25 0.08
fp4 GPT-OSS 120B 8k/1k 147.0 6.4 b200 vllm (1 GPUs, conc: 16) 0.28 0.08
fp4 GPT-OSS 120B 8k/1k 145.9 6.5 b200 vllm (2 GPUs, conc: 32) 0.28 0.08
fp4 GPT-OSS 120B 8k/1k 32.7 34.4 h200 trt (1 GPUs, conc: 64) 0.29 0.09
fp4 GPT-OSS 120B 8k/1k 139.5 6.8 b200 vllm (4 GPUs, conc: 64) 0.29 0.09
fp4 GPT-OSS 120B 8k/1k 78.6 12.2 mi355x vllm (1 GPUs, conc: 32) 0.32 0.10
fp4 GPT-OSS 120B 8k/1k 47.3 20.2 h200 trt (1 GPUs, conc: 32) 0.34 0.10
fp4 GPT-OSS 120B 8k/1k 23.3 41.1 h200 vllm (1 GPUs, conc: 64) 0.35 0.10
fp4 GPT-OSS 120B 8k/1k 45.6 21.0 h200 trt (2 GPUs, conc: 64) 0.35 0.11
fp4 GPT-OSS 120B 8k/1k 219.3 4.3 b200-trt trt (2 GPUs, conc: 16) 0.37 0.11
fp4 GPT-OSS 120B 8k/1k 21.7 44.0 mi300x vllm (1 GPUs, conc: 64) 0.38 0.12
fp4 GPT-OSS 120B 8k/1k 214.5 4.5 b200-trt trt (4 GPUs, conc: 32) 0.38 0.12
fp4 GPT-OSS 120B 8k/1k 26.3 36.3 mi325x vllm (1 GPUs, conc: 64) 0.39 0.12
fp4 GPT-OSS 120B 8k/1k 39.7 24.2 h200 vllm (1 GPUs, conc: 32) 0.41 0.12
fp4 GPT-OSS 120B 8k/1k 39.1 24.5 h200 vllm (2 GPUs, conc: 64) 0.41 0.12
fp4 GPT-OSS 120B 8k/1k 196.9 4.8 b200 vllm (1 GPUs, conc: 8) 0.41 0.12
fp4 GPT-OSS 120B 8k/1k 193.2 4.9 b200 vllm (2 GPUs, conc: 16) 0.42 0.13
fp4 GPT-OSS 120B 8k/1k 115.5 8.2 mi355x vllm (1 GPUs, conc: 16) 0.43 0.13
fp4 GPT-OSS 120B 8k/1k 37.3 25.7 h100 vllm (2 GPUs, conc: 64) 0.43 0.13
fp4 GPT-OSS 120B 8k/1k 188.4 5.1 b200 vllm (4 GPUs, conc: 32) 0.44 0.13
fp4 GPT-OSS 120B 8k/1k 70.5 13.5 h200 trt (1 GPUs, conc: 16) 0.46 0.14
fp4 GPT-OSS 120B 8k/1k 35.6 27.1 mi300x vllm (2 GPUs, conc: 64) 0.47 0.14
fp4 GPT-OSS 120B 8k/1k 69.2 13.8 h200 trt (2 GPUs, conc: 32) 0.47 0.14
fp4 GPT-OSS 120B 8k/1k 34.8 27.7 mi300x vllm (1 GPUs, conc: 32) 0.49 0.15
fp4 GPT-OSS 120B 8k/1k 42.2 22.7 mi325x vllm (1 GPUs, conc: 32) 0.49 0.15
fp4 GPT-OSS 120B 8k/1k 62.0 15.4 h200 vllm (1 GPUs, conc: 16) 0.53 0.16
fp4 GPT-OSS 120B 8k/1k 61.8 15.6 h200 vllm (2 GPUs, conc: 32) 0.53 0.16
fp4 GPT-OSS 120B 8k/1k 61.1 15.7 h200 vllm (4 GPUs, conc: 64) 0.53 0.16
fp4 GPT-OSS 120B 8k/1k 61.1 15.7 h200 trt (4 GPUs, conc: 64) 0.53 0.16
fp4 GPT-OSS 120B 8k/1k 59.2 16.2 h100 vllm (2 GPUs, conc: 32) 0.55 0.17
fp4 GPT-OSS 120B 8k/1k 58.1 16.5 h100 vllm (4 GPUs, conc: 64) 0.56 0.17
fp4 GPT-OSS 120B 8k/1k 290.5 3.3 b200-trt trt (2 GPUs, conc: 8) 0.56 0.17
fp4 GPT-OSS 120B 8k/1k 285.1 3.3 b200-trt trt (4 GPUs, conc: 16) 0.58 0.17
fp4 GPT-OSS 120B 8k/1k 55.7 17.2 mi300x vllm (2 GPUs, conc: 32) 0.61 0.18
fp4 GPT-OSS 120B 8k/1k 164.8 5.9 mi355x vllm (1 GPUs, conc: 8) 0.61 0.18
fp4 GPT-OSS 120B 8k/1k 65.0 14.6 mi325x vllm (1 GPUs, conc: 16) 0.63 0.19
fp4 GPT-OSS 120B 8k/1k 52.6 18.1 mi300x vllm (1 GPUs, conc: 16) 0.64 0.19
fp4 GPT-OSS 120B 8k/1k 102.2 9.4 h200 trt (1 GPUs, conc: 8) 0.64 0.19
fp4 GPT-OSS 120B 8k/1k 251.5 3.8 b200 vllm (1 GPUs, conc: 4) 0.65 0.19
fp4 GPT-OSS 120B 8k/1k 243.1 3.9 b200 vllm (2 GPUs, conc: 8) 0.67 0.20
fp4 GPT-OSS 120B 8k/1k 97.4 9.9 h200 vllm (4 GPUs, conc: 32) 0.67 0.20
fp4 GPT-OSS 120B 8k/1k 96.1 9.9 h200 vllm (2 GPUs, conc: 16) 0.68 0.20
fp4 GPT-OSS 120B 8k/1k 95.3 9.9 h200 trt (2 GPUs, conc: 16) 0.68 0.21
fp4 GPT-OSS 120B 8k/1k 49.2 19.5 mi300x vllm (4 GPUs, conc: 64) 0.68 0.21
fp4 GPT-OSS 120B 8k/1k 237.2 4.0 b200 vllm (4 GPUs, conc: 16) 0.69 0.21
fp4 GPT-OSS 120B 8k/1k 95.2 10.2 h200 vllm (1 GPUs, conc: 8) 0.70 0.21
fp4 GPT-OSS 120B 8k/1k 93.4 10.3 h100 vllm (4 GPUs, conc: 32) 0.70 0.21
fp4 GPT-OSS 120B 8k/1k 90.9 10.5 h100 vllm (2 GPUs, conc: 16) 0.72 0.22
fp4 GPT-OSS 120B 8k/1k 91.2 10.5 h200 trt (4 GPUs, conc: 32) 0.72 0.22
fp4 GPT-OSS 120B 8k/1k 79.3 12.2 mi300x vllm (1 GPUs, conc: 8) 0.86 0.26
fp4 GPT-OSS 120B 8k/1k 79.0 12.1 mi300x vllm (2 GPUs, conc: 16) 0.86 0.26
fp4 GPT-OSS 120B 8k/1k 359.3 2.6 b200-trt trt (2 GPUs, conc: 4) 0.91 0.27
fp4 GPT-OSS 120B 8k/1k 90.6 10.7 mi325x vllm (1 GPUs, conc: 8) 0.92 0.28
fp4 GPT-OSS 120B 8k/1k 353.2 2.7 b200-trt trt (4 GPUs, conc: 8) 0.93 0.28
fp4 GPT-OSS 120B 8k/1k 217.6 4.4 mi355x vllm (1 GPUs, conc: 4) 0.93 0.28
fp4 GPT-OSS 120B 8k/1k 141.1 6.7 h200 vllm (4 GPUs, conc: 16) 0.93 0.28
fp4 GPT-OSS 120B 8k/1k 139.9 6.9 h200 vllm (2 GPUs, conc: 8) 0.95 0.28
fp4 GPT-OSS 120B 8k/1k 71.7 13.4 mi300x vllm (4 GPUs, conc: 32) 0.95 0.28
fp4 GPT-OSS 120B 8k/1k 135.0 7.0 h100 vllm (4 GPUs, conc: 16) 0.97 0.29
fp4 GPT-OSS 120B 8k/1k 136.4 7.1 h200 vllm (1 GPUs, conc: 4) 0.97 0.29
fp4 GPT-OSS 120B 8k/1k 132.4 7.3 h100 vllm (2 GPUs, conc: 8) 1.00 0.30
fp4 GPT-OSS 120B 8k/1k 130.8 7.3 h200 trt (1 GPUs, conc: 4) 1.00 0.30
fp4 GPT-OSS 120B 8k/1k 131.1 7.3 h200 trt (2 GPUs, conc: 8) 1.00 0.30
fp4 GPT-OSS 120B 8k/1k 129.6 7.4 h200 vllm (8 GPUs, conc: 32) 1.01 0.30
fp4 GPT-OSS 120B 8k/1k 120.4 7.9 h200 trt (4 GPUs, conc: 16) 1.08 0.32
fp4 GPT-OSS 120B 8k/1k 283.2 3.4 b200 vllm (4 GPUs, conc: 8) 1.15 0.35
fp4 GPT-OSS 120B 8k/1k 279.5 3.4 b200 vllm (2 GPUs, conc: 4) 1.16 0.35
fp4 GPT-OSS 120B 8k/1k 116.8 8.3 mi300x vllm (2 GPUs, conc: 8) 1.17 0.35
fp4 GPT-OSS 120B 8k/1k 111.9 8.6 mi300x vllm (1 GPUs, conc: 4) 1.22 0.37
fp4 GPT-OSS 120B 8k/1k 128.6 7.5 mi325x vllm (1 GPUs, conc: 4) 1.29 0.39
fp4 GPT-OSS 120B 8k/1k 122.7 8.0 mi325x vllm (2 GPUs, conc: 8) 1.38 0.41
fp4 GPT-OSS 120B 8k/1k 190.7 5.1 h200 vllm (4 GPUs, conc: 8) 1.38 0.41
fp4 GPT-OSS 120B 8k/1k 96.1 9.9 mi300x vllm (4 GPUs, conc: 16) 1.40 0.42
fp4 GPT-OSS 120B 8k/1k 184.0 5.3 h200 vllm (2 GPUs, conc: 4) 1.44 0.43
fp4 GPT-OSS 120B 8k/1k 182.5 5.3 h100 vllm (4 GPUs, conc: 8) 1.44 0.43
fp4 GPT-OSS 120B 8k/1k 177.6 5.3 h200 vllm (8 GPUs, conc: 16) 1.47 0.44
fp4 GPT-OSS 120B 8k/1k 172.7 5.6 h100 vllm (2 GPUs, conc: 4) 1.52 0.46
fp4 GPT-OSS 120B 8k/1k 171.1 5.6 h100 vllm (8 GPUs, conc: 16) 1.53 0.46
fp4 GPT-OSS 120B 8k/1k 413.8 2.3 b200-trt trt (4 GPUs, conc: 4) 1.58 0.47
fp4 GPT-OSS 120B 8k/1k 159.9 6.0 h200 trt (2 GPUs, conc: 4) 1.63 0.49
fp4 GPT-OSS 120B 8k/1k 155.7 6.1 h200 trt (4 GPUs, conc: 8) 1.68 0.50
fp4 GPT-OSS 120B 8k/1k 369.9 2.6 b200-trt trt (8 GPUs, conc: 8) 1.77 0.53
fp4 GPT-OSS 120B 8k/1k 149.8 6.4 mi300x vllm (2 GPUs, conc: 4) 1.82 0.54
fp4 GPT-OSS 120B 8k/1k 136.4 7.1 mi300x vllm (4 GPUs, conc: 8) 2.00 0.60
fp4 GPT-OSS 120B 8k/1k 138.2 7.0 mi300x vllm (8 GPUs, conc: 16) 2.00 0.60
fp4 GPT-OSS 120B 8k/1k 162.1 5.9 mi325x vllm (2 GPUs, conc: 4) 2.05 0.62
fp4 GPT-OSS 120B 8k/1k 156.7 6.0 mi325x vllm (8 GPUs, conc: 16) 2.11 0.63
fp4 GPT-OSS 120B 8k/1k 306.3 3.1 b200 vllm (4 GPUs, conc: 4) 2.11 0.63
fp4 GPT-OSS 120B 8k/1k 233.0 4.1 h200 vllm (4 GPUs, conc: 4) 2.25 0.68
fp4 GPT-OSS 120B 8k/1k 227.8 4.2 h200 vllm (8 GPUs, conc: 8) 2.30 0.69
fp4 GPT-OSS 120B 8k/1k 226.9 4.2 h100 vllm (4 GPUs, conc: 4) 2.31 0.69
fp4 GPT-OSS 120B 8k/1k 225.8 4.3 h100 vllm (8 GPUs, conc: 8) 2.33 0.70
fp4 GPT-OSS 120B 8k/1k 143.2 6.8 mi325x vllm (4 GPUs, conc: 8) 2.34 0.70
fp4 GPT-OSS 120B 8k/1k 194.9 4.9 h200 trt (4 GPUs, conc: 4) 2.67 0.80
fp4 GPT-OSS 120B 8k/1k 188.2 5.2 mi300x vllm (4 GPUs, conc: 4) 2.95 0.89
fp4 GPT-OSS 120B 8k/1k 431.9 2.2 b200-trt trt (8 GPUs, conc: 4) 3.04 0.91
fp4 GPT-OSS 120B 8k/1k 172.5 5.6 h200 trt (8 GPUs, conc: 8) 3.04 0.91
fp4 GPT-OSS 120B 8k/1k 173.1 5.6 mi300x vllm (8 GPUs, conc: 8) 3.19 0.96
fp4 GPT-OSS 120B 8k/1k 246.5 3.9 mi355x vllm (8 GPUs, conc: 8) 3.27 0.98
fp4 GPT-OSS 120B 8k/1k 244.0 3.9 mi355x vllm (4 GPUs, conc: 4) 3.32 0.99
fp4 GPT-OSS 120B 8k/1k 191.8 5.0 mi325x vllm (8 GPUs, conc: 8) 3.44 1.03
fp4 GPT-OSS 120B 8k/1k 263.1 3.6 h100 vllm (8 GPUs, conc: 4) 3.97 1.19
fp4 GPT-OSS 120B 8k/1k 262.9 3.6 h200 vllm (8 GPUs, conc: 4) 3.98 1.19
fp4 GPT-OSS 120B 8k/1k 322.6 2.9 b200 vllm (8 GPUs, conc: 4) 4.02 1.21
fp4 GPT-OSS 120B 8k/1k 165.2 5.8 mi325x vllm (4 GPUs, conc: 4) 4.04 1.21
fp4 GPT-OSS 120B 8k/1k 225.7 4.3 h200 trt (8 GPUs, conc: 4) 4.67 1.40
fp4 GPT-OSS 120B 8k/1k 206.0 4.6 mi300x vllm (8 GPUs, conc: 4) 5.25 1.58
fp4 GPT-OSS 120B 8k/1k 273.0 3.5 mi355x vllm (8 GPUs, conc: 4) 5.88 1.76
fp4 GPT-OSS 120B 8k/1k 216.0 4.4 mi325x vllm (8 GPUs, conc: 4) 6.06 1.82
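
The JSON records that follow carry everything needed for the input/output throughput ratio check that `check.py` is described as performing. The sketch below is not that script, just a minimal version of the same idea. It assumes that in disaggregated records `input_tput_per_gpu` is normalised by `num_prefill_gpu` and `output_tput_per_gpu` by `num_decode_gpu`; that reading is inferred from the numbers themselves and is not documented here. `throughput_ratio_error` is an illustrative name.

```python
def throughput_ratio_error(rec: dict) -> float:
    """Relative gap between measured input/output throughput and isl/osl."""
    # Assumption: per-GPU figures are normalised by the prefill and decode GPU
    # counts respectively, so scale back to totals before comparing. Records
    # without these keys fall back to 1, which leaves the ratio unchanged.
    total_in = rec["input_tput_per_gpu"] * rec.get("num_prefill_gpu", 1)
    total_out = rec["output_tput_per_gpu"] * rec.get("num_decode_gpu", 1)
    measured = total_in / total_out
    expected = rec["isl"] / rec["osl"]
    return abs(measured - expected) / expected


if __name__ == "__main__":
    # Fields copied from the first record in the JSON below.
    sample = {
        "isl": 1024,
        "osl": 1024,
        "num_prefill_gpu": 16,
        "num_decode_gpu": 32,
        "input_tput_per_gpu": 6416.520648763243,
        "output_tput_per_gpu": 3209.667625131325,
    }
    err = throughput_ratio_error(sample)
    print("ok" if err < 0.05 else f"anomaly: {err:.1%} off")
```

The 5% threshold is an arbitrary choice for illustration; a real check would pick a tolerance to suit the noise in the low-concurrency runs.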
[
{
"hw": "gb200",
"conc": 4096,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 2,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 16,
"num_decode_gpu": 32,
"tput_per_gpu": 4278.618633008631,
"output_tput_per_gpu": 3209.667625131325,
"input_tput_per_gpu": 6416.520648763243,
"mean_ttft": 4.0594120494029315,
"median_ttft": 1.121072229463607,
"std_ttft": 6.372709353511065,
"p99_ttft": 24.213953527775594,
"mean_tpot": 0.03280247992745935,
"mean_intvty": 30.485499944255377,
"median_tpot": 0.033811291526681636,
"median_intvty": 29.57591842390481,
"std_tpot": 0.0022621950483043625,
"std_intvty": 442.0485319113195,
"p99_tpot": 0.034867009720247394,
"p99_intvty": 28.680406149635953,
"mean_itl": 1.5977638690671125,
"median_itl": 1.6682301629334688,
"std_itl": 0.27062324387807934,
"p99_itl": 1.9583426561020316,
"mean_e2el": 34.280787857915925,
"median_e2el": 32.77699998789467,
"std_e2el": 6.559754606670555,
"p99_e2el": 55.63821561905089
},
{
"hw": "gb200",
"conc": 2048,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 10,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 40,
"num_decode_gpu": 32,
"tput_per_gpu": 6178.135380619163,
"output_tput_per_gpu": 1544.8506706034038,
"input_tput_per_gpu": 9884.76314863177,
"mean_ttft": 14.37463802805496,
"median_ttft": 12.951106991618872,
"std_ttft": 7.169695450006314,
"p99_ttft": 35.84106000921224,
"mean_tpot": 0.021113882238954162,
"mean_intvty": 47.36220410261856,
"median_tpot": 0.021393940481088246,
"median_intvty": 46.742207256488214,
"std_tpot": 0.0009060877723466099,
"std_intvty": 1103.6458393099972,
"p99_tpot": 0.021679906364232872,
"p99_intvty": 46.125660471014875,
"mean_itl": 1.0285547093681142,
"median_itl": 1.0607960890047252,
"std_itl": 0.1516292277771637,
"p99_itl": 1.175489416951314,
"mean_e2el": 33.816330829492564,
"median_e2el": 32.53014694270678,
"std_e2el": 7.370433424416038,
"p99_e2el": 55.94178578473162
},
{
"hw": "gb200",
"conc": 1024,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 10,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 40,
"num_decode_gpu": 32,
"tput_per_gpu": 4816.889398078987,
"output_tput_per_gpu": 1204.2126289706246,
"input_tput_per_gpu": 7707.030813365677,
"mean_ttft": 3.4166035368342818,
"median_ttft": 1.2882808176800609,
"std_ttft": 4.420128956945235,
"p99_ttft": 17.432914412734103,
"mean_tpot": 0.020026315547250588,
"mean_intvty": 49.93429758162829,
"median_tpot": 0.020334185889484123,
"median_intvty": 49.17826587378414,
"std_tpot": 0.0008671811123108214,
"std_intvty": 1153.1616473233018,
"p99_tpot": 0.020749540452772344,
"p99_intvty": 48.193838426257294,
"mean_itl": 0.974798356238614,
"median_itl": 1.0089515121653676,
"std_itl": 0.14501591483178752,
"p99_itl": 1.0936697078868747,
"mean_e2el": 21.861996859132795,
"median_e2el": 20.313978711608797,
"std_e2el": 4.6441392837949165,
"p99_e2el": 36.944439084003726
},
{
"hw": "gb200",
"conc": 8,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 4,
"num_decode_gpu": 4,
"tput_per_gpu": 617.3634404161527,
"output_tput_per_gpu": 137.35359250739145,
"input_tput_per_gpu": 1097.373288324914,
"mean_ttft": 1.409794175333809,
"median_ttft": 1.0516643594019115,
"std_ttft": 0.9286639272086951,
"p99_ttft": 3.9666541468678043,
"mean_tpot": 0.012405594975152857,
"mean_intvty": 80.60878998572001,
"median_tpot": 0.012331561314141837,
"median_intvty": 81.09273226036673,
"std_tpot": 0.00035622558215633534,
"std_intvty": 2807.2099537229024,
"p99_tpot": 0.012962362963303078,
"p99_intvty": 77.14642791835382,
"mean_itl": 0.012398346525175031,
"median_itl": 0.01247395807877183,
"std_itl": 0.0006679296721093565,
"p99_itl": 0.013335339110344648,
"mean_e2el": 12.805424300511367,
"median_e2el": 12.642208656528965,
"std_e2el": 1.0770451590143442,
"p99_e2el": 15.805625666505657
},
{
"hw": "gb200",
"conc": 32,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 4,
"num_decode_gpu": 4,
"tput_per_gpu": 1392.3701550447925,
"output_tput_per_gpu": 308.56704083306283,
"input_tput_per_gpu": 2476.173269256522,
"mean_ttft": 2.8429133319295943,
"median_ttft": 1.569999423576519,
"std_ttft": 3.4977662318625575,
"p99_ttft": 15.279852117639967,
"mean_tpot": 0.02126292136087345,
"mean_intvty": 47.030226140051035,
"median_tpot": 0.02178826510112661,
"median_intvty": 45.89626550616427,
"std_tpot": 0.0014177271442042333,
"std_intvty": 705.354344161406,
"p99_tpot": 0.022296361474308413,
"p99_intvty": 44.850367229302286,
"mean_itl": 0.021274579363371016,
"median_itl": 0.02172287437133491,
"std_itl": 0.002120219026638371,
"p99_itl": 0.02545290503185243,
"mean_e2el": 22.291335885605076,
"median_e2el": 21.486765656620264,
"std_e2el": 4.046726610035571,
"p99_e2el": 35.46527161488775
},
{
"hw": "gb200",
"conc": 4,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 4,
"num_decode_gpu": 4,
"tput_per_gpu": 373.33549125966925,
"output_tput_per_gpu": 83.7876568221312,
"input_tput_per_gpu": 662.8833256972073,
"mean_ttft": 1.0159053148934618,
"median_ttft": 0.8029270560946316,
"std_ttft": 0.4904499915695241,
"p99_ttft": 2.021822245661169,
"mean_tpot": 0.010441665989119386,
"mean_intvty": 95.7701578504846,
"median_tpot": 0.010441750406642691,
"median_intvty": 95.76938358570929,
"std_tpot": 7.472942707421537e-05,
"std_intvty": 13381.609349244427,
"p99_tpot": 0.01054028720291461,
"p99_intvty": 94.87407513179329,
"mean_itl": 0.010440534433254767,
"median_itl": 0.010449508670717478,
"std_itl": 0.00021594226969056442,
"p99_itl": 0.010737927183508873,
"mean_e2el": 10.769974423572421,
"median_e2el": 10.712829541647807,
"std_e2el": 0.6229327262623285,
"p99_e2el": 11.970187222193927
},
{
"hw": "gb200",
"conc": 16,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 4,
"num_decode_gpu": 4,
"tput_per_gpu": 965.495756500815,
"output_tput_per_gpu": 217.07601464386082,
"input_tput_per_gpu": 1713.9154983577691,
"mean_ttft": 1.6880999097309541,
"median_ttft": 0.9066470253746957,
"std_ttft": 1.858998179381785,
"p99_ttft": 7.95349451332353,
"mean_tpot": 0.01579618874722121,
"mean_intvty": 63.30640991966591,
"median_tpot": 0.016028176642352163,
"median_intvty": 62.39012847897141,
"std_tpot": 0.000642096023848851,
"std_intvty": 1557.399458737965,
"p99_tpot": 0.016519406986650432,
"p99_intvty": 60.53486065257151,
"mean_itl": 0.015804840622480835,
"median_itl": 0.01607325579971075,
"std_itl": 0.001164747747679348,
"p99_itl": 0.01866403456777333,
"mean_e2el": 16.30540421393234,
"median_e2el": 16.034422776196152,
"std_e2el": 2.3564627283605097,
"p99_e2el": 23.68057620668783
},
{
"hw": "gb200",
"conc": 512,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 4,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 16,
"num_decode_gpu": 48,
"tput_per_gpu": 945.6248952601251,
"output_tput_per_gpu": 630.7213919298459,
"input_tput_per_gpu": 1890.3354052509624,
"mean_ttft": 0.9192043186430965,
"median_ttft": 0.49568672850728035,
"std_ttft": 0.8264347256997125,
"p99_ttft": 3.1728693629102778,
"mean_tpot": 0.014765241935488293,
"mean_intvty": 67.7266247562458,
"median_tpot": 0.014800411611521146,
"median_intvty": 67.56568845839165,
"std_tpot": 0.00014206144341296015,
"std_intvty": 7039.207655331839,
"p99_tpot": 0.014961964205778125,
"p99_intvty": 66.83614438897082,
"mean_itl": 0.7189300954888933,
"median_itl": 0.7360187023878098,
"std_itl": 0.09645055566976125,
"p99_itl": 0.7881611580541358,
"mean_e2el": 14.516531319739078,
"median_e2el": 14.479597188532352,
"std_e2el": 1.2374311625251724,
"p99_e2el": 17.569088365873323
},
{
"hw": "gb200",
"conc": 4096,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 4,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 16,
"num_decode_gpu": 48,
"tput_per_gpu": 4528.158262807484,
"output_tput_per_gpu": 3019.434120988546,
"input_tput_per_gpu": 9054.330688264297,
"mean_ttft": 3.7107530068997674,
"median_ttft": 0.6848360581789166,
"std_ttft": 5.733402996955814,
"p99_ttft": 22.25122923822607,
"mean_tpot": 0.022214963485737715,
"mean_intvty": 45.01470374425835,
"median_tpot": 0.022798849868907147,
"median_intvty": 43.8618617057429,
"std_tpot": 0.0015261035797248476,
"std_intvty": 655.2635176835751,
"p99_tpot": 0.023664093977379064,
"p99_intvty": 42.25811480278595,
"mean_itl": 1.082089525675462,
"median_itl": 1.120221873279661,
"std_itl": 0.19500577552618178,
"p99_itl": 1.4697376866359266,
"mean_e2el": 24.17825435021057,
"median_e2el": 22.410813255701214,
"std_e2el": 5.904306315687309,
"p99_e2el": 44.19642562753055
},
{
"hw": "gb200",
"conc": 2048,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 4,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 16,
"num_decode_gpu": 48,
"tput_per_gpu": 2972.5471103387513,
"output_tput_per_gpu": 1981.7564307490893,
"input_tput_per_gpu": 5944.919149107737,
"mean_ttft": 2.0744397656439104,
"median_ttft": 0.5666985029820353,
"std_ttft": 2.9234307132580444,
"p99_ttft": 11.039596027904192,
"mean_tpot": 0.017785910902460712,
"mean_intvty": 56.22427805267191,
"median_tpot": 0.017953050814212,
"median_intvty": 55.70083939206475,
"std_tpot": 0.0004796368694287533,
"std_intvty": 2084.910614129807,
"p99_tpot": 0.018330361796258604,
"p99_intvty": 54.554296915411086,
"mean_itl": 0.8663720697900995,
"median_itl": 0.8847042026463896,
"std_itl": 0.12313044340578266,
"p99_itl": 1.0115102106519045,
"mean_e2el": 18.453271301539097,
"median_e2el": 17.585345333907753,
"std_e2el": 3.107245321742571,
"p99_e2el": 28.259716657106768
},
{
"hw": "gb200",
"conc": 8192,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 4,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 16,
"num_decode_gpu": 48,
"tput_per_gpu": 5208.865460841506,
"output_tput_per_gpu": 3471.315684450468,
"input_tput_per_gpu": 10421.514790014622,
"mean_ttft": 18.67925368150204,
"median_ttft": 16.92725436296314,
"std_ttft": 8.68993834196423,
"p99_ttft": 48.70451707869768,
"mean_tpot": 0.025343497762427735,
"mean_intvty": 39.45785263636817,
"median_tpot": 0.02587379518315719,
"median_intvty": 38.64914261402827,
"std_tpot": 0.0035206145684322458,
"std_intvty": 284.04131737866066,
"p99_tpot": 0.03185785187152452,
"p99_intvty": 31.3894359240784,
"mean_itl": 1.2340213837140477,
"median_itl": 1.202903261873871,
"std_itl": 0.4772133335725816,
"p99_itl": 2.8715267923753705,
"mean_e2el": 42.0043967642369,
"median_e2el": 41.093190079322085,
"std_e2el": 9.085788835125367,
"p99_e2el": 71.25736346791963
},
{
"hw": "gb200",
"conc": 1024,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 4,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 16,
"num_decode_gpu": 48,
"tput_per_gpu": 1702.1019698009002,
"output_tput_per_gpu": 1135.2575342623293,
"input_tput_per_gpu": 3402.6352764166127,
"mean_ttft": 1.5374525961106884,
"median_ttft": 0.517270227195695,
"std_ttft": 1.8832403525122978,
"p99_ttft": 6.760636934894137,
"mean_tpot": 0.015722519270752917,
"mean_intvty": 63.603038595742305,
"median_tpot": 0.015773389865921375,
"median_intvty": 63.39791309923263,
"std_tpot": 0.00020636709836954477,
"std_intvty": 4845.733684781886,
"p99_tpot": 0.015979969503659805,
"p99_intvty": 62.57834220340505,
"mean_itl": 0.765288075803214,
"median_itl": 0.7812047712504864,
"std_itl": 0.10546839815192023,
"p99_itl": 0.8595614690426736,
"mean_e2el": 16.02232296762322,
"median_e2el": 15.57143703661859,
"std_e2el": 2.0981124311614665,
"p99_e2el": 21.879631680240855
},
{
"hw": "gb200",
"conc": 4,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 16,
"tput_per_gpu": 217.6100368420877,
"output_tput_per_gpu": 30.523877843853302,
"input_tput_per_gpu": 965.9546728350253,
"mean_ttft": 0.35208846549503503,
"median_ttft": 0.28792229061946273,
"std_ttft": 0.15516667677412294,
"p99_ttft": 0.732905226261355,
"mean_tpot": 0.007574944095937089,
"mean_intvty": 132.0141755945581,
"median_tpot": 0.0076449990736158,
"median_intvty": 130.80446320146345,
"std_tpot": 0.00022328538581642247,
"std_intvty": 4478.573446907831,
"p99_tpot": 0.007846967199330467,
"p99_intvty": 127.43776985397923,
"mean_itl": 0.37131636241069577,
"median_itl": 0.38255938375368714,
"std_itl": 0.04102878109104923,
"p99_itl": 0.393319065310061,
"mean_e2el": 7.425664985366166,
"median_e2el": 7.524429407902062,
"std_e2el": 0.4252198315302799,
"p99_e2el": 8.07544023828581
},
{
"hw": "gb200",
"conc": 8,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 16,
"tput_per_gpu": 412.143843280104,
"output_tput_per_gpu": 57.30967550711122,
"input_tput_per_gpu": 1831.480514372075,
"mean_ttft": 0.5069185924134217,
"median_ttft": 0.3793605016544461,
"std_ttft": 0.3233784530012588,
"p99_ttft": 1.42577438860666,
"mean_tpot": 0.007827811523369385,
"mean_intvty": 127.74962670148224,
"median_tpot": 0.007881704532582981,
"median_intvty": 126.87610857093134,
"std_tpot": 0.00013103183773917787,
"std_intvty": 7631.7329990481,
"p99_tpot": 0.007948917777446642,
"p99_intvty": 125.80328895051433,
"mean_itl": 0.3822019593004795,
"median_itl": 0.39454076904803514,
"std_itl": 0.04929419825259172,
"p99_itl": 0.3977893521822989,
"mean_e2el": 7.701870356302242,
"median_e2el": 7.666022487450391,
"std_e2el": 0.5754033251295854,
"p99_e2el": 9.021973774842918
},
{
"hw": "gb200",
"conc": 32,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 16,
"tput_per_gpu": 1293.1976657896728,
"output_tput_per_gpu": 179.11821772893902,
"input_tput_per_gpu": 5749.515458032608,
"mean_ttft": 1.082876710925484,
"median_ttft": 0.6965431980788708,
"std_ttft": 1.212027696474345,
"p99_ttft": 5.4800813592830675,
"mean_tpot": 0.009261366028170182,
"mean_intvty": 107.97543223735164,
"median_tpot": 0.009294762335339848,
"median_intvty": 107.58747388277752,
"std_tpot": 0.00021144527289392834,
"std_intvty": 4729.356141726804,
"p99_tpot": 0.009588429796164076,
"p99_intvty": 104.29236290597419,
"mean_itl": 0.4509536072352316,
"median_itl": 0.46397397154942155,
"std_itl": 0.06349973751727211,
"p99_itl": 0.5315911201899871,
"mean_e2el": 9.549530582592705,
"median_e2el": 9.209197696531191,
"std_e2el": 1.337221958141556,
"p99_e2el": 13.981285048103889
},
{
"hw": "gb200",
"conc": 64,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 16,
"tput_per_gpu": 1868.1853921776353,
"output_tput_per_gpu": 261.1438817027295,
"input_tput_per_gpu": 8296.351434077258,
"mean_ttft": 2.5716690307701358,
"median_ttft": 1.811208988307044,
"std_ttft": 2.470225423307189,
"p99_ttft": 11.534478100528942,
"mean_tpot": 0.011561046616368701,
"mean_intvty": 86.49735903531005,
"median_tpot": 0.01173782559353714,
"median_intvty": 85.1946548388486,
"std_tpot": 0.0007103119677333631,
"std_intvty": 1407.8321152198014,
"p99_tpot": 0.01254047834691132,
"p99_intvty": 79.74177478216346,
"mean_itl": 0.5624637108804805,
"median_itl": 0.589180750772357,
"std_itl": 0.08972705467456576,
"p99_itl": 0.7099941460043192,
"mean_e2el": 13.253206338906603,
"median_e2el": 12.639715885743499,
"std_e2el": 2.6760704863866267,
"p99_e2el": 22.8390625726711
},
{
"hw": "gb200",
"conc": 2048,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 3,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 24,
"num_decode_gpu": 48,
"tput_per_gpu": 2273.035502829099,
"output_tput_per_gpu": 1704.8268295087219,
"input_tput_per_gpu": 3409.4528494698534,
"mean_ttft": 2.173086327583178,
"median_ttft": 0.923469693865627,
"std_ttft": 2.435319027688263,
"p99_ttft": 8.594428160078824,
"mean_tpot": 0.020833040576127223,
"mean_intvty": 48.00067452208149,
"median_tpot": 0.020869683186570197,
"median_intvty": 47.91639581014377,
"std_tpot": 0.00034376288791396225,
"std_intvty": 2908.9818452138506,
"p99_tpot": 0.021504136305171034,
"p99_intvty": 46.50268142875996,
"mean_itl": 1.0149396608436485,
"median_itl": 1.0127411750145257,
"std_itl": 0.1428828313824563,
"p99_itl": 1.2200002679880708,
"mean_e2el": 21.360599801532498,
"median_e2el": 20.811633503995836,
"std_e2el": 2.6407595115317326,
"p99_e2el": 28.635953118805773
},
{
"hw": "gb200",
"conc": 4096,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 3,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 24,
"num_decode_gpu": 48,
"tput_per_gpu": 3889.490161926362,
"output_tput_per_gpu": 2917.757276774019,
"input_tput_per_gpu": 5832.955932231048,
"mean_ttft": 3.2228150604186796,
"median_ttft": 1.1509118156973273,
"std_ttft": 4.272624738855067,
"p99_ttft": 16.09225718161091,
"mean_tpot": 0.02351029978501919,
"mean_intvty": 42.53454907611182,
"median_tpot": 0.02369491915970625,
"median_intvty": 42.20314039731027,
"std_tpot": 0.0007693366400441444,
"std_intvty": 1299.8211029473653,
"p99_tpot": 0.024604888143947488,
"p99_intvty": 40.64233066818425,
"mean_itl": 1.1451732408386102,
"median_itl": 1.1633540452457964,
"std_itl": 0.20137232310378922,
"p99_itl": 1.7198904642928394,
"mean_e2el": 24.883531948000904,
"median_e2el": 23.598644719924778,
"std_e2el": 4.5238368131830855,
"p99_e2el": 39.057467378573484
},
{
"hw": "gb200",
"conc": 1024,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 3,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 24,
"num_decode_gpu": 48,
"tput_per_gpu": 577.5571788593842,
"output_tput_per_gpu": 433.3674886702713,
"input_tput_per_gpu": 865.9365592376099,
"mean_ttft": 24.27998280096772,
"median_ttft": 0.6576457459013909,
"std_ttft": 41.104692271584916,
"p99_ttft": 151.22405191967263,
"mean_tpot": 0.019829441813251195,
"mean_intvty": 50.43006300518966,
"median_tpot": 0.019859606724720976,
"median_intvty": 50.353464389363424,
"std_tpot": 0.0002649246667632852,
"std_intvty": 3774.6579517018604,
"p99_tpot": 0.020417650269181357,
"p99_intvty": 48.977232287566984,
"mean_itl": 0.9653042342743093,
"median_itl": 0.9830342491623014,
"std_itl": 0.13166089808484807,
"p99_itl": 1.0928751633223142,
"mean_e2el": 42.55062777281237,
"median_e2el": 19.928533124970272,
"std_e2el": 41.00853908724329,
"p99_e2el": 170.2791048961738
},
{
"hw": "gb200",
"conc": 15000,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 4,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 16,
"num_decode_gpu": 32,
"tput_per_gpu": 7058.240294554526,
"output_tput_per_gpu": 5292.975747977568,
"input_tput_per_gpu": 10588.769387708444,
"mean_ttft": 33.981886163931456,
"median_ttft": 29.44833537749946,
"std_ttft": 14.554088927469879,
"p99_ttft": 82.99984531727618,
"mean_tpot": 0.04587534947812232,
"mean_intvty": 21.7981990628081,
"median_tpot": 0.04736373526688202,
"median_intvty": 21.113199674081166,
"std_tpot": 0.007845016383478254,
"std_intvty": 127.46945973319036,
"p99_tpot": 0.05843079201985693,
"p99_intvty": 17.114263993891495,
"mean_itl": 2.2335955183835323,
"median_itl": 2.0644799252040684,
"std_itl": 1.5627868388317807,
"p99_itl": 7.072170960297807,
"mean_e2el": 76.20967707991602,
"median_e2el": 74.69947930565104,
"std_e2el": 16.292206136036533,
"p99_e2el": 125.08947688786883
},
{
"hw": "gb200",
"conc": 8192,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 4,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 16,
"num_decode_gpu": 32,
"tput_per_gpu": 6429.690319625121,
"output_tput_per_gpu": 4820.51622335467,
"input_tput_per_gpu": 9648.038512166024,
"mean_ttft": 10.531389517558683,
"median_ttft": 6.206002830993384,
"std_ttft": 11.254021478860853,
"p99_ttft": 46.657765892082814,
"mean_tpot": 0.03831360522577669,
"mean_intvty": 26.100388989946016,
"median_tpot": 0.03978744076641734,
"median_intvty": 25.133559252296813,
"std_tpot": 0.005567185543484171,
"std_intvty": 179.6239755598588,
"p99_tpot": 0.04546956101251224,
"p99_intvty": 21.992734869923677,
"mean_itl": 1.865548199692893,
"median_itl": 1.8421797938644886,
"std_itl": 0.674928761112424,
"p99_itl": 3.788474251898005,
"mean_e2el": 45.79348411997684,
"median_e2el": 42.84485572390258,
"std_e2el": 11.12247693206909,
"p99_e2el": 83.45429676168132
},
{
"hw": "gb200",
"conc": 12000,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 4,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 16,
"num_decode_gpu": 32,
"tput_per_gpu": 7036.077007114046,
"output_tput_per_gpu": 5276.052478721792,
"input_tput_per_gpu": 10556.126063898553,
"mean_ttft": 19.59586006073137,
"median_ttft": 14.512689379043877,
"std_ttft": 13.072154615474833,
"p99_ttft": 63.17042052471079,
"mean_tpot": 0.045389059587552724,
"mean_intvty": 22.03174088837556,
"median_tpot": 0.04735952332744131,
"median_intvty": 21.11507738551445,
"std_tpot": 0.00764400244283125,
"std_intvty": 130.82151758570234,
"p99_tpot": 0.05540198583258537,
"p99_intvty": 18.04989451139561,
"mean_itl": 2.2099891187211536,
"median_itl": 2.059060634113848,
"std_itl": 1.1861203411472,
"p99_itl": 5.537419993691146,
"mean_e2el": 61.37412041979366,
"median_e2el": 59.02951031853445,
"std_e2el": 13.28996922486923,
"p99_e2el": 103.9737890055636
},
{
"hw": "gb200",
"conc": 8192,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 10,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 40,
"num_decode_gpu": 32,
"tput_per_gpu": 6784.322481285878,
"output_tput_per_gpu": 1695.7086557283164,
"input_tput_per_gpu": 10855.213541731926,
"mean_ttft": 104.1697003534269,
"median_ttft": 108.77370800008066,
"std_ttft": 24.902864980671353,
"p99_ttft": 144.97866864022788,
"mean_tpot": 0.02115121877466642,
"mean_intvty": 47.27859943455061,
"median_tpot": 0.021297992074692185,
"median_intvty": 46.95278298972946,
"std_tpot": 0.0012097151186533513,
"std_intvty": 826.6409046066937,
"p99_tpot": 0.022111509757319445,
"p99_intvty": 45.2253152758588,
"mean_itl": 1.029987184484207,
"median_itl": 1.058924971614033,
"std_itl": 0.22491745897869495,
"p99_itl": 1.4813344611320671,
"mean_e2el": 123.63751416483733,
"median_e2el": 128.33443384896964,
"std_e2el": 25.35750466814555,
"p99_e2el": 164.92041043341618
},
{
"hw": "gb200",
"conc": 4096,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "false",
"prefill_num_workers": 6,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 24,
"num_decode_gpu": 48,
"tput_per_gpu": 4087.1371969500738,
"output_tput_per_gpu": 681.8083229364833,
"input_tput_per_gpu": 10897.794944977255,
"mean_ttft": 87.26390536252737,
"median_ttft": 88.80856403918006,
"std_ttft": 24.792867821096138,
"p99_ttft": 142.47943472840353,
"mean_tpot": 0.01649263800631278,
"mean_intvty": 60.63311397589861,
"median_tpot": 0.016446754362602393,
"median_intvty": 60.80227003778079,
"std_tpot": 0.0005087030566835883,
"std_intvty": 1965.7833521177304,
"p99_tpot": 0.018461462020920787,
"p99_intvty": 54.16689094649091,
"mean_itl": 0.8033217643192261,
"median_itl": 0.8168667037971318,
"std_itl": 0.11931208877231939,
"p99_itl": 0.98169526765123,
"mean_e2el": 102.46116052240248,
"median_e2el": 104.12603010842577,
"std_e2el": 24.969835254052573,
"p99_e2el": 157.94925425253805
},
{
"hw": "gb200",
"conc": 1024,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "false",
"prefill_num_workers": 6,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 24,
"num_decode_gpu": 48,
"tput_per_gpu": 3761.781179008175,
"output_tput_per_gpu": 626.958468950556,
"input_tput_per_gpu": 10031.426599123413,
"mean_ttft": 12.885980097311768,
"median_ttft": 13.596795520512387,
"std_ttft": 5.236741227069504,
"p99_ttft": 28.989538250626065,
"mean_tpot": 0.01628865219987436,
"mean_intvty": 61.39243368507269,
"median_tpot": 0.016426459033301727,
"median_intvty": 60.877392867974635,
"std_tpot": 0.0004324491445238866,
"std_intvty": 2312.4106329334277,
"p99_tpot": 0.016640366701774486,
"p99_intvty": 60.09482951438579,
"mean_itl": 0.7928901611023834,
"median_itl": 0.8141690269112587,
"std_itl": 0.10988229273849873,
"p99_itl": 0.8658312988700345,
"mean_e2el": 27.889258251864156,
"median_e2el": 28.497861040988937,
"std_e2el": 5.377553653242919,
"p99_e2el": 44.33363082781902
},
{
"hw": "gb200",
"conc": 2048,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "false",
"prefill_num_workers": 6,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 24,
"num_decode_gpu": 48,
"tput_per_gpu": 4002.808100082403,
"output_tput_per_gpu": 667.271530620437,
"input_tput_per_gpu": 10673.881239006334,
"mean_ttft": 37.65794607501762,
"median_ttft": 39.70852206996642,
"std_ttft": 8.540045808639295,
"p99_ttft": 55.49668451129925,
"mean_tpot": 0.016409757405646146,
"mean_intvty": 60.93935304954158,
"median_tpot": 0.0164533262685425,
"median_intvty": 60.77798395768298,
"std_tpot": 0.0004252605171329334,
"std_intvty": 2351.499750651451,
"p99_tpot": 0.017065855799152253,
"p99_intvty": 58.59653402495496,
"mean_itl": 0.7994012313407735,
"median_itl": 0.8162725509610027,
"std_itl": 0.11115528297350057,
"p99_itl": 0.9037005412392318,
"mean_e2el": 52.76819057768371,
"median_e2el": 54.84260409977287,
"std_e2el": 8.76103758332825,
"p99_e2el": 70.70674430020621
},
{
"hw": "gb200",
"conc": 512,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "false",
"prefill_num_workers": 6,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 24,
"num_decode_gpu": 48,
"tput_per_gpu": 3105.3255984170055,
"output_tput_per_gpu": 517.7031139748201,
"input_tput_per_gpu": 8280.570567301376,
"mean_ttft": 2.8975655836626175,
"median_ttft": 1.5855348815675825,
"std_ttft": 3.474301628694044,
"p99_ttft": 14.917164506865666,
"mean_tpot": 0.015703236791405027,
"mean_intvty": 63.68113869029458,
"median_tpot": 0.015735316653411838,
"median_intvty": 63.551310852277844,
"std_tpot": 0.0003540094641515767,
"std_intvty": 2824.7832367889146,
"p99_tpot": 0.016252625234104884,
"p99_intvty": 61.52852142936127,
"mean_itl": 0.7645554208340372,
"median_itl": 0.7905032066628337,
"std_itl": 0.10526090907916322,
"p99_itl": 0.838319455999881,
"mean_e2el": 17.361102391320674,
"median_e2el": 16.352030610200018,
"std_e2el": 3.6055543525386757,
"p99_e2el": 29.617001811456863
},
{
"hw": "gb200",
"conc": 64,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 16,
"tput_per_gpu": 375.44385265466576,
"output_tput_per_gpu": 235.0241293000802,
"input_tput_per_gpu": 937.122746073008,
"mean_ttft": 0.8818457700675936,
"median_ttft": 0.5075267674401402,
"std_ttft": 1.0052275085955047,
"p99_ttft": 4.488622420388274,
"mean_tpot": 0.015236343188535794,
"mean_intvty": 65.63254631547187,
"median_tpot": 0.015224224721297452,
"median_intvty": 65.68478975491483,
"std_tpot": 0.0004815669204300915,
"std_intvty": 2076.554592053149,
"p99_tpot": 0.016259309393773378,
"p99_intvty": 61.503227214739965,
"mean_itl": 0.151543207105394,
"median_itl": 0.15180048602633178,
"std_itl": 0.013535187614350637,
"p99_itl": 0.18343118116259569,
"mean_e2el": 14.938425281729725,
"median_e2el": 14.871659737545997,
"std_e2el": 1.3740108012519834,
"p99_e2el": 18.8255235712789
},
{
"hw": "gb200",
"conc": 16,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 16,
"tput_per_gpu": 143.071766166406,
"output_tput_per_gpu": 89.75132249249081,
"input_tput_per_gpu": 356.35354086206684,
"mean_ttft": 0.4675352429563645,
"median_ttft": 0.41597130754962564,
"std_ttft": 0.23888123270142553,
"p99_ttft": 1.1092449634615331,
"mean_tpot": 0.010116444799594112,
"mean_intvty": 98.8489553207587,
"median_tpot": 0.010115498959088955,
"median_intvty": 98.85819810217886,
"std_tpot": 0.00015296851693100168,
"std_intvty": 6537.292902245121,
"p99_tpot": 0.010483977397825205,
"p99_intvty": 95.38364706961691,
"mean_itl": 0.10068832283059029,
"median_itl": 0.10075037716887891,
"std_itl": 0.006747417271433746,
"p99_itl": 0.1183991161361337,
"mean_e2el": 9.859238451730926,
"median_e2el": 9.909748350270092,
"std_e2el": 0.7018233936482876,
"p99_e2el": 11.484354316941463
},
{
"hw": "gb200",
"conc": 8,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 16,
"tput_per_gpu": 73.67802170293739,
"output_tput_per_gpu": 45.82468958301924,
"input_tput_per_gpu": 185.09135018260994,
"mean_ttft": 0.3560546513530426,
"median_ttft": 0.35162731888704,
"std_ttft": 0.11347185549516518,
"p99_ttft": 0.5749151017796248,
"mean_tpot": 0.010017714109746658,
"mean_intvty": 99.82317213734994,
"median_tpot": 0.01005080020599113,
"median_intvty": 99.49456555746826,
"std_tpot": 0.00014773423494103408,
"std_intvty": 6768.911758328292,
"p99_tpot": 0.010251017944846627,
"p99_intvty": 97.55128762629063,
"mean_itl": 0.09971388050086541,
"median_itl": 0.10039245709776878,
"std_itl": 0.005842907828226561,
"p99_itl": 0.10273956754244865,
"mean_e2el": 9.539702918275726,
"median_e2el": 9.542953822528943,
"std_e2el": 0.6401453629910774,
"p99_e2el": 10.69248531261459
},
{
"hw": "gb200",
"conc": 2,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 16,
"tput_per_gpu": 20.51759415879257,
"output_tput_per_gpu": 12.868888020687981,
"input_tput_per_gpu": 51.112418711210914,
"mean_ttft": 0.27287404574453833,
"median_ttft": 0.23262554011307657,
"std_ttft": 0.08944721279650793,
"p99_ttft": 0.42891968081705273,
"mean_tpot": 0.009171944777994975,
"mean_intvty": 109.02813135107036,
"median_tpot": 0.008994850728096313,
"median_intvty": 111.17471876174667,
"std_tpot": 0.0003122141709719296,
"std_intvty": 3202.9295687859967,
"p99_tpot": 0.009794950796378144,
"p99_intvty": 102.09341739314992,
"mean_itl": 0.09132173424737679,
"median_itl": 0.08997211465612054,
"std_itl": 0.005892889933495603,
"p99_itl": 0.09840347696095705,
"mean_e2el": 8.720134306838736,
"median_e2el": 8.58696065703407,
"std_e2el": 0.5746503898230754,
"p99_e2el": 9.665802180082537
},
{
"hw": "gb200",
"conc": 128,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 16,
"tput_per_gpu": 563.1459766828197,
"output_tput_per_gpu": 351.886906524398,
"input_tput_per_gpu": 1408.1822573165064,
"mean_ttft": 1.3760504107027374,
"median_ttft": 0.6143705889116973,
"std_ttft": 1.9890698955987054,
"p99_ttft": 8.626800129795447,
"mean_tpot": 0.01986763130984657,
"mean_intvty": 50.333126501315306,
"median_tpot": 0.020065379352192964,
"median_intvty": 49.83708418603654,
"std_tpot": 0.0008564779973943886,
"std_intvty": 1167.5723171432771,
"p99_tpot": 0.021062287549440006,
"p99_intvty": 47.47822370446117,
"mean_itl": 0.19769979535889773,
"median_itl": 0.19998887088149786,
"std_itl": 0.018493001569410562,
"p99_itl": 0.22282534037716686,
"mean_e2el": 19.667914975996972,
"median_e2el": 19.244546780129895,
"std_e2el": 2.5048592270609795,
"p99_e2el": 27.840543717113324
},
{
"hw": "gb200",
"conc": 4,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 16,
"tput_per_gpu": 38.26950596982279,
"output_tput_per_gpu": 24.139119630325858,
"input_tput_per_gpu": 94.7910513278105,
"mean_ttft": 0.24327271534129977,
"median_ttft": 0.20146217802539468,
"std_ttft": 0.07447981728113783,
"p99_ttft": 0.3985054380632937,
"mean_tpot": 0.009581933727665897,
"mean_intvty": 104.36306787561075,
"median_tpot": 0.009830279452508943,
"median_intvty": 101.72650786084967,
"std_tpot": 0.00045644521595466646,
"std_intvty": 2190.843424458892,
"p99_tpot": 0.010134376502812816,
"p99_intvty": 98.67405258946596,
"mean_itl": 0.0955243845237419,
"median_itl": 0.09841511212289333,
"std_itl": 0.006479038414704534,
"p99_itl": 0.10250010620802641,
"mean_e2el": 9.203459822060541,
"median_e2el": 9.3679274674505,
"std_e2el": 0.764437888638556,
"p99_e2el": 10.308343569058925
},
{
"hw": "gb200",
"conc": 2048,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 5,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 40,
"num_decode_gpu": 32,
"tput_per_gpu": 5599.4549134595045,
"output_tput_per_gpu": 1400.1508780800655,
"input_tput_per_gpu": 8958.898141763057,
"mean_ttft": 13.233085852793328,
"median_ttft": 11.126394821563736,
"std_ttft": 8.278085633460465,
"p99_ttft": 38.5047320000129,
"mean_tpot": 0.025310788830443715,
"mean_intvty": 39.50884370688613,
"median_tpot": 0.02561355912132794,
"median_intvty": 39.04182137527769,
"std_tpot": 0.0008031765963437118,
"std_intvty": 1245.056198789014,
"p99_tpot": 0.02602812437722673,
"p99_intvty": 38.41997930803453,
"mean_itl": 1.2330233341031456,
"median_itl": 1.2741784744430333,
"std_itl": 0.1708768179069946,
"p99_itl": 1.3643192769493908,
"mean_e2el": 36.53963502088995,
"median_e2el": 34.59476164332591,
"std_e2el": 8.424562014231682,
"p99_e2el": 61.788403905010846
},
{
"hw": "gb200",
"conc": 512,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 5,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 40,
"num_decode_gpu": 32,
"tput_per_gpu": 2355.1135991344686,
"output_tput_per_gpu": 588.9477312739256,
"input_tput_per_gpu": 3768.0462934229026,
"mean_ttft": 3.425558817842102,
"median_ttft": 2.639367827679962,
"std_ttft": 2.534635329175326,
"p99_ttft": 11.234224612112156,
"mean_tpot": 0.021507627392777395,
"mean_intvty": 46.49513317939551,
"median_tpot": 0.021536295730561195,
"median_intvty": 46.43324054010573,
"std_tpot": 0.0001519297739204557,
"std_intvty": 6581.988337082365,
"p99_tpot": 0.021771255759709614,
"p99_intvty": 45.932123118530576,
"mean_itl": 1.0471515370221602,
"median_itl": 1.0691044512204826,
"std_itl": 0.14008498063816677,
"p99_itl": 1.143286517560482,
"mean_e2el": 23.235129742783464,
"median_e2el": 22.689431889681146,
"std_e2el": 2.886579203705529,
"p99_e2el": 31.71446864702273
},
{
"hw": "gb200",
"conc": 6144,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 5,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 40,
"num_decode_gpu": 32,
"tput_per_gpu": 6265.352076337932,
"output_tput_per_gpu": 1566.2544224221492,
"input_tput_per_gpu": 10024.63019947056,
"mean_ttft": 74.96164737716937,
"median_ttft": 77.32922788779251,
"std_ttft": 18.964813603324465,
"p99_ttft": 109.80713061378343,
"mean_tpot": 0.02554612090948383,
"mean_intvty": 39.1448863623266,
"median_tpot": 0.025736523578422967,
"median_intvty": 38.855286610596536,
"std_tpot": 0.0009436216917726257,
"std_intvty": 1059.746727654666,
"p99_tpot": 0.026220846524514313,
"p99_intvty": 38.13759403477241,
"mean_itl": 1.244171691409493,
"median_itl": 1.2819455917924643,
"std_itl": 0.19954723822378284,
"p99_itl": 1.4902407355606555,
"mean_e2el": 98.48131178130187,
"median_e2el": 101.02934135007672,
"std_e2el": 19.30007315341342,
"p99_e2el": 132.88935459087142
},
{
"hw": "gb200",
"conc": 1024,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 5,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 40,
"num_decode_gpu": 32,
"tput_per_gpu": 4075.459402828448,
"output_tput_per_gpu": 1018.8566263739281,
"input_tput_per_gpu": 6520.741623992064,
"mean_ttft": 4.896853416778595,
"median_ttft": 3.245639310684055,
"std_ttft": 4.356257526226778,
"p99_ttft": 19.664627267494797,
"mean_tpot": 0.023402115874897107,
"mean_intvty": 42.731178896207254,
"median_tpot": 0.023559723827725677,
"median_intvty": 42.4453192793022,
"std_tpot": 0.0004644578379186158,
"std_intvty": 2153.0479590597934,
"p99_tpot": 0.023956553878784488,
"p99_intvty": 41.74223075070839,
"mean_itl": 1.1391745503239032,
"median_itl": 1.1698580416850746,
"std_itl": 0.15694369827711718,
"p99_itl": 1.2542010499443859,
"mean_e2el": 26.452616772328838,
"median_e2el": 25.148906852584332,
"std_e2el": 4.593142413185315,
"p99_e2el": 41.89651944483631
},
{
"hw": "gb200",
"conc": 32,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 2,
"num_prefill_gpu": 4,
"num_decode_gpu": 8,
"tput_per_gpu": 447.0460893514149,
"output_tput_per_gpu": 333.509406568345,
"input_tput_per_gpu": 674.1194549175547,
"mean_ttft": 0.7109239787416299,
"median_ttft": 0.480608626967296,
"std_ttft": 0.5180063034299224,
"p99_ttft": 2.214952013674192,
"mean_tpot": 0.01069242265020255,
"mean_intvty": 93.52417433490217,
"median_tpot": 0.010701102880099301,
"median_intvty": 93.44831193611705,
"std_tpot": 0.0003202247660211241,
"std_intvty": 3122.806559983663,
"p99_tpot": 0.011238689652109423,
"p99_intvty": 88.97834453612722,
"mean_itl": 0.10639879431507028,
"median_itl": 0.10645183734595776,
"std_itl": 0.008778111365103316,
"p99_itl": 0.12890150891616942,
"mean_e2el": 10.492298024031333,
"median_e2el": 10.498955344315618,
"std_e2el": 0.9025362342362084,
"p99_e2el": 12.539556746534071
},
{
"hw": "gb200",
"conc": 4,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 2,
"num_prefill_gpu": 4,
"num_decode_gpu": 8,
"tput_per_gpu": 82.65188617807556,
"output_tput_per_gpu": 62.56084213988763,
"input_tput_per_gpu": 122.8339742544514,
"mean_ttft": 0.22575995295774193,
"median_ttft": 0.21467791497707367,
"std_ttft": 0.05138918762303447,
"p99_ttft": 0.3582327732024714,
"mean_tpot": 0.007488533579132299,
"mean_intvty": 133.53749294609835,
"median_tpot": 0.007497622244463874,
"median_intvty": 133.37561794852817,
"std_tpot": 5.9797879873385174e-05,
"std_intvty": 16723.00091771447,
"p99_tpot": 0.007563853008666901,
"p99_intvty": 132.207751638506,
"mean_itl": 0.07458723335167461,
"median_itl": 0.07494980306364596,
"std_itl": 0.003496905351928491,
"p99_itl": 0.07588093949016184,
"mean_e2el": 7.222042228467762,
"median_e2el": 7.315442324848846,
"std_e2el": 0.43357066130846833,
"p99_e2el": 7.8213283482799305
},
{
"hw": "gb200",
"conc": 64,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 2,
"num_prefill_gpu": 4,
"num_decode_gpu": 8,
"tput_per_gpu": 681.3141522007851,
"output_tput_per_gpu": 511.79508499694253,
"input_tput_per_gpu": 1020.3522866084703,
"mean_ttft": 0.6729042503138771,
"median_ttft": 0.4711622374597937,
"std_ttft": 0.6022211697428648,
"p99_ttft": 2.8237204687250776,
"mean_tpot": 0.01426673833245062,
"mean_intvty": 70.09310584504345,
"median_tpot": 0.014261253649807805,
"median_intvty": 70.12006269262848,
"std_tpot": 0.00047128618159581213,
"std_intvty": 2121.85300365464,
"p99_tpot": 0.015048190125560966,
"p99_intvty": 66.45317421271763,
"mean_itl": 0.14191462228522855,
"median_itl": 0.1436282100621611,
"std_itl": 0.011835677476357078,
"p99_itl": 0.1565247353259474,
"mean_e2el": 13.83637233835907,
"median_e2el": 13.821404171641916,
"std_e2el": 1.0827867273385017,
"p99_e2el": 16.57777675488498
},
{
"hw": "gb200",
"conc": 8,
"image": "lmsysorg/sglang:v0.5.5.post2",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 2,
"num_prefill_gpu": 4,
"num_decode_gpu": 8,
"tput_per_gpu": 165.40808255595294,
"output_tput_per_gpu": 123.4524032397592,
"input_tput_per_gpu": 249.31944118834042,
"mean_ttft": 0.2937907711253501,
"median_ttft": 0.2966091933194548,
"std_ttft": 0.0640808702440623,
"p99_ttft": 0.41193882464431225,
"mean_tpot": 0.007471419681128073,
"mean_intvty": 133.8433714981749,
"median_tpot": 0.007468156203853823,
"median_intvty": 133.90185913411477,
"std_tpot": 5.908044944980207e-05,
"std_intvty": 16926.072995596518,
"p99_tpot": 0.00760405357353736,
"p99_intvty": 131.50880518255028,
"mean_itl": 0.07437089038709427,
"median_itl": 0.07463834690861404,
"std_itl": 0.004356671265132124,
"p99_itl": 0.08058916965965185,
"mean_e2el": 7.1433496549492705,
"median_e2el": 7.219220030121505,
"std_e2el": 0.4649305346507448,
"p99_e2el": 7.892579311686568
}
]
[
{
"hw": "b200",
"conc": 32,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 641.2799367245719,
"output_tput_per_gpu": 321.14790576674005,
"input_tput_per_gpu": 320.1320309578319,
"mean_ttft": 7.429033456041362,
"median_ttft": 0.4049194911494851,
"std_ttft": 21.507662512924416,
"p99_ttft": 73.2787645055307,
"mean_tpot": 0.016500912105395376,
"mean_intvty": 60.602710541863054,
"median_tpot": 0.016331182684193535,
"median_intvty": 61.23255243282963,
"std_tpot": 0.004724498328625295,
"std_intvty": 211.6626846793644,
"p99_tpot": 0.01710460220207336,
"p99_intvty": 58.46379753156629,
"mean_itl": 0.16403007508055295,
"median_itl": 0.14787723892368376,
"std_itl": 0.4242816317900714,
"p99_itl": 0.25230101314838976,
"mean_e2el": 22.643847975885727,
"median_e2el": 15.617675733519718,
"std_e2el": 21.843125195522077,
"p99_e2el": 90.13985173074529
},
{
"hw": "b200-trt",
"conc": 16,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 204.9070644391866,
"output_tput_per_gpu": 101.91109312229496,
"input_tput_per_gpu": 102.99597131689164,
"mean_ttft": 0.32274896054441343,
"median_ttft": 0.162723705987446,
"std_ttft": 0.5147236327123939,
"p99_ttft": 2.621198953207931,
"mean_tpot": 0.01887051264378247,
"mean_intvty": 52.992730980707286,
"median_tpot": 0.018958024481654227,
"median_intvty": 52.74811207083865,
"std_tpot": 0.000609820012909863,
"std_intvty": 1639.8281112952081,
"p99_tpot": 0.020053270736644223,
"p99_intvty": 49.86717693750856,
"mean_itl": 0.18765400995375867,
"median_itl": 0.17171890201279894,
"std_itl": 0.05690608556211947,
"p99_itl": 0.3957180514396168,
"mean_e2el": 17.59278183688748,
"median_e2el": 17.577111709993915,
"std_e2el": 1.4441395601241613,
"p99_e2el": 21.31427635858592
},
{
"hw": "gb200",
"conc": 4300,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "true",
"prefill_num_workers": 2,
"decode_tp": 16,
"decode_ep": 16,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 8,
"num_decode_gpu": 16,
"tput_per_gpu": 8772.066417069726,
"output_tput_per_gpu": 6576.648712718563,
"input_tput_per_gpu": 13162.901825772049,
"mean_ttft": 4.417850822288784,
"median_ttft": 2.6825136054540053,
"std_ttft": 5.863712313142769,
"p99_ttft": 29.946521612133367,
"mean_tpot": 0.033614227161874864,
"mean_intvty": 29.749308088635647,
"median_tpot": 0.03357622255888582,
"median_intvty": 29.78298104398744,
"std_tpot": 0.0015633774147045258,
"std_intvty": 639.6408126370416,
"p99_tpot": 0.037157022405837896,
"p99_intvty": 26.912813117201928,
"mean_itl": 0.6646102327497304,
"median_itl": 0.5519979861564934,
"std_itl": 0.3518335787505597,
"p99_itl": 1.7907296301797035,
"mean_e2el": 35.35489678359931,
"median_e2el": 34.07474009040743,
"std_e2el": 6.129267594599042,
"p99_e2el": 60.59256729653575
},
{
"hw": "gb200",
"conc": 2048,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "true",
"prefill_num_workers": 2,
"decode_tp": 16,
"decode_ep": 16,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 8,
"num_decode_gpu": 16,
"tput_per_gpu": 6197.878317247915,
"output_tput_per_gpu": 4649.727217836739,
"input_tput_per_gpu": 9294.180516070266,
"mean_ttft": 1.5304020429895218,
"median_ttft": 0.5354841455118731,
"std_ttft": 3.0229475730882416,
"p99_ttft": 14.324054803305772,
"mean_tpot": 0.024446319285584682,
"mean_intvty": 40.90595350236108,
"median_tpot": 0.0244035563480784,
"median_intvty": 40.977633986480114,
"std_tpot": 0.00038243990509568583,
"std_intvty": 2614.789896859225,
"p99_tpot": 0.02536731993753382,
"p99_intvty": 39.420798194782364,
"mean_itl": 0.48334837704451544,
"median_itl": 0.48330555483698845,
"std_itl": 0.14492835540308474,
"p99_itl": 0.8767040200764313,
"mean_e2el": 24.053155971355096,
"median_e2el": 23.445035883109085,
"std_e2el": 3.2839838888592268,
"p99_e2el": 36.94438854399603
},
{
"hw": "h200",
"conc": 64,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 456.71277457392324,
"output_tput_per_gpu": 228.30491853874665,
"input_tput_per_gpu": 228.4078560351766,
"mean_ttft": 0.9947605113498866,
"median_ttft": 0.13977374229580164,
"std_ttft": 2.71762373044683,
"p99_ttft": 12.339342405293138,
"mean_tpot": 0.033114762817580495,
"mean_intvty": 30.198011850747847,
"median_tpot": 0.03312615881115265,
"median_intvty": 30.18762319231918,
"std_tpot": 0.0016327572044284732,
"std_intvty": 612.4609325181559,
"p99_tpot": 0.039150759374931905,
"p99_intvty": 25.542288731193715,
"mean_itl": 0.32942854207200317,
"median_itl": 0.30126926489174366,
"std_itl": 0.11948468602180282,
"p99_itl": 0.5422460364550352,
"mean_e2el": 31.50653484758659,
"median_e2el": 30.844364249147475,
"std_e2el": 4.088792376153664,
"p99_e2el": 45.05541437757201
},
{
"hw": "mi355x",
"conc": 32,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "amd/DeepSeek-R1-0528-MXFP4-Preview",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 499.07528254880015,
"output_tput_per_gpu": 249.30997391797308,
"input_tput_per_gpu": 249.76530863082706,
"mean_ttft": 0.23429270649787212,
"median_ttft": 0.13443481949798297,
"std_ttft": 0.4981070310754062,
"p99_ttft": 3.693676319500664,
"mean_tpot": 0.0316819662921726,
"mean_intvty": 31.563697492067018,
"median_tpot": 0.031497857283734415,
"median_intvty": 31.748191344952307,
"std_tpot": 0.0013645471132767524,
"std_intvty": 732.8438793136662,
"p99_tpot": 0.03544520550343103,
"p99_intvty": 28.212560367387397,
"mean_itl": 0.0316887455722762,
"median_itl": 0.02808956999797374,
"std_itl": 0.020226387532098964,
"p99_itl": 0.09138488282158501,
"mean_e2el": 29.373262552226716,
"median_e2el": 29.33710997349408,
"std_e2el": 2.402231403875462,
"p99_e2el": 35.906585603464045
},
{
"hw": "mi355x",
"conc": 16,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "amd/DeepSeek-R1-0528-MXFP4-Preview",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 306.0451682972859,
"output_tput_per_gpu": 152.7598980530619,
"input_tput_per_gpu": 153.285270244224,
"mean_ttft": 0.1987650581939306,
"median_ttft": 0.12773075599397998,
"std_ttft": 0.26848966213290126,
"p99_ttft": 1.9722618721981418,
"mean_tpot": 0.025809103184290905,
"mean_intvty": 38.74601890888889,
"median_tpot": 0.02573446974101467,
"median_intvty": 38.858387604786586,
"std_tpot": 0.0008283913234227633,
"std_intvty": 1207.1589497922075,
"p99_tpot": 0.027493696731644646,
"p99_intvty": 36.37197317481944,
"mean_itl": 0.02581056827574398,
"median_itl": 0.024163800000678748,
"std_itl": 0.015743161240447973,
"p99_itl": 0.08529472369060387,
"mean_e2el": 23.935066793745317,
"median_e2el": 24.009372471511597,
"std_e2el": 1.7058797348005414,
"p99_e2el": 27.448370655650503
},
{
"hw": "b200-trt",
"conc": 32,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 897.4092021244418,
"output_tput_per_gpu": 449.4154102966225,
"input_tput_per_gpu": 447.9937918278193,
"mean_ttft": 0.23239482318022056,
"median_ttft": 0.09115337789990008,
"std_ttft": 0.4398902793603599,
"p99_ttft": 2.3007365879649297,
"mean_tpot": 0.01717118986287973,
"mean_intvty": 58.237082461115655,
"median_tpot": 0.017297669240869083,
"median_intvty": 57.81125688525174,
"std_tpot": 0.0005474466058495037,
"std_intvty": 1826.6621608663436,
"p99_tpot": 0.01781354143178208,
"p99_intvty": 56.13706874792719,
"mean_itl": 0.17075729934448494,
"median_itl": 0.15842054528184235,
"std_itl": 0.03812959581703131,
"p99_itl": 0.3030505743343383,
"mean_e2el": 16.080806504502835,
"median_e2el": 16.073315837420523,
"std_e2el": 1.2719675075158121,
"p99_e2el": 19.26560603603255
},
{
"hw": "b200",
"conc": 4,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 111.19512760365721,
"output_tput_per_gpu": 55.563619705252165,
"input_tput_per_gpu": 55.63150789840505,
"mean_ttft": 0.16721754746278747,
"median_ttft": 0.1576263221213594,
"std_ttft": 0.07806530207730589,
"p99_ttft": 0.6859068887284957,
"mean_tpot": 0.008764168852895795,
"mean_intvty": 114.10095090415642,
"median_tpot": 0.008774242714577298,
"median_intvty": 113.96994960472499,
"std_tpot": 0.00011516409095233161,
"std_intvty": 8683.262219418006,
"p99_tpot": 0.009054326213521549,
"p99_intvty": 110.44444129996334,
"mean_itl": 0.2586013330775246,
"median_itl": 0.2543260745005682,
"std_itl": 0.040273445530031114,
"p99_itl": 0.35306476534111425,
"mean_e2el": 8.227820904346881,
"median_e2el": 8.210300613427535,
"std_e2el": 0.5410468130695615,
"p99_e2el": 9.210824889196083
},
{
"hw": "b200-trt",
"conc": 4,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 229.93456671512357,
"output_tput_per_gpu": 114.40785081455563,
"input_tput_per_gpu": 115.52671590056794,
"mean_ttft": 0.13291599152435082,
"median_ttft": 0.10962579149054363,
"std_ttft": 0.07607168375939898,
"p99_ttft": 0.423498869598261,
"mean_tpot": 0.008351382659914789,
"mean_intvty": 119.74065142528187,
"median_tpot": 0.008351485050254043,
"median_intvty": 119.73918338865745,
"std_tpot": 0.0001518442357067412,
"std_intvty": 6585.696159920838,
"p99_tpot": 0.008551696796468116,
"p99_intvty": 116.93585773679486,
"mean_itl": 0.08297909140369604,
"median_itl": 0.08138331401278265,
"std_itl": 0.014840220931810479,
"p99_itl": 0.1608087609143695,
"mean_e2el": 7.789811460523924,
"median_e2el": 7.800381242501317,
"std_e2el": 0.5647767090453408,
"p99_e2el": 8.863045023245213
},
{
"hw": "b200",
"conc": 128,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 1765.9572259169158,
"output_tput_per_gpu": 882.0097961255168,
"input_tput_per_gpu": 883.947429791399,
"mean_ttft": 7.9145328841425,
"median_ttft": 0.6534272480057552,
"std_ttft": 22.013923569348933,
"p99_ttft": 75.18259847427485,
"mean_tpot": 0.026936163368016776,
"mean_intvty": 37.12481196143068,
"median_tpot": 0.02710567983928478,
"median_intvty": 36.892636743634846,
"std_tpot": 0.0024170482056075464,
"std_intvty": 413.7277848575805,
"p99_tpot": 0.028612954982715915,
"p99_intvty": 34.94920397435585,
"mean_itl": 0.26823513125919995,
"median_itl": 0.234339507878758,
"std_itl": 0.23480635432712735,
"p99_itl": 0.5736454221210443,
"mean_e2el": 32.682275038535224,
"median_e2el": 25.61190285696648,
"std_e2el": 22.249872590635338,
"p99_e2el": 102.19253178208368
},
{
"hw": "gb200",
"conc": 2048,
"image": "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1-rc0.pre1",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 3,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 24,
"num_decode_gpu": 48,
"tput_per_gpu": 2254.58332699924,
"output_tput_per_gpu": 1690.9872901005592,
"input_tput_per_gpu": 3381.7754007966014,
"mean_ttft": 2.214189830340456,
"median_ttft": 1.055184896918945,
"std_ttft": 2.408734280176251,
"p99_ttft": 8.548036242406816,
"mean_tpot": 0.021043439521847215,
"mean_intvty": 47.52074863815889,
"median_tpot": 0.021092488528798803,
"median_intvty": 47.410242685904116,
"std_tpot": 0.00032801274110838866,
"std_intvty": 3048.6620630067523,
"p99_tpot": 0.021615610726902403,
"p99_intvty": 46.26286125496412,
"mean_itl": 1.0251579416142114,
"median_itl": 1.037793020484969,
"std_itl": 0.1440633617716875,
"p99_itl": 1.228154654381797,
"mean_e2el": 21.594880694794846,
"median_e2el": 21.071244221995585,
"std_e2el": 2.6360387943642634,
"p99_e2el": 28.974507761406716
},
{
"hw": "gb200",
"conc": 4096,
"image": "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1-rc0.pre1",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 3,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 24,
"num_decode_gpu": 48,
"tput_per_gpu": 3819.240348384917,
"output_tput_per_gpu": 2865.058363518239,
"input_tput_per_gpu": 5727.604318118273,
"mean_ttft": 3.4467532725996763,
"median_ttft": 1.2692942654248327,
"std_ttft": 4.679917411303755,
"p99_ttft": 17.442662313824986,
"mean_tpot": 0.023834804036256062,
"mean_intvty": 41.955452978713836,
"median_tpot": 0.024033075363934803,
"median_intvty": 41.60932318718762,
"std_tpot": 0.0007782127221459672,
"std_intvty": 1284.9956978889286,
"p99_tpot": 0.025030093990295838,
"p99_intvty": 39.95190750732697,
"mean_itl": 1.160987070265976,
"median_itl": 1.1759424770716578,
"std_itl": 0.19865541049288465,
"p99_itl": 1.7058381702192131,
"mean_e2el": 25.40658549897133,
"median_e2el": 23.887357637868263,
"std_e2el": 4.934672434215348,
"p99_e2el": 40.410193555047734
},
{
"hw": "gb200",
"conc": 1024,
"image": "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1-rc0.pre1",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 3,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 24,
"num_decode_gpu": 48,
"tput_per_gpu": 596.52577507071,
"output_tput_per_gpu": 447.60049140073176,
"input_tput_per_gpu": 894.3763424106663,
"mean_ttft": 22.808323376209593,
"median_ttft": 0.6381742671364918,
"std_ttft": 37.989100830794655,
"p99_ttft": 144.71110697851518,
"mean_tpot": 0.019870145474881782,
"mean_intvty": 50.32675786214643,
"median_tpot": 0.019904019936956542,
"median_intvty": 50.24110723197491,
"std_tpot": 0.00030198810316582446,
"std_intvty": 3311.388725306476,
"p99_tpot": 0.02050075042601824,
"p99_intvty": 48.778702204523405,
"mean_itl": 0.9672413699327611,
"median_itl": 0.9874429774936289,
"std_itl": 0.13335430569667675,
"p99_itl": 1.1014580716658382,
"mean_e2el": 41.115633187459935,
"median_e2el": 19.922846795874648,
"std_e2el": 37.93415756708192,
"p99_e2el": 163.28720403442392
},
{
"hw": "h200",
"conc": 32,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 295.81460558317224,
"output_tput_per_gpu": 148.14160811497888,
"input_tput_per_gpu": 147.67299746819336,
"mean_ttft": 0.47806689820135945,
"median_ttft": 0.12522324442397803,
"std_ttft": 1.0877645907867788,
"p99_ttft": 4.820350355247501,
"mean_tpot": 0.025896730488496424,
"mean_intvty": 38.614913200884935,
"median_tpot": 0.025927905091378087,
"median_intvty": 38.5684842826941,
"std_tpot": 0.0008744587659171604,
"std_intvty": 1143.5644983799414,
"p99_tpot": 0.027313797021867186,
"p99_intvty": 36.61153369483594,
"mean_itl": 0.25752916184175595,
"median_itl": 0.24227966740727425,
"std_itl": 0.05412672899346776,
"p99_itl": 0.39470253719249737,
"mean_e2el": 24.37999208016263,
"median_e2el": 24.36240943497978,
"std_e2el": 2.1026897143543195,
"p99_e2el": 30.27027516528964
},
{
"hw": "b200-trt",
"conc": 8,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 123.9594290912726,
"output_tput_per_gpu": 62.20946588953585,
"input_tput_per_gpu": 61.74996320173675,
"mean_ttft": 0.20791010695975273,
"median_ttft": 0.13392320391722023,
"std_ttft": 0.23873498713961971,
"p99_ttft": 1.2763758396357294,
"mean_tpot": 0.015510140992766677,
"mean_intvty": 64.47394646292132,
"median_tpot": 0.015545382547207966,
"median_intvty": 64.32778331206815,
"std_tpot": 0.0002790458830966927,
"std_intvty": 3583.6400412095963,
"p99_tpot": 0.016258878910921846,
"p99_intvty": 61.50485562250257,
"mean_itl": 0.15420951023277643,
"median_itl": 0.149389476981014,
"std_itl": 0.030909557339096977,
"p99_itl": 0.2470420343056321,
"mean_e2el": 14.60529533646186,
"median_e2el": 14.643656376749277,
"std_e2el": 1.0323764703938179,
"p99_e2el": 16.53920134677086
},
{
"hw": "mi325x",
"conc": 4,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 44.127972793849686,
"output_tput_per_gpu": 21.95662270476474,
"input_tput_per_gpu": 22.171350089084946,
"mean_ttft": 0.18732226876018104,
"median_ttft": 0.13417264650342986,
"std_ttft": 0.1871872933414264,
"p99_ttft": 0.8453631609748118,
"mean_tpot": 0.02184903528048016,
"mean_intvty": 45.768611161216626,
"median_tpot": 0.021843753863896573,
"median_intvty": 45.77967716678969,
"std_tpot": 0.00016489244187872643,
"std_intvty": 6064.559349151193,
"p99_tpot": 0.02233645603919966,
"p99_intvty": 44.76985956254819,
"mean_itl": 0.02184876543590368,
"median_itl": 0.021517455927096307,
"std_itl": 0.005299868380454074,
"p99_itl": 0.024133532755076886,
"mean_e2el": 20.2166304375045,
"median_e2el": 20.32324968004832,
"std_e2el": 1.4052141974331072,
"p99_e2el": 22.871388312890193
},
{
"hw": "mi355x",
"conc": 8,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "amd/DeepSeek-R1-0528-MXFP4-Preview",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 140.00698736060681,
"output_tput_per_gpu": 70.19094865775841,
"input_tput_per_gpu": 69.8160387028484,
"mean_ttft": 0.18259024413724545,
"median_ttft": 0.1001861539989477,
"std_ttft": 0.17596723566911493,
"p99_ttft": 1.0825899649425992,
"mean_tpot": 0.013947495535800778,
"mean_intvty": 71.69745976495746,
"median_tpot": 0.013855465332991,
"median_intvty": 72.17368568769162,
"std_tpot": 0.0003825080453294319,
"std_intvty": 2614.324096474254,
"p99_tpot": 0.015049243699052927,
"p99_intvty": 66.44852193223049,
"mean_itl": 0.0139489930462004,
"median_itl": 0.01299674999609124,
"std_itl": 0.01347560963089915,
"p99_itl": 0.017627689233922498,
"mean_e2el": 13.089244616760334,
"median_e2el": 13.033233756999834,
"std_e2el": 0.9139494301676425,
"p99_e2el": 15.35890747389174
},
{
"hw": "b200-trt",
"conc": 8,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 220.9558926044314,
"output_tput_per_gpu": 110.88747475552128,
"input_tput_per_gpu": 110.06841784891012,
"mean_ttft": 0.12296479816432111,
"median_ttft": 0.08433223608881235,
"std_ttft": 0.12065941863024443,
"p99_ttft": 0.650510452305898,
"mean_tpot": 0.008710679725442714,
"mean_intvty": 114.8016034935983,
"median_tpot": 0.008731782431272827,
"median_intvty": 114.52415447486483,
"std_tpot": 0.0001848908743781026,
"std_intvty": 5408.595764196538,
"p99_tpot": 0.008956694505365716,
"p99_intvty": 111.64833180376162,
"mean_itl": 0.0866201442164717,
"median_itl": 0.08309970516711473,
"std_itl": 0.01830391695428319,
"p99_itl": 0.147244536653161,
"mean_e2el": 8.210037848516368,
"median_e2el": 8.292171580484137,
"std_e2el": 0.5918520767172697,
"p99_e2el": 9.20223142792005
},
{
"hw": "h200",
"conc": 64,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 492.19327296594696,
"output_tput_per_gpu": 246.04116929867362,
"input_tput_per_gpu": 246.15210366727334,
"mean_ttft": 0.3294624885340454,
"median_ttft": 0.14435125421732664,
"std_ttft": 0.5078944623912816,
"p99_ttft": 2.109193651378155,
"mean_tpot": 0.03143482613417446,
"mean_intvty": 31.811850834856287,
"median_tpot": 0.03206727878971416,
"median_intvty": 31.18443590295408,
"std_tpot": 0.0018697724283459685,
"std_intvty": 534.8244443226797,
"p99_tpot": 0.033419055362707145,
"p99_intvty": 29.923048067837243,
"mean_itl": 0.3130929929381459,
"median_itl": 0.2652675621211529,
"std_itl": 0.09325036784693132,
"p99_itl": 0.6344047609344124,
"mean_e2el": 29.297902333908134,
"median_e2el": 29.55287105590105,
"std_e2el": 2.662894988399935,
"p99_e2el": 33.959484479147946
},
{
"hw": "b200",
"conc": 16,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 282.93776002462636,
"output_tput_per_gpu": 141.22602757361327,
"input_tput_per_gpu": 141.7117324510131,
"mean_ttft": 0.3285720966418739,
"median_ttft": 0.30267622182145715,
"std_ttft": 0.18123222154615,
"p99_ttft": 1.3213041639281438,
"mean_tpot": 0.01369717992996946,
"mean_intvty": 73.00772897142117,
"median_tpot": 0.013752427774858886,
"median_intvty": 72.71443387095054,
"std_tpot": 0.0003402537420072122,
"std_intvty": 2938.9831074328154,
"p99_tpot": 0.014221395040691245,
"p99_intvty": 70.31658969733492,
"mean_itl": 0.4046281688127338,
"median_itl": 0.38193408865481615,
"std_itl": 0.0653815429299919,
"p99_itl": 0.5050839721411464,
"mean_e2el": 12.926164196388562,
"median_e2el": 12.94644465087913,
"std_e2el": 0.8653985453691946,
"p99_e2el": 14.583644766402431
},
{
"hw": "b200",
"conc": 128,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 1031.8263749958364,
"output_tput_per_gpu": 515.3471201288461,
"input_tput_per_gpu": 516.4792548669903,
"mean_ttft": 8.047447742966323,
"median_ttft": 0.5291580425109714,
"std_ttft": 22.739150392137557,
"p99_ttft": 77.40256422179519,
"mean_tpot": 0.02167746646369556,
"mean_intvty": 46.13085213047174,
"median_tpot": 0.02176623053086376,
"median_intvty": 45.94272759272832,
"std_tpot": 0.0026811437126168193,
"std_intvty": 372.975158061927,
"p99_tpot": 0.023366678003612758,
"p99_intvty": 42.79598494254889,
"mean_itl": 0.21577838987195336,
"median_itl": 0.18363446393050253,
"std_itl": 0.23810750019698657,
"p99_itl": 0.537630925734993,
"mean_e2el": 27.971547497398934,
"median_e2el": 20.622131744516082,
"std_e2el": 22.92224569320669,
"p99_e2el": 99.02907122727484
},
{
"hw": "mi355x",
"conc": 16,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 175.95277232368952,
"output_tput_per_gpu": 87.82536156954185,
"input_tput_per_gpu": 88.12741075414766,
"mean_ttft": 0.11519924342464946,
"median_ttft": 0.10491188800369855,
"std_ttft": 0.06769445317163822,
"p99_ttft": 0.6158818911685375,
"mean_tpot": 0.02244746672902508,
"mean_intvty": 44.54845671770069,
"median_tpot": 0.02245250594216652,
"median_intvty": 44.53845831622599,
"std_tpot": 0.000301557232503499,
"std_intvty": 3316.1200999826683,
"p99_tpot": 0.023019226943117633,
"p99_intvty": 43.44194539942982,
"mean_itl": 0.02244832270104978,
"median_itl": 0.021457998504047282,
"std_itl": 0.0063888792957696615,
"p99_itl": 0.06889111073454844,
"mean_e2el": 20.75946228062159,
"median_e2el": 20.800773798502632,
"std_e2el": 1.3378757943128958,
"p99_e2el": 23.093859420334628
},
{
"hw": "b200-trt",
"conc": 16,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 542.1618666266606,
"output_tput_per_gpu": 269.6456982992158,
"input_tput_per_gpu": 272.51616832744486,
"mean_ttft": 0.19474708190791717,
"median_ttft": 0.11112058749131393,
"std_ttft": 0.26386422530551684,
"p99_ttft": 1.4121461816126244,
"mean_tpot": 0.014355811529131202,
"mean_intvty": 69.6582006507102,
"median_tpot": 0.014387425535050439,
"median_intvty": 69.50513818916487,
"std_tpot": 0.0005065612734389114,
"std_intvty": 1974.094847818237,
"p99_tpot": 0.0150497070429755,
"p99_intvty": 66.44647614365047,
"mean_itl": 0.14276671962343918,
"median_itl": 0.13329885000712238,
"std_itl": 0.03508060589322459,
"p99_itl": 0.2761659569863696,
"mean_e2el": 13.333746583119318,
"median_e2el": 13.328474082998582,
"std_e2el": 1.0801044653636331,
"p99_e2el": 15.695636622975579
},
{
"hw": "mi355x",
"conc": 64,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "amd/DeepSeek-R1-0528-MXFP4-Preview",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 620.9878513301078,
"output_tput_per_gpu": 310.5995673004801,
"input_tput_per_gpu": 310.3882840296277,
"mean_ttft": 0.19784741542786832,
"median_ttft": 0.1088224050035933,
"std_ttft": 0.3966024749883867,
"p99_ttft": 2.681221848351415,
"mean_tpot": 0.025376081027359058,
"mean_intvty": 39.407188167544724,
"median_tpot": 0.024855360171440733,
"median_intvty": 40.23277044076064,
"std_tpot": 0.0017309797317995841,
"std_intvty": 577.7075153620468,
"p99_tpot": 0.03262463249501408,
"p99_intvty": 30.651686272721292,
"mean_itl": 0.025385707519216687,
"median_itl": 0.019969038985436782,
"std_itl": 0.026318395158356062,
"p99_itl": 0.09190794019959866,
"mean_e2el": 23.575164543636063,
"median_e2el": 23.38487718749093,
"std_e2el": 2.420801068537152,
"p99_e2el": 32.063513461248625
},
{
"hw": "b200",
"conc": 64,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 634.0067710756045,
"output_tput_per_gpu": 316.9319368359403,
"input_tput_per_gpu": 317.07483423966426,
"mean_ttft": 7.22376137591832,
"median_ttft": 0.37949243550247047,
"std_ttft": 20.740413428655884,
"p99_ttft": 70.49232543275575,
"mean_tpot": 0.016875033929500426,
"mean_intvty": 59.259140110635876,
"median_tpot": 0.01690659974814581,
"median_intvty": 59.14849910075339,
"std_tpot": 0.0031217040798108392,
"std_intvty": 320.33785856492693,
"p99_tpot": 0.018047388284074196,
"p99_intvty": 55.4096794649475,
"mean_itl": 0.16803487778019416,
"median_itl": 0.1487200039846357,
"std_itl": 0.29149919278814873,
"p99_itl": 0.47566765745752493,
"mean_e2el": 22.770925787576562,
"median_e2el": 15.963349754514638,
"std_e2el": 21.0956728233861,
"p99_e2el": 87.7593521033763
},
{
"hw": "b200",
"conc": 8,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 182.06220418264795,
"output_tput_per_gpu": 91.27486468506198,
"input_tput_per_gpu": 90.78733949758596,
"mean_ttft": 0.16588181990533485,
"median_ttft": 0.14928729851089884,
"std_ttft": 0.09467247359957626,
"p99_ttft": 0.793000138529169,
"mean_tpot": 0.010705423357162165,
"mean_intvty": 93.41059822085205,
"median_tpot": 0.010717351393897706,
"median_intvty": 93.30663549665681,
"std_tpot": 0.00014529555709801375,
"std_intvty": 6882.522906914614,
"p99_tpot": 0.010999663943066394,
"p99_intvty": 90.91186832397248,
"mean_itl": 0.316213313779033,
"median_itl": 0.30522731100791134,
"std_itl": 0.04941966283408154,
"p99_itl": 0.4606577536062105,
"mean_e2el": 10.072054259582728,
"median_e2el": 10.05200670848717,
"std_e2el": 0.6347270474702321,
"p99_e2el": 11.211421411118353
},
{
"hw": "mi300x",
"conc": 64,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 371.29733532210605,
"output_tput_per_gpu": 185.60682471264326,
"input_tput_per_gpu": 185.6905106094628,
"mean_ttft": 0.5166783945925999,
"median_ttft": 0.1663797835353762,
"std_ttft": 1.0318921658323659,
"p99_ttft": 4.326555775185116,
"mean_tpot": 0.04136769045838726,
"mean_intvty": 24.173454909355495,
"median_tpot": 0.04220044935812309,
"median_intvty": 23.69643013783482,
"std_tpot": 0.0024437828806327822,
"std_intvty": 409.20165532097695,
"p99_tpot": 0.04392209482570136,
"p99_intvty": 22.76758437793914,
"mean_itl": 0.04139217279887544,
"median_itl": 0.03621176490560174,
"std_itl": 0.026748115937057938,
"p99_itl": 0.13387840744107965,
"mean_e2el": 38.63117301587918,
"median_e2el": 38.84609429119155,
"std_e2el": 3.262300791385069,
"p99_e2el": 44.6767261645291
},
{
"hw": "gb200",
"conc": 512,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "mtp",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 32,
"decode_ep": 32,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 4,
"num_decode_gpu": 32,
"tput_per_gpu": 2412.1309327995714,
"output_tput_per_gpu": 1357.0583904820062,
"input_tput_per_gpu": 10852.711271340095,
"mean_ttft": 1.2348787148280849,
"median_ttft": 0.4148194439476356,
"std_ttft": 2.200366709393639,
"p99_ttft": 9.825874333607498,
"mean_tpot": 0.009670138642223663,
"mean_intvty": 103.41113369704993,
"median_tpot": 0.009650568658963625,
"median_intvty": 103.62083679609717,
"std_tpot": 0.0011122349812344497,
"std_intvty": 899.0905850579505,
"p99_tpot": 0.012485798162104612,
"p99_intvty": 80.09099514639595,
"mean_itl": 0.4434240914943136,
"median_itl": 0.44113103090785444,
"std_itl": 0.12849161349339594,
"p99_itl": 0.6888551347143952,
"mean_e2el": 10.132914845539517,
"median_e2el": 9.352651417022571,
"std_e2el": 2.837751493419591,
"p99_e2el": 20.733681057707873
},
{
"hw": "mi325x",
"conc": 64,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 411.88752912578764,
"output_tput_per_gpu": 205.8973473468454,
"input_tput_per_gpu": 205.99018177894223,
"mean_ttft": 0.4331379817704146,
"median_ttft": 0.14366331044584513,
"std_ttft": 0.8443757891982799,
"p99_ttft": 3.6977680043829606,
"mean_tpot": 0.037211900335084536,
"mean_intvty": 26.87312367805008,
"median_tpot": 0.03779024314733602,
"median_intvty": 26.46185673114659,
"std_tpot": 0.001843556770021934,
"std_intvty": 542.4297294561221,
"p99_tpot": 0.03946798171097013,
"p99_intvty": 25.336993599600508,
"mean_itl": 0.037236205271839656,
"median_itl": 0.032972196117043495,
"std_itl": 0.019487043197584968,
"p99_itl": 0.11140763387084021,
"mean_e2el": 34.72075919606213,
"median_e2el": 34.903472155099735,
"std_e2el": 2.8688129703284284,
"p99_e2el": 40.089818246290086
},
{
"hw": "b200-trt",
"conc": 16,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 316.15706593132177,
"output_tput_per_gpu": 157.2415879149002,
"input_tput_per_gpu": 158.91547801642156,
"mean_ttft": 0.19592693944960046,
"median_ttft": 0.11224943700653967,
"std_ttft": 0.27968003475038467,
"p99_ttft": 1.481824649146001,
"mean_tpot": 0.012260266909942353,
"mean_intvty": 81.5642927960287,
"median_tpot": 0.012354373190594074,
"median_intvty": 80.94299763919581,
"std_tpot": 0.00044672944415771537,
"std_intvty": 2238.4913577511034,
"p99_tpot": 0.012946660824161195,
"p99_intvty": 77.23999366182433,
"mean_itl": 0.12192719178198376,
"median_itl": 0.11135585900046863,
"std_itl": 0.036943895333217014,
"p99_itl": 0.2694482011673972,
"mean_e2el": 11.417038633462472,
"median_e2el": 11.404057557490887,
"std_e2el": 0.9445595875237522,
"p99_e2el": 13.646970738812815
},
{
"hw": "b200-trt",
"conc": 4,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 65.83950111446778,
"output_tput_per_gpu": 32.75956255216392,
"input_tput_per_gpu": 33.07993856230386,
"mean_ttft": 0.2134733396473166,
"median_ttft": 0.16361602349206805,
"std_ttft": 0.15198227124610095,
"p99_ttft": 0.786172540151747,
"mean_tpot": 0.014585688703470318,
"mean_intvty": 68.56035531336096,
"median_tpot": 0.014609090837043578,
"median_intvty": 68.45052927348138,
"std_tpot": 0.00021310625289578373,
"std_intvty": 4692.494877140158,
"p99_tpot": 0.014859207445867465,
"p99_intvty": 67.29834034843579,
"mean_itl": 0.1449061915521784,
"median_itl": 0.1421968379872851,
"std_itl": 0.024724391368815397,
"p99_itl": 0.2652863415918546,
"mean_e2el": 13.584691922573256,
"median_e2el": 13.677898031499353,
"std_e2el": 0.9525622134625409,
"p99_e2el": 15.508266156367027
},
{
"hw": "h200",
"conc": 4,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 82.20657950322091,
"output_tput_per_gpu": 40.90328052080915,
"input_tput_per_gpu": 41.30329898241176,
"mean_ttft": 0.19577070961240678,
"median_ttft": 0.1558728776872158,
"std_ttft": 0.11544510834579869,
"p99_ttft": 0.5914431960135699,
"mean_tpot": 0.011667913389199234,
"mean_intvty": 85.70512709887623,
"median_tpot": 0.011701949067385746,
"median_intvty": 85.45584964021752,
"std_tpot": 0.00023957319327288688,
"std_intvty": 4174.08970652633,
"p99_tpot": 0.011933998961290833,
"p99_intvty": 83.79420873452429,
"mean_itl": 0.11614621101057483,
"median_itl": 0.11302625387907028,
"std_itl": 0.02490280305721305,
"p99_itl": 0.24921410814858974,
"mean_e2el": 10.892836503894069,
"median_e2el": 10.968858200125396,
"std_e2el": 0.7894751679459576,
"p99_e2el": 12.335861399509012
},
{
"hw": "mi300x",
"conc": 4,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 44.84628442114787,
"output_tput_per_gpu": 22.314030860782047,
"input_tput_per_gpu": 22.532253560365824,
"mean_ttft": 0.2565912339836359,
"median_ttft": 0.1764916297979653,
"std_ttft": 0.28480466485570477,
"p99_ttft": 1.2572721256315709,
"mean_tpot": 0.02142861385915102,
"mean_intvty": 46.6665742624763,
"median_tpot": 0.021444322848459846,
"median_intvty": 46.632388770989856,
"std_tpot": 0.00022878034763899524,
"std_intvty": 4371.004810159453,
"p99_tpot": 0.02210502082906931,
"p99_intvty": 45.23859116590134,
"mean_itl": 0.021428176971848484,
"median_itl": 0.02100955881178379,
"std_itl": 0.008492415864153207,
"p99_itl": 0.021389827877283097,
"mean_e2el": 19.900336326356047,
"median_e2el": 20.007334575988352,
"std_e2el": 1.4051264818491935,
"p99_e2el": 22.784044573558493
},
{
"hw": "mi355x",
"conc": 32,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 296.6379693181095,
"output_tput_per_gpu": 148.18366482925754,
"input_tput_per_gpu": 148.454304488852,
"mean_ttft": 0.13541241095852455,
"median_ttft": 0.11120598450361285,
"std_ttft": 0.14229346180085098,
"p99_ttft": 1.139467287303123,
"mean_tpot": 0.026669320744572032,
"mean_intvty": 37.49626807437638,
"median_tpot": 0.026734878875288692,
"median_intvty": 37.404321323643984,
"std_tpot": 0.0003854001607907831,
"std_intvty": 2594.705715607774,
"p99_tpot": 0.027109898874429227,
"p99_intvty": 36.886895249293104,
"mean_itl": 0.02667614534533136,
"median_itl": 0.024693420011317357,
"std_itl": 0.009555510403225349,
"p99_itl": 0.07460135300061665,
"mean_e2el": 24.661993785221583,
"median_e2el": 24.6352058104967,
"std_e2el": 1.664611496660449,
"p99_e2el": 27.556570331401307
},
{
"hw": "mi355x",
"conc": 32,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "amd/DeepSeek-R1-0528-MXFP4-Preview",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 403.98555109201277,
"output_tput_per_gpu": 201.80848606971315,
"input_tput_per_gpu": 202.17706502229962,
"mean_ttft": 0.16468010068520017,
"median_ttft": 0.10207451799942646,
"std_ttft": 0.22698254128855616,
"p99_ttft": 1.6246460357803152,
"mean_tpot": 0.01951736207539238,
"mean_intvty": 51.23643226667433,
"median_tpot": 0.019135252145820007,
"median_intvty": 52.25956744021503,
"std_tpot": 0.0010971724550971225,
"std_intvty": 911.4337453098741,
"p99_tpot": 0.023190558551986772,
"p99_intvty": 43.120996752116966,
"mean_itl": 0.019526418013656087,
"median_itl": 0.01661322099971585,
"std_itl": 0.02090876477857112,
"p99_itl": 0.08577526139852125,
"mean_e2el": 18.119941349035706,
"median_e2el": 17.978715368997655,
"std_e2el": 1.704819452919009,
"p99_e2el": 23.524846079876298
},
{
"hw": "mi355x",
"conc": 4,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "amd/DeepSeek-R1-0528-MXFP4-Preview",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 75.55383157238532,
"output_tput_per_gpu": 37.753851767014105,
"input_tput_per_gpu": 37.799979805371215,
"mean_ttft": 0.1950170337059535,
"median_ttft": 0.18358874150726479,
"std_ttft": 0.17356631750535162,
"p99_ttft": 0.8757589224123464,
"mean_tpot": 0.012947859907705438,
"mean_intvty": 77.23284057196875,
"median_tpot": 0.012902648840497953,
"median_intvty": 77.50346555672105,
"std_tpot": 0.0002587035934702906,
"std_intvty": 3865.4275597251785,
"p99_tpot": 0.013931271001429813,
"p99_intvty": 71.78095953322325,
"mean_itl": 0.012948304819368695,
"median_itl": 0.012470546003896743,
"std_itl": 0.010048984615135668,
"p99_itl": 0.012806082313181836,
"mean_e2el": 12.104414425215946,
"median_e2el": 12.07736179100175,
"std_e2el": 0.863903277527204,
"p99_e2el": 13.789736369127054
},
{
"hw": "b200-trt",
"conc": 64,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 853.7762593914858,
"output_tput_per_gpu": 426.79191431099144,
"input_tput_per_gpu": 426.9843450804944,
"mean_ttft": 0.3851088558010815,
"median_ttft": 0.09621681412681937,
"std_ttft": 0.8822966022008791,
"p99_ttft": 4.566773009821774,
"mean_tpot": 0.01784122875622258,
"mean_intvty": 56.049951136421846,
"median_tpot": 0.01801663246082011,
"median_intvty": 55.504268190776,
"std_tpot": 0.0009890226694180047,
"std_intvty": 1011.0991698384981,
"p99_tpot": 0.01990115349596598,
"p99_intvty": 50.24834365519078,
"mean_itl": 0.17756355077926556,
"median_itl": 0.1462849578820169,
"std_itl": 0.06702514515133351,
"p99_itl": 0.423486899472773,
"mean_e2el": 16.831100255452476,
"median_e2el": 16.804378338623792,
"std_e2el": 1.8351830520332975,
"p99_e2el": 22.284333304809408
},
{
"hw": "gb200",
"conc": 512,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "mtp",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 16,
"decode_ep": 16,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 4,
"num_decode_gpu": 16,
"tput_per_gpu": 3803.0838216335337,
"output_tput_per_gpu": 2377.338614987014,
"input_tput_per_gpu": 9506.064648219613,
"mean_ttft": 1.0884636400741101,
"median_ttft": 0.32718132459558547,
"std_ttft": 2.091629423193156,
"p99_ttft": 9.353588376671542,
"mean_tpot": 0.011509891609823639,
"mean_intvty": 86.88179123654864,
"median_tpot": 0.011632329475121321,
"median_intvty": 85.96730363756916,
"std_tpot": 0.0010237054679599301,
"std_intvty": 976.8434684566344,
"p99_tpot": 0.013400882430482888,
"p99_intvty": 74.62195159068834,
"mean_itl": 0.5264395102407845,
"median_itl": 0.540485356003046,
"std_itl": 0.07010396636026237,
"p99_itl": 0.5593632255285047,
"mean_e2el": 11.679974086736138,
"median_e2el": 11.172180174500681,
"std_e2el": 2.3721075879442175,
"p99_e2el": 20.32006769160507
},
{
"hw": "gb200",
"conc": 1075,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "mtp",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 16,
"decode_ep": 16,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 4,
"num_decode_gpu": 16,
"tput_per_gpu": 5604.6687381548945,
"output_tput_per_gpu": 3503.4906576192484,
"input_tput_per_gpu": 14009.381060297479,
"mean_ttft": 2.929410582676714,
"median_ttft": 1.9941336774500087,
"std_ttft": 2.9508642050011513,
"p99_ttft": 15.882445676790084,
"mean_tpot": 0.014985299876658552,
"mean_intvty": 66.7320646387346,
"median_tpot": 0.015225681108687592,
"median_intvty": 65.67850678479086,
"std_tpot": 0.0015733188805919138,
"std_intvty": 635.5990589929106,
"p99_tpot": 0.017748282728609687,
"p99_intvty": 56.34347927013979,
"mean_itl": 0.6921651997490522,
"median_itl": 0.709355709142983,
"std_itl": 0.10932932913412986,
"p99_itl": 0.9747483725799247,
"mean_e2el": 16.722411785457382,
"median_e2el": 16.128651749459095,
"std_e2el": 3.356755748706387,
"p99_e2el": 30.195009535551073
},
{
"hw": "mi355x",
"conc": 16,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "amd/DeepSeek-R1-0528-MXFP4-Preview",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 244.90560535733977,
"output_tput_per_gpu": 122.24259417377776,
"input_tput_per_gpu": 122.663011183562,
"mean_ttft": 0.16389791450812483,
"median_ttft": 0.09834074550599325,
"std_ttft": 0.1903269023558845,
"p99_ttft": 1.3221142903674628,
"mean_tpot": 0.016047165010289827,
"mean_intvty": 62.316303182448486,
"median_tpot": 0.01585998137675952,
"median_intvty": 63.051776433064006,
"std_tpot": 0.0006524999791487008,
"std_intvty": 1532.5670987831652,
"p99_tpot": 0.01831432546441761,
"p99_intvty": 54.602065576636825,
"mean_itl": 0.01604871906259868,
"median_itl": 0.014374136982951313,
"std_itl": 0.016872885345090353,
"p99_itl": 0.08327452082856326,
"mean_e2el": 14.922861486079965,
"median_e2el": 14.877197617010097,
"std_e2el": 1.1742886982132807,
"p99_e2el": 18.128650223699513
},
{
"hw": "mi355x",
"conc": 4,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "amd/DeepSeek-R1-0528-MXFP4-Preview",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 127.09445272502897,
"output_tput_per_gpu": 63.50842874187588,
"input_tput_per_gpu": 63.58602398315309,
"mean_ttft": 0.22369928099025857,
"median_ttft": 0.20476960801170208,
"std_ttft": 0.25455547776659093,
"p99_ttft": 1.3235857760004002,
"mean_tpot": 0.015406933129963014,
"mean_intvty": 64.90584411346768,
"median_tpot": 0.015378344757907196,
"median_intvty": 65.02650420070877,
"std_tpot": 0.00029232644196876235,
"std_intvty": 3420.833207099544,
"p99_tpot": 0.016509211088623917,
"p99_intvty": 60.572246283111305,
"mean_itl": 0.015407714691845254,
"median_itl": 0.014949633012292907,
"std_itl": 0.010987243014788517,
"p99_itl": 0.015541010827291757,
"mean_e2el": 14.395175752249342,
"median_e2el": 14.387992981515708,
"std_e2el": 1.024045564918163,
"p99_e2el": 16.45127784504089
},
{
"hw": "mi355x",
"conc": 4,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 53.50162023190284,
"output_tput_per_gpu": 26.734477888062617,
"input_tput_per_gpu": 26.767142343840224,
"mean_ttft": 0.10254724691141746,
"median_ttft": 0.09931363150826655,
"std_ttft": 0.020107927751740414,
"p99_ttft": 0.23979883406020236,
"mean_tpot": 0.01846855797035394,
"mean_intvty": 54.14607906070512,
"median_tpot": 0.018474835274947463,
"median_intvty": 54.12768152558501,
"std_tpot": 7.746174410811525e-05,
"std_intvty": 12909.59829931373,
"p99_tpot": 0.01859121090967565,
"p99_intvty": 53.78885780267049,
"mean_itl": 0.018468433167693556,
"median_itl": 0.01827680302085355,
"std_itl": 0.0028933581315699784,
"p99_itl": 0.01859799599274993,
"mean_e2el": 17.089165465495462,
"median_e2el": 17.158865097997477,
"std_e2el": 1.1077995214102774,
"p99_e2el": 18.995283529475564
},
{
"hw": "b200",
"conc": 64,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 1096.2822242721327,
"output_tput_per_gpu": 548.0175678059866,
"input_tput_per_gpu": 548.2646564661461,
"mean_ttft": 7.17844623914525,
"median_ttft": 0.4653839475067798,
"std_ttft": 20.386307459943882,
"p99_ttft": 69.42139958190266,
"mean_tpot": 0.020794494405061315,
"mean_intvty": 48.08965202619224,
"median_tpot": 0.020746625585331274,
"median_intvty": 48.20060958284424,
"std_tpot": 0.0028751198593529735,
"std_intvty": 347.81158661852913,
"p99_tpot": 0.021941996993297568,
"p99_intvty": 45.57470317334659,
"mean_itl": 0.2071506984134482,
"median_itl": 0.18764059199020267,
"std_itl": 0.28859335818358,
"p99_itl": 0.5049551923532272,
"mean_e2el": 26.344740840698705,
"median_e2el": 19.711951009492623,
"std_e2el": 20.784469584212715,
"p99_e2el": 90.65216694450035
},
{
"hw": "h200",
"conc": 16,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 218.9619801747013,
"output_tput_per_gpu": 108.90134419181408,
"input_tput_per_gpu": 110.06063598288722,
"mean_ttft": 0.2250255027145613,
"median_ttft": 0.16274974308907986,
"std_ttft": 0.16713294232046957,
"p99_ttft": 0.7348776398133486,
"mean_tpot": 0.017778192618412375,
"mean_intvty": 56.24868744893269,
"median_tpot": 0.017953584201255614,
"median_intvty": 55.69918456338446,
"std_tpot": 0.000767799496961416,
"std_intvty": 1302.4233591680156,
"p99_tpot": 0.01863887407773945,
"p99_intvty": 53.65130939933263,
"mean_itl": 0.1770341265797833,
"median_itl": 0.15951490867882967,
"std_itl": 0.05625378081156495,
"p99_itl": 0.42379419051110745,
"mean_e2el": 16.499993851641193,
"median_e2el": 16.441055288538337,
"std_e2el": 1.38702436282795,
"p99_e2el": 18.874534588856623
},
{
"hw": "b200-trt",
"conc": 4,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 130.39207323866637,
"output_tput_per_gpu": 64.87879171717749,
"input_tput_per_gpu": 65.51328152148888,
"mean_ttft": 0.10384224493755027,
"median_ttft": 0.08351057092659175,
"std_ttft": 0.059050758293203054,
"p99_ttft": 0.3299985429877415,
"mean_tpot": 0.007367507908830015,
"mean_intvty": 135.73110641679696,
"median_tpot": 0.007388820195880784,
"median_intvty": 135.33960408963438,
"std_tpot": 0.00010658819974353541,
"std_intvty": 9381.90158391009,
"p99_tpot": 0.0075063634329180055,
"p99_intvty": 133.22030153971141,
"mean_itl": 0.07320066111980335,
"median_itl": 0.07186384731903672,
"std_itl": 0.011997052549359141,
"p99_itl": 0.13464706712402402,
"mean_e2el": 6.8584330716519615,
"median_e2el": 6.9107868985738605,
"std_e2el": 0.4856356363968179,
"p99_e2el": 7.7363094548135996
},
{
"hw": "b200-trt",
"conc": 32,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 524.4126827001851,
"output_tput_per_gpu": 262.6217119264341,
"input_tput_per_gpu": 261.790970773751,
"mean_ttft": 0.28176771416274277,
"median_ttft": 0.10523877049854491,
"std_ttft": 0.5514158412173853,
"p99_ttft": 2.8611615708237514,
"mean_tpot": 0.014580355019460913,
"mean_intvty": 68.58543558543428,
"median_tpot": 0.014660147404230436,
"median_intvty": 68.21213814749454,
"std_tpot": 0.0006148540724014421,
"std_intvty": 1626.4021739244392,
"p99_tpot": 0.015716656868408684,
"p99_intvty": 63.626762890653495,
"mean_itl": 0.14502292385488288,
"median_itl": 0.12457800451375078,
"std_itl": 0.05289910028839434,
"p99_itl": 0.33641460732236733,
"mean_e2el": 13.741707663203396,
"median_e2el": 13.719763337488985,
"std_e2el": 1.291835065639702,
"p99_e2el": 17.680752003692323
},
{
"hw": "mi325x",
"conc": 8,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 83.30705438438346,
"output_tput_per_gpu": 41.80793180539006,
"input_tput_per_gpu": 41.4991225789934,
"mean_ttft": 0.1667088986840099,
"median_ttft": 0.12355327396653593,
"std_ttft": 0.16357438312889708,
"p99_ttft": 0.9656905179191381,
"mean_tpot": 0.02320572207020305,
"mean_intvty": 43.09281982154025,
"median_tpot": 0.023210967987731892,
"median_intvty": 43.08308040097888,
"std_tpot": 0.0002524520409704655,
"std_intvty": 3961.1484072612056,
"p99_tpot": 0.023800257855761127,
"p99_intvty": 42.01635150595389,
"mean_itl": 0.02321012615910169,
"median_itl": 0.02265459089539945,
"std_itl": 0.007624513065071387,
"p99_itl": 0.05260278491768986,
"mean_e2el": 21.70744649678818,
"median_e2el": 21.906510022003204,
"std_e2el": 1.4592142843028952,
"p99_e2el": 24.032168783559463
},
{
"hw": "b200-trt",
"conc": 256,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "true",
"tput_per_gpu": 3435.18361133009,
"output_tput_per_gpu": 1718.4222304210036,
"input_tput_per_gpu": 1716.7613809090865,
"mean_ttft": 1.1643840026619727,
"median_ttft": 0.8834498869255185,
"std_ttft": 1.1624854838093355,
"p99_ttft": 5.811972907194868,
"mean_tpot": 0.03447258798480657,
"mean_intvty": 29.008556028364904,
"median_tpot": 0.034761480082379614,
"median_intvty": 28.76747473439412,
"std_tpot": 0.0011874833520894386,
"std_intvty": 842.1170690439138,
"p99_tpot": 0.03600821591107884,
"p99_intvty": 27.771439786671703,
"mean_itl": 0.3428813619008422,
"median_itl": 0.30835804296657443,
"std_itl": 0.07041792406889948,
"p99_itl": 0.6002769273146986,
"mean_e2el": 32.929797214319116,
"median_e2el": 32.91153870499693,
"std_e2el": 2.7731514659221888,
"p99_e2el": 39.17268681048881
},
{
"hw": "b200",
"conc": 8,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 365.5228885167477,
"output_tput_per_gpu": 183.4389189408438,
"input_tput_per_gpu": 182.0839695759039,
"mean_ttft": 0.21978793078742456,
"median_ttft": 0.16096528299385682,
"std_ttft": 0.1827035024344528,
"p99_ttft": 0.7498866373926285,
"mean_tpot": 0.010447272835718634,
"mean_intvty": 95.71875988354172,
"median_tpot": 0.01041591579162534,
"median_intvty": 96.00692056324277,
"std_tpot": 0.00035011992456693975,
"std_intvty": 2856.1642164092523,
"p99_tpot": 0.011324179912974196,
"p99_intvty": 88.30661537391265,
"mean_itl": 0.10398533259301215,
"median_itl": 0.09850755799561739,
"std_itl": 0.03151717884306649,
"p99_itl": 0.1843689631097369,
"mean_e2el": 9.919019663974177,
"median_e2el": 9.89528558299935,
"std_e2el": 0.7666360249072773,
"p99_e2el": 11.565644675162329
},
{
"hw": "b200-trt",
"conc": 128,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 1125.7129527716302,
"output_tput_per_gpu": 562.2389021650472,
"input_tput_per_gpu": 563.474050606583,
"mean_ttft": 0.7920400589951896,
"median_ttft": 0.1966307605180191,
"std_ttft": 2.0921973819547266,
"p99_ttft": 10.783431073446994,
"mean_tpot": 0.026992523938468523,
"mean_intvty": 37.04729510584393,
"median_tpot": 0.027415059531668028,
"median_intvty": 36.47630233466637,
"std_tpot": 0.0021903253262609518,
"std_intvty": 456.55318322373347,
"p99_tpot": 0.031001554485203708,
"p99_intvty": 32.25644702678621,
"mean_itl": 0.2686317336462882,
"median_itl": 0.1916247960034525,
"std_itl": 0.12590846221249627,
"p99_itl": 0.7419374693455757,
"mean_e2el": 25.625784449898287,
"median_e2el": 25.4115537799953,
"std_e2el": 3.5966496141912545,
"p99_e2el": 37.37643152206612
},
{
"hw": "mi355x",
"conc": 64,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "amd/DeepSeek-R1-0528-MXFP4-Preview",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 817.5906995985575,
"output_tput_per_gpu": 408.9344372523261,
"input_tput_per_gpu": 408.6562623462314,
"mean_ttft": 0.26780416145786146,
"median_ttft": 0.14544924450456165,
"std_ttft": 0.6281365506682745,
"p99_ttft": 4.384220488306019,
"mean_tpot": 0.03861481596860886,
"mean_intvty": 25.89679569657745,
"median_tpot": 0.038087981672516535,
"median_intvty": 26.255001081393043,
"std_tpot": 0.0021292654234834345,
"std_intvty": 469.64553548426136,
"p99_tpot": 0.046742885398299544,
"p99_intvty": 21.393630099616807,
"mean_itl": 0.038622826603984264,
"median_itl": 0.03207420802209526,
"std_itl": 0.027712658850288827,
"p99_itl": 0.1306680386900555,
"mean_e2el": 35.83653053449152,
"median_e2el": 35.64918338600546,
"std_e2el": 3.283118759795647,
"p99_e2el": 46.87365877556817
},
{
"hw": "b200-trt",
"conc": 64,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "true",
"tput_per_gpu": 515.684617321858,
"output_tput_per_gpu": 257.78419414518766,
"input_tput_per_gpu": 257.90042317667036,
"mean_ttft": 1.6282350063396733,
"median_ttft": 1.4507918141316622,
"std_ttft": 1.0682361391017958,
"p99_ttft": 4.793210491444916,
"mean_tpot": 0.027976202421423374,
"mean_intvty": 35.74466558885879,
"median_tpot": 0.02801925929453371,
"median_intvty": 35.689737172855615,
"std_tpot": 0.00040506612311662317,
"std_intvty": 2468.732739992894,
"p99_tpot": 0.02864856500094191,
"p99_intvty": 34.90576229445077,
"mean_itl": 0.27817329930642665,
"median_itl": 0.2683784537948668,
"std_itl": 0.038889263577688474,
"p99_itl": 0.4092893096618354,
"mean_e2el": 27.392732787023853,
"median_e2el": 27.443264302331954,
"std_e2el": 1.9662657146460998,
"p99_e2el": 32.067206466505304
},
{
"hw": "mi355x",
"conc": 64,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 506.58988065776214,
"output_tput_per_gpu": 253.38112073219884,
"input_tput_per_gpu": 253.2087599255633,
"mean_ttft": 0.15677339283993205,
"median_ttft": 0.11557232451741584,
"std_ttft": 0.24376142183145486,
"p99_ttft": 1.55091424578859,
"mean_tpot": 0.031155865704754086,
"mean_intvty": 32.09668476159241,
"median_tpot": 0.03126428169717604,
"median_intvty": 31.985382222625173,
"std_tpot": 0.0006454958370875492,
"std_intvty": 1549.196667962971,
"p99_tpot": 0.03177064982995281,
"p99_intvty": 31.475591634176066,
"mean_itl": 0.03116330467903974,
"median_itl": 0.027022671012673527,
"std_itl": 0.013900402869031152,
"p99_itl": 0.08222846371354539,
"mean_e2el": 28.854641751212547,
"median_e2el": 28.862312199518783,
"std_e2el": 1.987083004970599,
"p99_e2el": 32.30646628647577
},
{
"hw": "b200",
"conc": 16,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 345.121590316537,
"output_tput_per_gpu": 171.64717392992364,
"input_tput_per_gpu": 173.47441638661334,
"mean_ttft": 0.4287117916886928,
"median_ttft": 0.29100034397561103,
"std_ttft": 0.3754468470905531,
"p99_ttft": 1.449857907204423,
"mean_tpot": 0.010927759724302835,
"mean_intvty": 91.51006475517997,
"median_tpot": 0.010944936333709682,
"median_intvty": 91.36645198383347,
"std_tpot": 0.0004429086785558657,
"std_intvty": 2257.801773631009,
"p99_tpot": 0.011884931081937746,
"p99_intvty": 84.14015976245423,
"mean_itl": 0.10880163688398721,
"median_itl": 0.09833925892598927,
"std_itl": 0.04595109893382239,
"p99_itl": 0.20008037610910834,
"mean_e2el": 10.430982121020497,
"median_e2el": 10.334178404998966,
"std_e2el": 0.8883035576179695,
"p99_e2el": 12.850713363841642
},
{
"hw": "b200",
"conc": 16,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 591.6975315742928,
"output_tput_per_gpu": 294.28239775694084,
"input_tput_per_gpu": 297.415133817352,
"mean_ttft": 0.4090025126695764,
"median_ttft": 0.32702109598903917,
"std_ttft": 0.2800616933864141,
"p99_ttft": 1.1169804487048531,
"mean_tpot": 0.012855804366015569,
"mean_intvty": 77.78587566589833,
"median_tpot": 0.01288727161423733,
"median_intvty": 77.59594349631314,
"std_tpot": 0.0004046021148751569,
"std_intvty": 2471.563946986678,
"p99_tpot": 0.013792283890053094,
"p99_intvty": 72.50430805888455,
"mean_itl": 0.12799021825366264,
"median_itl": 0.11939197301398963,
"std_itl": 0.04085137925509727,
"p99_itl": 0.2096828156861011,
"mean_e2el": 12.175303138899835,
"median_e2el": 12.111572517009336,
"std_e2el": 0.9588628085163959,
"p99_e2el": 14.376447648540488
},
{
"hw": "b200",
"conc": 32,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 376.08593596306514,
"output_tput_per_gpu": 188.34085366799343,
"input_tput_per_gpu": 187.7450822950717,
"mean_ttft": 6.998798271215582,
"median_ttft": 0.33056944848794956,
"std_ttft": 20.38128105719959,
"p99_ttft": 69.39045825751091,
"mean_tpot": 0.013326483664183916,
"mean_intvty": 75.03854919265665,
"median_tpot": 0.013125619920767528,
"median_intvty": 76.18687772741208,
"std_tpot": 0.004215008131052788,
"std_intvty": 237.24746641241447,
"p99_tpot": 0.013970387824769476,
"p99_intvty": 71.57997419563411,
"mean_itl": 0.13259072985880746,
"median_itl": 0.1174843479966512,
"std_itl": 0.39869151662864727,
"p99_itl": 0.26701540267007656,
"mean_e2el": 19.297417002759357,
"median_e2el": 12.587621595492237,
"std_e2el": 20.635825312768564,
"p99_e2el": 82.8554250155951
},
{
"hw": "gb200",
"conc": 2252,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "mtp",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 4,
"num_decode_gpu": 8,
"tput_per_gpu": 9565.964844444936,
"output_tput_per_gpu": 7176.685588522945,
"input_tput_per_gpu": 14344.523356288919,
"mean_ttft": 5.374154983334451,
"median_ttft": 3.135583871626295,
"std_ttft": 6.370289718143169,
"p99_ttft": 32.83654104531736,
"mean_tpot": 0.03158022985201884,
"mean_intvty": 31.6653806728412,
"median_tpot": 0.0320205555867729,
"median_intvty": 31.22993907117219,
"std_tpot": 0.002056902606019846,
"std_intvty": 486.1678900465896,
"p99_tpot": 0.0344045670103347,
"p99_intvty": 29.0659085957865,
"mean_itl": 1.093646306347774,
"median_itl": 1.1256944275228307,
"std_itl": 0.17994813718568728,
"p99_itl": 1.5281442321743817,
"mean_e2el": 34.46553513161057,
"median_e2el": 32.975652318564244,
"std_e2el": 6.723949130269238,
"p99_e2el": 62.323754684999585
},
{
"hw": "h200",
"conc": 8,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 144.44045336262005,
"output_tput_per_gpu": 72.48793837147186,
"input_tput_per_gpu": 71.95251499114819,
"mean_ttft": 0.15794080712657887,
"median_ttft": 0.11560489097610116,
"std_ttft": 0.11723623617261816,
"p99_ttft": 0.5249040435464122,
"mean_tpot": 0.013352168637406769,
"mean_intvty": 74.89420087149362,
"median_tpot": 0.013362889827442888,
"median_intvty": 74.83411244971397,
"std_tpot": 0.00042764836222422857,
"std_intvty": 2338.3697643524956,
"p99_tpot": 0.014155767784631543,
"p99_intvty": 70.6425829537602,
"mean_itl": 0.13288864953492957,
"median_itl": 0.12755411153193563,
"std_itl": 0.026600275860857486,
"p99_itl": 0.22374155048746616,
"mean_e2el": 12.55312944536272,
"median_e2el": 12.541978692403063,
"std_e2el": 0.9164543979314959,
"p99_e2el": 14.549708433914928
},
{
"hw": "mi325x",
"conc": 32,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 265.85040830932485,
"output_tput_per_gpu": 133.13577579215888,
"input_tput_per_gpu": 132.71463251716597,
"mean_ttft": 0.3410089746554149,
"median_ttft": 0.1312808699440211,
"std_ttft": 0.6103937126254345,
"p99_ttft": 2.2063028247514738,
"mean_tpot": 0.0288512910442555,
"mean_intvty": 34.6604939954362,
"median_tpot": 0.029053521512887044,
"median_intvty": 34.41923553247884,
"std_tpot": 0.0008179312454917885,
"std_intvty": 1222.5966491825861,
"p99_tpot": 0.029924656885910495,
"p99_intvty": 33.41725867777059,
"mean_itl": 0.028863966021323222,
"median_itl": 0.026629538740962744,
"std_itl": 0.01149186789899497,
"p99_itl": 0.0748907681554556,
"mean_e2el": 26.973519575050158,
"median_e2el": 27.014708928065374,
"std_e2el": 2.015044186051719,
"p99_e2el": 31.37833488534205
},
{
"hw": "h200",
"conc": 16,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 188.60095504951445,
"output_tput_per_gpu": 93.801204685694,
"input_tput_per_gpu": 94.79975036382045,
"mean_ttft": 0.6688403705309611,
"median_ttft": 0.1129344068467617,
"std_ttft": 1.7241678215113387,
"p99_ttft": 7.4615381137467915,
"mean_tpot": 0.02018827097949587,
"mean_intvty": 49.53371197640677,
"median_tpot": 0.02006386798346753,
"median_intvty": 49.84083830814638,
"std_tpot": 0.0010003112701330579,
"std_intvty": 999.6888267258886,
"p99_tpot": 0.023235139422495135,
"p99_intvty": 43.038261222217955,
"mean_itl": 0.2007144330078277,
"median_itl": 0.19081574399024248,
"std_itl": 0.09424814280618914,
"p99_itl": 0.3050570430606604,
"mean_e2el": 19.140840336726978,
"median_e2el": 18.724031593184918,
"std_e2el": 2.415333818167487,
"p99_e2el": 26.955082109412178
},
{
"hw": "mi300x",
"conc": 16,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 145.9407901044831,
"output_tput_per_gpu": 72.58405409977149,
"input_tput_per_gpu": 73.35673600471162,
"mean_ttft": 0.28639606877113694,
"median_ttft": 0.15204137470573187,
"std_ttft": 0.3995107737380535,
"p99_ttft": 1.528022503554821,
"mean_tpot": 0.026627124813234943,
"mean_intvty": 37.555688307096254,
"median_tpot": 0.02677065003850398,
"median_intvty": 37.354341361218694,
"std_tpot": 0.0005422849115978797,
"std_intvty": 1844.0490941439461,
"p99_tpot": 0.02732710420093871,
"p99_intvty": 36.59370537935187,
"mean_itl": 0.026633314140908092,
"median_itl": 0.02516160160303116,
"std_itl": 0.012500358947689737,
"p99_itl": 0.12531440477818248,
"mean_e2el": 24.65304853579728,
"median_e2el": 24.512640615925193,
"std_e2el": 1.7986590725432812,
"p99_e2el": 28.2529873347003
},
{
"hw": "gb200",
"conc": 64,
"image": "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1-rc0.pre1",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 16,
"tput_per_gpu": 369.2547065443617,
"output_tput_per_gpu": 231.14978519935755,
"input_tput_per_gpu": 921.6743919243781,
"mean_ttft": 0.8270642481969844,
"median_ttft": 0.4514468389097601,
"std_ttft": 0.9972101875230466,
"p99_ttft": 4.402157481594477,
"mean_tpot": 0.01551686899186248,
"mean_intvty": 64.44599103881269,
"median_tpot": 0.015370686094954806,
"median_intvty": 65.0589045812493,
"std_tpot": 0.0006835531952533813,
"std_intvty": 1462.9439331774572,
"p99_tpot": 0.01698190977734623,
"p99_intvty": 58.88619201910932,
"mean_itl": 0.15433621767522684,
"median_itl": 0.1543394629843533,
"std_itl": 0.01465307230234735,
"p99_itl": 0.18508471085689962,
"mean_e2el": 15.14271294083519,
"median_e2el": 15.005144527531229,
"std_e2el": 1.4340534060719854,
"p99_e2el": 19.146446354161018
},
{
"hw": "gb200",
"conc": 16,
"image": "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1-rc0.pre1",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 16,
"tput_per_gpu": 142.42458138976733,
"output_tput_per_gpu": 89.34533260953395,
"input_tput_per_gpu": 354.74157651070084,
"mean_ttft": 0.43835763942915945,
"median_ttft": 0.3568364870734513,
"std_ttft": 0.26540060765389173,
"p99_ttft": 1.187165441869292,
"mean_tpot": 0.010141933251875089,
"mean_intvty": 98.60053060545583,
"median_tpot": 0.01014871645207556,
"median_intvty": 98.53462797213982,
"std_tpot": 0.00013018611906185802,
"std_intvty": 7681.310474620182,
"p99_tpot": 0.01047412996031159,
"p99_intvty": 95.47332368313019,
"mean_itl": 0.10093194153805742,
"median_itl": 0.10092792799696326,
"std_itl": 0.006848242090987182,
"p99_itl": 0.12125953753711656,
"mean_e2el": 9.852784376783529,
"median_e2el": 9.80505621805787,
"std_e2el": 0.7130732098334741,
"p99_e2el": 11.30290693300776
},
{
"hw": "gb200",
"conc": 8,
"image": "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1-rc0.pre1",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 16,
"tput_per_gpu": 73.46534526379665,
"output_tput_per_gpu": 45.6924136127911,
"input_tput_per_gpu": 184.55707186781882,
"mean_ttft": 0.39117412944906393,
"median_ttft": 0.3618117090081796,
"std_ttft": 0.14675603200311926,
"p99_ttft": 0.6161122514656745,
"mean_tpot": 0.010038821876919363,
"mean_intvty": 99.61328254056762,
"median_tpot": 0.010071348848853455,
"median_intvty": 99.2915661057498,
"std_tpot": 0.00013552325700286308,
"std_intvty": 7378.8073140750585,
"p99_tpot": 0.010216100507633495,
"p99_intvty": 97.88470652308065,
"mean_itl": 0.09994822494875504,
"median_itl": 0.1004420870449394,
"std_itl": 0.00584334918747307,
"p99_itl": 0.10443910209229217,
"mean_e2el": 9.596405503235292,
"median_e2el": 9.610710931476206,
"std_e2el": 0.6846882197889491,
"p99_e2el": 10.864912356187125
},
{
"hw": "gb200",
"conc": 2,
"image": "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1-rc0.pre1",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 16,
"tput_per_gpu": 19.16741824702025,
"output_tput_per_gpu": 12.022041042316383,
"input_tput_per_gpu": 47.74892706583572,
"mean_ttft": 0.2508541756309569,
"median_ttft": 0.2253928908612579,
"std_ttft": 0.05659605374956713,
"p99_ttft": 0.37547269876580686,
"mean_tpot": 0.009992066656203408,
"mean_intvty": 100.07939642588018,
"median_tpot": 0.009923142353868803,
"median_intvty": 100.77452931128447,
"std_tpot": 0.00015203545920880457,
"std_intvty": 6577.41296145004,
"p99_tpot": 0.010203375200618228,
"p99_intvty": 98.00678504299337,
"mean_itl": 0.0994283992160313,
"median_itl": 0.0991393809672445,
"std_itl": 0.005487413834778653,
"p99_itl": 0.10233251522295177,
"mean_e2el": 9.447980901552365,
"median_e2el": 9.347289866884239,
"std_e2el": 0.5026390952394821,
"p99_e2el": 10.416231317531784
},
{
"hw": "gb200",
"conc": 128,
"image": "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1-rc0.pre1",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 16,
"tput_per_gpu": 549.2274768221029,
"output_tput_per_gpu": 343.18980477415954,
"input_tput_per_gpu": 1373.378165013876,
"mean_ttft": 1.5116792790606268,
"median_ttft": 0.5904859320726246,
"std_ttft": 2.299484187479841,
"p99_ttft": 9.883723249626348,
"mean_tpot": 0.02010225303510214,
"mean_intvty": 49.74566772459886,
"median_tpot": 0.020274090305293722,
"median_intvty": 49.32403796874143,
"std_tpot": 0.0009383465250459292,
"std_intvty": 1065.704378189127,
"p99_tpot": 0.021193507613449555,
"p99_intvty": 47.18426124826042,
"mean_itl": 0.20003320885117806,
"median_itl": 0.202942225150764,
"std_itl": 0.01937618510530571,
"p99_itl": 0.22431977200787515,
"mean_e2el": 20.019439281575615,
"median_e2el": 19.382260965532623,
"std_e2el": 2.8887589999896885,
"p99_e2el": 29.421967148412953
},
{
"hw": "gb200",
"conc": 4,
"image": "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1-rc0.pre1",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 16,
"tput_per_gpu": 37.07832552156601,
"output_tput_per_gpu": 23.38776299236849,
"input_tput_per_gpu": 91.84057563835607,
"mean_ttft": 0.2848202320281416,
"median_ttft": 0.269202574971132,
"std_ttft": 0.07170948014447356,
"p99_ttft": 0.44365355880232527,
"mean_tpot": 0.009954524750513092,
"mean_intvty": 100.45682994042045,
"median_tpot": 0.009906810890410695,
"median_intvty": 100.94065699466927,
"std_tpot": 0.0001285214031148149,
"std_intvty": 7780.805187029026,
"p99_tpot": 0.010195688043672196,
"p99_intvty": 98.0806783923362,
"mean_itl": 0.0991256110620117,
"median_itl": 0.09905814949888736,
"std_itl": 0.00498408178976311,
"p99_itl": 0.10972817806759849,
"mean_e2el": 9.582802380016073,
"median_e2el": 9.586307910620235,
"std_e2el": 0.5309077801318303,
"p99_e2el": 10.413642256180756
},
{
"hw": "gb200",
"conc": 2150,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "mtp",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "true",
"prefill_num_workers": 2,
"decode_tp": 16,
"decode_ep": 16,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 8,
"num_decode_gpu": 16,
"tput_per_gpu": 7779.661522604285,
"output_tput_per_gpu": 5837.002387451842,
"input_tput_per_gpu": 11664.979792909171,
"mean_ttft": 2.2882813471733714,
"median_ttft": 1.019085165928118,
"std_ttft": 3.750244053695363,
"p99_ttft": 17.69374180560931,
"mean_tpot": 0.01936886569157753,
"mean_intvty": 51.62924953498159,
"median_tpot": 0.019454735662928227,
"median_intvty": 51.40136660430395,
"std_tpot": 0.001122439813793404,
"std_intvty": 890.9163660369409,
"p99_tpot": 0.021635754025400737,
"p99_intvty": 46.21978965124041,
"mean_itl": 0.6729626038479152,
"median_itl": 0.6757575224619359,
"std_itl": 0.20694689671667543,
"p99_itl": 1.3169609571690672,
"mean_e2el": 20.13102391845835,
"median_e2el": 19.213375322520733,
"std_e2el": 3.8795842324646017,
"p99_e2el": 35.45968775685409
},
{
"hw": "gb200",
"conc": 4096,
"image": "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1-rc0.pre1",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 2,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 16,
"num_decode_gpu": 32,
"tput_per_gpu": 4244.133401423087,
"output_tput_per_gpu": 3183.7980300916183,
"input_tput_per_gpu": 6364.804144086024,
"mean_ttft": 3.9048184817184164,
"median_ttft": 1.0373276659520343,
"std_ttft": 6.300515198226906,
"p99_ttft": 23.668318262118845,
"mean_tpot": 0.033368430059067054,
"mean_intvty": 29.968446169923254,
"median_tpot": 0.03439645704521005,
"median_intvty": 29.072761729082128,
"std_tpot": 0.0023285690282117276,
"std_intvty": 429.44829544863035,
"p99_tpot": 0.03564511262533135,
"p99_intvty": 28.05433694405964,
"mean_itl": 1.6253092467528782,
"median_itl": 1.699446365935728,
"std_itl": 0.27450068649372333,
"p99_itl": 1.9895442705275492,
"mean_e2el": 34.64720946440609,
"median_e2el": 33.17021796293557,
"std_e2el": 6.438682169857855,
"p99_e2el": 56.18803137874463
},
{
"hw": "h200",
"conc": 32,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 300.1205120578451,
"output_tput_per_gpu": 150.29797192363262,
"input_tput_per_gpu": 149.82254013421246,
"mean_ttft": 0.29685301923309454,
"median_ttft": 0.17625455418601632,
"std_ttft": 0.2927745539982455,
"p99_ttft": 1.1773992246761917,
"mean_tpot": 0.02580257027532431,
"mean_intvty": 38.75582894764274,
"median_tpot": 0.026118693609020506,
"median_intvty": 38.28675411447968,
"std_tpot": 0.0014320337154244552,
"std_intvty": 698.307581189595,
"p99_tpot": 0.027484693445275318,
"p99_intvty": 36.383887707938115,
"mean_itl": 0.25686865015909,
"median_itl": 0.22214747732505202,
"std_itl": 0.08631182474967977,
"p99_itl": 0.6101080379541953,
"mean_e2el": 24.123025546537246,
"median_e2el": 24.249714117031544,
"std_e2el": 2.1815695777601647,
"p99_e2el": 27.914561686739326
},
{
"hw": "gb200",
"conc": 4300,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 4,
"num_decode_gpu": 8,
"tput_per_gpu": 10174.958034288868,
"output_tput_per_gpu": 7628.433424529941,
"input_tput_per_gpu": 15268.007253806723,
"mean_ttft": 13.670401084246047,
"median_ttft": 9.438384462031536,
"std_ttft": 10.635169828625827,
"p99_ttft": 60.16970292568674,
"mean_tpot": 0.05262665971987734,
"mean_intvty": 19.001776007119357,
"median_tpot": 0.05425011406303042,
"median_intvty": 18.433140967006103,
"std_tpot": 0.004045703804077536,
"std_intvty": 247.1757816259638,
"p99_tpot": 0.05513972115032886,
"p99_intvty": 18.135746411804913,
"mean_itl": 1.0405354507394053,
"median_itl": 1.0769842360168695,
"std_itl": 0.27911524862124326,
"p99_itl": 1.8041910778265446,
"mean_e2el": 62.106449018242756,
"median_e2el": 60.163522047922015,
"std_e2el": 10.654577982872771,
"p99_e2el": 106.53037952416811
},
{
"hw": "b200-trt",
"conc": 8,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 378.2064753362504,
"output_tput_per_gpu": 189.8042206156671,
"input_tput_per_gpu": 188.40225472058327,
"mean_ttft": 0.12040181559859775,
"median_ttft": 0.08387166541069746,
"std_ttft": 0.1144210309287736,
"p99_ttft": 0.6233258621720598,
"mean_tpot": 0.01020989147846824,
"mean_intvty": 97.94423399199803,
"median_tpot": 0.010219971794660134,
"median_intvty": 97.84762816297528,
"std_tpot": 0.00023104219384646738,
"std_intvty": 4328.213748976614,
"p99_tpot": 0.01052341271480117,
"p99_intvty": 95.02620747673431,
"mean_itl": 0.10152367619137469,
"median_itl": 0.0983569361269474,
"std_itl": 0.017454745895662968,
"p99_itl": 0.16021439103409646,
"mean_e2el": 9.598905873141485,
"median_e2el": 9.645646893884987,
"std_e2el": 0.6771438655465893,
"p99_e2el": 10.754188930792735
},
{
"hw": "b200-trt",
"conc": 128,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 2114.1922786646155,
"output_tput_per_gpu": 1055.9362782453095,
"input_tput_per_gpu": 1058.256000419306,
"mean_ttft": 0.6532843066768692,
"median_ttft": 0.16898740082979202,
"std_ttft": 1.7107521341864345,
"p99_ttft": 8.835083985212261,
"mean_tpot": 0.028867022477270004,
"mean_intvty": 34.64160534005208,
"median_tpot": 0.029129463913691325,
"median_intvty": 34.329502354143344,
"std_tpot": 0.0016714122095771253,
"std_intvty": 598.29645509949,
"p99_tpot": 0.03221425186577414,
"p99_intvty": 31.042161220029595,
"mean_itl": 0.28714681447905144,
"median_itl": 0.24296033149585128,
"std_itl": 0.08444705281238274,
"p99_itl": 0.6248380776215343,
"mean_e2el": 27.198661110355896,
"median_e2el": 26.954308703076094,
"std_e2el": 3.157631187470903,
"p99_e2el": 37.42864547606092
},
{
"hw": "h200",
"conc": 8,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 117.32871414110747,
"output_tput_per_gpu": 58.88181878322487,
"input_tput_per_gpu": 58.446895357882596,
"mean_ttft": 0.29144007361028346,
"median_ttft": 0.10553608369082212,
"std_ttft": 0.5253226253779885,
"p99_ttft": 2.00246681207791,
"mean_tpot": 0.016347644074159503,
"mean_intvty": 61.17089382810128,
"median_tpot": 0.016349283292947513,
"median_intvty": 61.16476068595397,
"std_tpot": 0.0004821108343655462,
"std_intvty": 2074.21183827157,
"p99_tpot": 0.017890705160934297,
"p99_intvty": 55.89494606303028,
"mean_itl": 0.1625246226062859,
"median_itl": 0.15966126322746277,
"std_itl": 0.04237352606898844,
"p99_itl": 0.22143338106572627,
"mean_e2el": 15.465144980861805,
"median_e2el": 15.544434817507863,
"std_e2el": 1.2262121790632936,
"p99_e2el": 17.878225984070447
},
{
"hw": "b200",
"conc": 64,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 601.2893045180805,
"output_tput_per_gpu": 300.7469428036258,
"input_tput_per_gpu": 300.54236171445467,
"mean_ttft": 1.3847566446684505,
"median_ttft": 0.48940439452417195,
"std_ttft": 7.9552599064731675,
"p99_ttft": 74.83421286032069,
"mean_tpot": 0.024963507826510224,
"mean_intvty": 40.05847282960934,
"median_tpot": 0.024399550985048934,
"median_intvty": 40.98436076191566,
"std_tpot": 0.007326251005332097,
"std_intvty": 136.4954598569163,
"p99_tpot": 0.02635633873325382,
"p99_intvty": 37.9415369532453,
"mean_itl": 0.7371267286918883,
"median_itl": 0.7205090830102563,
"std_itl": 1.1897168318420006,
"p99_itl": 0.9564615071751177,
"mean_e2el": 24.362148402571258,
"median_e2el": 22.92119104496669,
"std_e2el": 10.507153438890304,
"p99_e2el": 97.91282886922126
},
{
"hw": "h200",
"conc": 4,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 69.00918094284734,
"output_tput_per_gpu": 34.33669061131495,
"input_tput_per_gpu": 34.67249033153239,
"mean_ttft": 0.31327792736701665,
"median_ttft": 0.12323521776124835,
"std_ttft": 0.5209406402374375,
"p99_ttft": 1.9304639371391386,
"mean_tpot": 0.013820446557735313,
"mean_intvty": 72.35656212861656,
"median_tpot": 0.013833377276012283,
"median_intvty": 72.28892699500405,
"std_tpot": 0.00028118418017168386,
"std_intvty": 3556.38784297689,
"p99_tpot": 0.014359761685012086,
"p99_intvty": 69.63903872052028,
"mean_itl": 0.13732033936355784,
"median_itl": 0.13542452454566956,
"std_itl": 0.02687858623785908,
"p99_itl": 0.22157836593687535,
"mean_e2el": 12.984511947724968,
"median_e2el": 13.041614898014814,
"std_e2el": 1.0617976522561416,
"p99_e2el": 15.736130541441963
},
{
"hw": "mi355x",
"conc": 8,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 98.6450892768569,
"output_tput_per_gpu": 49.45462028218766,
"input_tput_per_gpu": 49.19046899466923,
"mean_ttft": 0.10774247066983662,
"median_ttft": 0.10164526349399239,
"std_ttft": 0.033814203115657374,
"p99_ttft": 0.2408309916345737,
"mean_tpot": 0.0199555036377042,
"mean_intvty": 50.11148894837143,
"median_tpot": 0.019966499546151303,
"median_intvty": 50.08389165504765,
"std_tpot": 0.00017415392062817336,
"std_intvty": 5742.0470144628325,
"p99_tpot": 0.020283994796128715,
"p99_intvty": 49.29995348800101,
"mean_itl": 0.019956584902269463,
"median_itl": 0.019498449997627176,
"std_itl": 0.004403332083795464,
"p99_itl": 0.03609780419385057,
"mean_e2el": 18.573071370802428,
"median_e2el": 18.57070857749204,
"std_e2el": 1.14884731113673,
"p99_e2el": 20.572804990474832
},
{
"hw": "gb200",
"conc": 1075,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 32,
"decode_ep": 32,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 4,
"num_decode_gpu": 32,
"tput_per_gpu": 2792.6246801122975,
"output_tput_per_gpu": 1571.108202945853,
"input_tput_per_gpu": 12564.756497443854,
"mean_ttft": 1.837948133694739,
"median_ttft": 0.8739883714588359,
"std_ttft": 2.9986676734966005,
"p99_ttft": 14.633168410130311,
"mean_tpot": 0.018058968118651802,
"mean_intvty": 55.3741494768559,
"median_tpot": 0.018022274523237132,
"median_intvty": 55.48689199638169,
"std_tpot": 0.0002463139287413071,
"std_intvty": 4059.859729046248,
"p99_tpot": 0.0186554579039249,
"p99_intvty": 53.60361590425562,
"mean_itl": 0.35705229326821364,
"median_itl": 0.3583208420313895,
"std_itl": 0.09816686150576326,
"p99_itl": 0.6723199915746227,
"mean_e2el": 18.468073763782836,
"median_e2el": 17.743848473532125,
"std_e2el": 3.1805656680907304,
"p99_e2el": 31.626253874243712
},
{
"hw": "mi355x",
"conc": 8,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "amd/DeepSeek-R1-0528-MXFP4-Preview",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 203.81281760259895,
"output_tput_per_gpu": 102.17929323263384,
"input_tput_per_gpu": 101.63352436996512,
"mean_ttft": 0.21936451276749722,
"median_ttft": 0.12129646299581509,
"std_ttft": 0.32991483535045485,
"p99_ttft": 2.612228669367905,
"mean_tpot": 0.019218743683179437,
"mean_intvty": 52.03253742726256,
"median_tpot": 0.01922359228998554,
"median_intvty": 52.01941369308724,
"std_tpot": 0.0004241450355786217,
"std_intvty": 2357.684084727746,
"p99_tpot": 0.020383353072807277,
"p99_intvty": 49.0596417786662,
"mean_itl": 0.019222518044605737,
"median_itl": 0.018334227002924308,
"std_itl": 0.012679762219095702,
"p99_itl": 0.03890847842849324,
"mean_e2el": 18.005479660739876,
"median_e2el": 17.978295980996336,
"std_e2el": 1.2369716478427177,
"p99_e2el": 20.307550822279293
},
{
"hw": "b200",
"conc": 4,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 120.24509077939304,
"output_tput_per_gpu": 59.82998817274678,
"input_tput_per_gpu": 60.41510260664626,
"mean_ttft": 0.2297067509847693,
"median_ttft": 0.15984388394281268,
"std_ttft": 0.2251969081523954,
"p99_ttft": 0.962432309994474,
"mean_tpot": 0.007868964688318408,
"mean_intvty": 127.08152083647222,
"median_tpot": 0.00786640596167858,
"median_intvty": 127.12285697833653,
"std_tpot": 0.00015613741143868375,
"std_intvty": 6404.614952853288,
"p99_tpot": 0.008283681192286757,
"p99_intvty": 120.71927646505,
"mean_itl": 0.07833881631160375,
"median_itl": 0.07552170264534652,
"std_itl": 0.019457931635451328,
"p99_itl": 0.1676410868158564,
"mean_e2el": 7.444711529940832,
"median_e2el": 7.45253109629266,
"std_e2el": 0.6123301842447513,
"p99_e2el": 9.126274410281331
},
{
"hw": "b200",
"conc": 8,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 206.24243122047466,
"output_tput_per_gpu": 103.50347354809152,
"input_tput_per_gpu": 102.73895767238314,
"mean_ttft": 0.27022171078715473,
"median_ttft": 0.16239808546379209,
"std_ttft": 0.30342116897091187,
"p99_ttft": 1.1751088054012506,
"mean_tpot": 0.00916415701088168,
"mean_intvty": 109.1207842480855,
"median_tpot": 0.00915676768178727,
"median_intvty": 109.20884254702575,
"std_tpot": 0.00029198749857808394,
"std_intvty": 3424.804160691071,
"p99_tpot": 0.00999270824329895,
"p99_intvty": 100.07297077552465,
"mean_itl": 0.09120682995429168,
"median_itl": 0.08455537562258542,
"std_itl": 0.03277008963791707,
"p99_itl": 0.18098834120202809,
"mean_e2el": 8.777538600657135,
"median_e2el": 8.742960831848904,
"std_e2el": 0.6976562064184395,
"p99_e2el": 10.137662594267166
},
{
"hw": "b200",
"conc": 32,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 439.1887015694889,
"output_tput_per_gpu": 219.39400239211665,
"input_tput_per_gpu": 219.79469917737225,
"mean_ttft": 0.43220331492339026,
"median_ttft": 0.38396384846419096,
"std_ttft": 0.34183205058326116,
"p99_ttft": 2.513194294311106,
"mean_tpot": 0.01766089295007497,
"mean_intvty": 56.62227854655305,
"median_tpot": 0.017743388097921045,
"median_intvty": 56.35902199068553,
"std_tpot": 0.00048603618073616797,
"std_intvty": 2057.459999141142,
"p99_tpot": 0.018570088737930604,
"p99_intvty": 53.85003885078026,
"mean_itl": 0.5218189918077257,
"median_itl": 0.5429791060741991,
"std_itl": 0.09011961194626975,
"p99_itl": 0.6960355830844491,
"mean_e2el": 16.67577611648565,
"median_e2el": 16.635727596236393,
"std_e2el": 1.258674462950328,
"p99_e2el": 19.447309473971835
},
{
"hw": "mi300x",
"conc": 32,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 240.28322894738753,
"output_tput_per_gpu": 120.33193516307752,
"input_tput_per_gpu": 119.95129378431001,
"mean_ttft": 0.43875684653175995,
"median_ttft": 0.18919912353157997,
"std_ttft": 0.7096973220616996,
"p99_ttft": 2.6078403209522367,
"mean_tpot": 0.031945752959704254,
"mean_intvty": 31.303065583127132,
"median_tpot": 0.03234773830294439,
"median_intvty": 30.914062387755155,
"std_tpot": 0.0014172512424314118,
"std_intvty": 705.5911965788206,
"p99_tpot": 0.03365913397600829,
"p99_intvty": 29.709617624529034,
"mean_itl": 0.031964843734840896,
"median_itl": 0.028010625392198563,
"std_itl": 0.022002877340968285,
"p99_itl": 0.1586702957749367,
"mean_e2el": 29.932418106409024,
"median_e2el": 30.013407521881163,
"std_e2el": 2.425361545144931,
"p99_e2el": 34.3935982822068
},
{
"hw": "b200",
"conc": 4,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 221.12202772297672,
"output_tput_per_gpu": 110.02302229262172,
"input_tput_per_gpu": 111.099005430355,
"mean_ttft": 0.20701409892644734,
"median_ttft": 0.16666134190745652,
"std_ttft": 0.1535687122626908,
"p99_ttft": 0.7148557073948905,
"mean_tpot": 0.008610545537568204,
"mean_intvty": 116.13666005678203,
"median_tpot": 0.008633882844184435,
"median_intvty": 115.82274372341926,
"std_tpot": 0.00016951327722173728,
"std_intvty": 5899.242917072023,
"p99_tpot": 0.008966876285553389,
"p99_intvty": 111.52155646566783,
"mean_itl": 0.08571590155892198,
"median_itl": 0.08308944059535861,
"std_itl": 0.01901525995653771,
"p99_itl": 0.17541313607245684,
"mean_e2el": 8.101448465150316,
"median_e2el": 8.142378360033035,
"std_e2el": 0.6163142913619017,
"p99_e2el": 9.565222159053665
},
{
"hw": "b200-trt",
"conc": 64,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 1366.5673523618073,
"output_tput_per_gpu": 683.129672362983,
"input_tput_per_gpu": 683.4376799988243,
"mean_ttft": 0.4568303398431908,
"median_ttft": 0.11456687549070921,
"std_ttft": 1.0374367950486671,
"p99_ttft": 5.381849112740894,
"mean_tpot": 0.02234443699234901,
"mean_intvty": 44.75386872993987,
"median_tpot": 0.022562204283499206,
"median_intvty": 44.32191054715991,
"std_tpot": 0.0011427359356261106,
"std_intvty": 875.0928091292544,
"p99_tpot": 0.02448816586672067,
"p99_intvty": 40.83605139897376,
"mean_itl": 0.22236324790509093,
"median_itl": 0.18635848199483007,
"std_itl": 0.07782023240492508,
"p99_itl": 0.5049286349746398,
"mean_e2el": 21.05218365431306,
"median_e2el": 21.059494054497918,
"std_e2el": 2.166727250854411,
"p99_e2el": 27.261061567020082
},
{
"hw": "mi325x",
"conc": 16,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 153.77998713545853,
"output_tput_per_gpu": 76.48290034411293,
"input_tput_per_gpu": 77.29708679134559,
"mean_ttft": 0.2514936144165404,
"median_ttft": 0.1376320410054177,
"std_ttft": 0.3493596041451523,
"p99_ttft": 1.3411809504963457,
"mean_tpot": 0.025269813441890478,
"mean_intvty": 39.57290790054952,
"median_tpot": 0.025379357224639104,
"median_intvty": 39.40210113080278,
"std_tpot": 0.0004439990990953196,
"std_intvty": 2252.25682222683,
"p99_tpot": 0.025863465503007717,
"p99_intvty": 38.66457880069119,
"mean_itl": 0.02527630599983262,
"median_itl": 0.023918416001833975,
"std_itl": 0.00893579474326899,
"p99_itl": 0.08887578076450145,
"mean_e2el": 23.376627623062813,
"median_e2el": 23.35095692804316,
"std_e2el": 1.6821515116766381,
"p99_e2el": 26.560598223117413
},
{
"hw": "mi300x",
"conc": 8,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 83.21387824915277,
"output_tput_per_gpu": 41.761171041413945,
"input_tput_per_gpu": 41.452707207738825,
"mean_ttft": 0.2894865931244567,
"median_ttft": 0.17828640108928084,
"std_ttft": 0.3526266750443899,
"p99_ttft": 1.428675928255543,
"mean_tpot": 0.02311063989057251,
"mean_intvty": 43.270113018719506,
"median_tpot": 0.023170519148420894,
"median_intvty": 43.15829065349843,
"std_tpot": 0.00038067810625572906,
"std_intvty": 2626.89128575266,
"p99_tpot": 0.023844030985488558,
"p99_intvty": 41.939217433855816,
"mean_itl": 0.02311783327041665,
"median_itl": 0.02216945542022586,
"std_itl": 0.011508656619891719,
"p99_itl": 0.0254195882473141,
"mean_e2el": 21.74456925011473,
"median_e2el": 21.94070298410952,
"std_e2el": 1.5284897199732408,
"p99_e2el": 24.42748722226359
},
{
"hw": "b200-trt",
"conc": 32,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 342.785513796456,
"output_tput_per_gpu": 171.6642663813588,
"input_tput_per_gpu": 171.12124741509717,
"mean_ttft": 0.3772049359686207,
"median_ttft": 0.1376246064901352,
"std_ttft": 0.7399091392518318,
"p99_ttft": 3.7817022288544107,
"mean_tpot": 0.022346557272562843,
"mean_intvty": 44.74962240504949,
"median_tpot": 0.022482151086371707,
"median_intvty": 44.47972954893016,
"std_tpot": 0.0008080899146106663,
"std_intvty": 1237.4860543604175,
"p99_tpot": 0.023623484823969695,
"p99_intvty": 42.330757187243805,
"mean_itl": 0.2222407169017928,
"median_itl": 0.19886512658558786,
"std_itl": 0.06334746271004797,
"p99_itl": 0.4483058636356144,
"mean_e2el": 21.003921266173712,
"median_e2el": 20.98152683977969,
"std_e2el": 1.8073383647896561,
"p99_e2el": 26.16535166225862
},
{
"hw": "gb200",
"conc": 1075,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "true",
"prefill_num_workers": 1,
"decode_tp": 16,
"decode_ep": 16,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 4,
"num_decode_gpu": 16,
"tput_per_gpu": 4442.0033986024655,
"output_tput_per_gpu": 2776.706016926881,
"input_tput_per_gpu": 11103.192925304802,
"mean_ttft": 1.9480372721163643,
"median_ttft": 0.9353012915235013,
"std_ttft": 3.0433942664282787,
"p99_ttft": 14.64567912986735,
"mean_tpot": 0.020550563984298366,
"mean_intvty": 48.66046502490388,
"median_tpot": 0.020383988802661422,
"median_intvty": 49.058111720971695,
"std_tpot": 0.0004260433798055789,
"std_intvty": 2347.1788259128475,
"p99_tpot": 0.021687672736124822,
"p99_intvty": 46.10914283736472,
"mean_itl": 0.4063176843756857,
"median_itl": 0.4049514851067215,
"std_itl": 0.09347745592232594,
"p99_itl": 0.716668958151713,
"mean_e2el": 20.872755769070423,
"median_e2el": 20.183781350497156,
"std_e2el": 3.237900787598791,
"p99_e2el": 34.04821030642604
},
{
"hw": "gb200",
"conc": 4,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "false",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 32,
"tput_per_gpu": 32.30548848847,
"output_tput_per_gpu": 18.242645000595903,
"input_tput_per_gpu": 144.80823639146277,
"mean_ttft": 0.14840737418853678,
"median_ttft": 0.14289077941793948,
"std_ttft": 0.042909841580251844,
"p99_ttft": 0.2658678555791266,
"mean_tpot": 0.006387308303465267,
"mean_intvty": 156.56047156162418,
"median_tpot": 0.0062471594259788585,
"median_intvty": 160.07275176002275,
"std_tpot": 0.00034853667305162386,
"std_intvty": 2869.138536396955,
"p99_tpot": 0.007677267151823523,
"p99_intvty": 130.25468310849095,
"mean_itl": 0.12613988418477137,
"median_itl": 0.12492778291925788,
"std_itl": 0.032114731512498454,
"p99_itl": 0.13669693334959454,
"mean_e2el": 6.084865549652022,
"median_e2el": 6.130423433962278,
"std_e2el": 0.5584283270248697,
"p99_e2el": 7.738454438014888
},
{
"hw": "gb200",
"conc": 32,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "false",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 32,
"tput_per_gpu": 193.88061955696213,
"output_tput_per_gpu": 108.76266393323671,
"input_tput_per_gpu": 874.8242645467654,
"mean_ttft": 0.2093679561976387,
"median_ttft": 0.15004791400860995,
"std_ttft": 0.17200220740846395,
"p99_ttft": 0.9295085422927503,
"mean_tpot": 0.00860492063866793,
"mean_intvty": 116.2125767327011,
"median_tpot": 0.008492272347939726,
"median_intvty": 117.75411327247478,
"std_tpot": 0.0004083274176912921,
"std_intvty": 2449.0150714200395,
"p99_tpot": 0.010123680577253494,
"p99_intvty": 98.77830423125569,
"mean_itl": 0.17013832593397857,
"median_itl": 0.16807796503417194,
"std_itl": 0.027877639896314154,
"p99_itl": 0.2994745094375641,
"mean_e2el": 8.093551296603437,
"median_e2el": 8.029026238364168,
"std_e2el": 0.6408468581357575,
"p99_e2el": 9.83711937664775
},
{
"hw": "gb200",
"conc": 256,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "false",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 32,
"tput_per_gpu": 889.9624711332857,
"output_tput_per_gpu": 500.31443833525276,
"input_tput_per_gpu": 4007.146733517549,
"mean_ttft": 0.6670898931645297,
"median_ttft": 0.2107954079983756,
"std_ttft": 1.4167201466626331,
"p99_ttft": 6.83380425700685,
"mean_tpot": 0.014547515477535801,
"mean_intvty": 68.7402602557251,
"median_tpot": 0.01411182000639381,
"median_intvty": 70.86258183189115,
"std_tpot": 0.0010825023852577375,
"std_intvty": 923.7854933334911,
"p99_tpot": 0.016755068665706724,
"p99_intvty": 59.68343191852985,
"mean_itl": 0.2875959412064402,
"median_itl": 0.2815297171473503,
"std_itl": 0.03471473059498881,
"p99_itl": 0.3370971063617617,
"mean_e2el": 14.047743736099052,
"median_e2el": 13.754055539960973,
"std_e2el": 1.854379863531056,
"p99_e2el": 20.471407575663175
},
{
"hw": "gb200",
"conc": 16,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "false",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 32,
"tput_per_gpu": 113.05836519386133,
"output_tput_per_gpu": 63.28888713157418,
"input_tput_per_gpu": 511.2141896921585,
"mean_ttft": 0.17807253625869635,
"median_ttft": 0.13083326700143516,
"std_ttft": 0.09339825318332076,
"p99_ttft": 0.5639588387473486,
"mean_tpot": 0.007518409364787082,
"mean_intvty": 133.00685709979555,
"median_tpot": 0.007513453579027073,
"median_intvty": 133.09458686101198,
"std_tpot": 9.004145068098673e-05,
"std_intvty": 11105.996098874064,
"p99_tpot": 0.0077366363256289854,
"p99_intvty": 129.2551385267163,
"mean_itl": 0.1486130380290456,
"median_itl": 0.14964435994625092,
"std_itl": 0.013255756954381144,
"p99_itl": 0.16641649325378238,
"mean_e2el": 7.073485194661771,
"median_e2el": 7.088631398044527,
"std_e2el": 0.4819814051232923,
"p99_e2el": 7.937781208301894
},
{
"hw": "gb200",
"conc": 64,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "false",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 32,
"tput_per_gpu": 331.7084493426723,
"output_tput_per_gpu": 187.106606833859,
"input_tput_per_gpu": 1488.5231894131787,
"mean_ttft": 0.2947931587536914,
"median_ttft": 0.18885245453566313,
"std_ttft": 0.3531773135932712,
"p99_ttft": 1.8415272141131571,
"mean_tpot": 0.010029004650070852,
"mean_intvty": 99.71079233600068,
"median_tpot": 0.010045340462369374,
"median_intvty": 99.54864185500509,
"std_tpot": 0.00016509895971075173,
"std_intvty": 6056.973355567891,
"p99_tpot": 0.010307073854953048,
"p99_intvty": 97.0207465351043,
"mean_itl": 0.19837653581187137,
"median_itl": 0.2000001723645255,
"std_itl": 0.017391809715383795,
"p99_itl": 0.21920554079581053,
"mean_e2el": 9.563471757051047,
"median_e2el": 9.541636548936367,
"std_e2el": 0.6892220220873352,
"p99_e2el": 11.61378111343598
},
{
"hw": "gb200",
"conc": 564,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "false",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 32,
"tput_per_gpu": 1545.4493769534138,
"output_tput_per_gpu": 869.40630670225,
"input_tput_per_gpu": 6953.7939389627245,
"mean_ttft": 2.441518805006113,
"median_ttft": 1.6333737700479105,
"std_ttft": 2.7245178959665166,
"p99_ttft": 14.688298106354196,
"mean_tpot": 0.016616069654606953,
"mean_intvty": 60.18270389969996,
"median_tpot": 0.016725962126458386,
"median_intvty": 59.78729309796324,
"std_tpot": 0.00045934587600503414,
"std_intvty": 2177.008768854258,
"p99_tpot": 0.016970812281506577,
"p99_intvty": 58.92469867748872,
"mean_itl": 0.3284953067926914,
"median_itl": 0.3337935770396143,
"std_itl": 0.03102841840298918,
"p99_itl": 0.3532176049705595,
"mean_e2el": 17.741085684470416,
"median_e2el": 17.13221336551942,
"std_e2el": 2.905664287107915,
"p99_e2el": 29.981312049129052
},
{
"hw": "gb200",
"conc": 128,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "false",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 32,
"tput_per_gpu": 548.3264733520737,
"output_tput_per_gpu": 308.18287091292643,
"input_tput_per_gpu": 2469.4752928652515,
"mean_ttft": 0.41526711276424066,
"median_ttft": 0.19203887740150094,
"std_ttft": 0.7042873918169126,
"p99_ttft": 3.5030307700508256,
"mean_tpot": 0.011994281027247911,
"mean_intvty": 83.37306735837338,
"median_tpot": 0.011786082392241527,
"median_intvty": 84.84583483467536,
"std_tpot": 0.0005816401758990426,
"std_intvty": 1719.2760084949387,
"p99_tpot": 0.013430561973805008,
"p99_intvty": 74.45704818237702,
"mean_itl": 0.23720926229856373,
"median_itl": 0.23530848696827888,
"std_itl": 0.024985655915077898,
"p99_itl": 0.2720074794720858,
"mean_e2el": 11.452215554241093,
"median_e2el": 11.329832831164822,
"std_e2el": 1.1193682218335972,
"p99_e2el": 14.931139022605493
},
{
"hw": "gb200",
"conc": 8,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "false",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 32,
"tput_per_gpu": 61.299110216476706,
"output_tput_per_gpu": 34.5057414434418,
"input_tput_per_gpu": 275.64606040075597,
"mean_ttft": 0.15919374672375852,
"median_ttft": 0.12568423303309828,
"std_ttft": 0.07043458638678603,
"p99_ttft": 0.4057272569951601,
"mean_tpot": 0.006728571503425011,
"mean_intvty": 148.6199558837972,
"median_tpot": 0.006689147116119343,
"median_intvty": 149.49588978095946,
"std_tpot": 0.000165634762200263,
"std_intvty": 6037.379996301356,
"p99_tpot": 0.0074147995356925196,
"p99_intvty": 134.86541277162164,
"mean_itl": 0.13295241337608124,
"median_itl": 0.13338100002147257,
"std_itl": 0.014608512970082246,
"p99_itl": 0.153345231898129,
"mean_e2el": 6.39341539315501,
"median_e2el": 6.405678987503052,
"std_e2el": 0.4058776802296038,
"p99_e2el": 7.189627567902206
},
{
"hw": "b200-trt",
"conc": 256,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "true",
"tput_per_gpu": 2128.852677927518,
"output_tput_per_gpu": 1064.940969960399,
"input_tput_per_gpu": 1063.9117079671191,
"mean_ttft": 1.310160240385369,
"median_ttft": 1.168279828998493,
"std_ttft": 0.8978656254236859,
"p99_ttft": 3.9658524450630646,
"mean_tpot": 0.027373944232363702,
"mean_intvty": 36.53108925449328,
"median_tpot": 0.027608027207607314,
"median_intvty": 36.22134940972721,
"std_tpot": 0.0008730665015690951,
"std_intvty": 1145.3881212974923,
"p99_tpot": 0.02854741118424829,
"p99_intvty": 35.02944605189887,
"mean_itl": 0.27226027570796396,
"median_itl": 0.24703382499865256,
"std_itl": 0.06409998257520498,
"p99_itl": 0.5082758385234046,
"mean_e2el": 26.533053960348944,
"median_e2el": 26.538300427011563,
"std_e2el": 2.13421962699385,
"p99_e2el": 31.03455472946865
},
{
"hw": "gb200",
"conc": 8,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "mtp",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "false",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 32,
"tput_per_gpu": 99.51521099812607,
"output_tput_per_gpu": 56.01787902408906,
"input_tput_per_gpu": 447.49386679042215,
"mean_ttft": 0.18750758845635573,
"median_ttft": 0.1494649270316586,
"std_ttft": 0.06651220631348966,
"p99_ttft": 0.40635371166281403,
"mean_tpot": 0.004036221760066496,
"mean_intvty": 247.75645627150206,
"median_tpot": 0.003986540320365812,
"median_intvty": 250.84407020577635,
"std_tpot": 0.0004519423185463171,
"std_intvty": 2212.6717480596267,
"p99_tpot": 0.005237616471229547,
"p99_intvty": 190.92654177583316,
"mean_itl": 0.18209876070071734,
"median_itl": 0.18269856786355376,
"std_itl": 0.04708065180617375,
"p99_itl": 0.23990691555198207,
"mean_e2el": 3.9290679442383407,
"median_e2el": 3.8794775566784665,
"std_e2el": 0.4844947665131011,
"p99_e2el": 5.294441736566366
},
{
"hw": "gb200",
"conc": 16,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "mtp",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "false",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 32,
"tput_per_gpu": 175.52461944172794,
"output_tput_per_gpu": 98.25684114228814,
"input_tput_per_gpu": 793.6668458372463,
"mean_ttft": 0.21040188713413954,
"median_ttft": 0.16469682636670768,
"std_ttft": 0.10812090479076027,
"p99_ttft": 0.6610149128641933,
"mean_tpot": 0.004692196283491124,
"mean_intvty": 213.11981417281467,
"median_tpot": 0.004715714771509992,
"median_intvty": 212.05693059332248,
"std_tpot": 0.00039921021092914204,
"std_intvty": 2504.9459473307293,
"p99_tpot": 0.005432069610460653,
"p99_intvty": 184.09189714253267,
"mean_itl": 0.21232542605911525,
"median_itl": 0.21622406505048275,
"std_itl": 0.02667942989675247,
"p99_itl": 0.244650762360543,
"mean_e2el": 4.511650454967821,
"median_e2el": 4.497550335130654,
"std_e2el": 0.4725469757470022,
"p99_e2el": 5.441198795947712
},
{
"hw": "gb200",
"conc": 64,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "mtp",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "false",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 32,
"tput_per_gpu": 479.8916806256891,
"output_tput_per_gpu": 270.69224250272833,
"input_tput_per_gpu": 2153.4871856093755,
"mean_ttft": 0.35485877435303337,
"median_ttft": 0.2365343039855361,
"std_ttft": 0.39787715251864675,
"p99_ttft": 2.092928964558523,
"mean_tpot": 0.006664123922441312,
"mean_intvty": 150.05723357462168,
"median_tpot": 0.006663868170609483,
"median_intvty": 150.06299260396972,
"std_tpot": 0.0007488264325204436,
"std_intvty": 1335.422945253337,
"p99_tpot": 0.00865190497431278,
"p99_intvty": 115.58148210931199,
"mean_itl": 0.3062318582727932,
"median_itl": 0.3048161950428039,
"std_itl": 0.04874367714333303,
"p99_itl": 0.42107235385105024,
"mean_e2el": 6.508803171305317,
"median_e2el": 6.419823716511019,
"std_e2el": 0.8529058887369169,
"p99_e2el": 8.816713589767458
},
{
"hw": "gb200",
"conc": 144,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "mtp",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "false",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 32,
"tput_per_gpu": 840.975570182399,
"output_tput_per_gpu": 472.11837840035133,
"input_tput_per_gpu": 3791.83310443878,
"mean_ttft": 1.2457016802365817,
"median_ttft": 0.9568504230119288,
"std_ttft": 1.1520934888264869,
"p99_ttft": 6.275029098873493,
"mean_tpot": 0.007679123476833952,
"mean_intvty": 130.22319578748235,
"median_tpot": 0.007759047048666262,
"median_intvty": 128.88180645481387,
"std_tpot": 0.0007452788364715315,
"std_intvty": 1341.7796817288242,
"p99_tpot": 0.008991732917540182,
"p99_intvty": 111.21326769496224,
"mean_itl": 0.35007406755007664,
"median_itl": 0.3609183810185641,
"std_itl": 0.04687425970042775,
"p99_itl": 0.3739598415605724,
"mean_e2el": 8.292461603375664,
"median_e2el": 8.148701257421635,
"std_e2el": 1.4007224146850852,
"p99_e2el": 13.462928508874029
},
{
"hw": "gb200",
"conc": 32,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "mtp",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "false",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 32,
"tput_per_gpu": 294.9592385759157,
"output_tput_per_gpu": 165.465494243536,
"input_tput_per_gpu": 1330.9091932349534,
"mean_ttft": 0.2577960086227904,
"median_ttft": 0.19730710645671934,
"std_ttft": 0.20271399300979337,
"p99_ttft": 1.1087125004036338,
"mean_tpot": 0.005508829952219666,
"mean_intvty": 181.52675044853603,
"median_tpot": 0.005547889245630062,
"median_intvty": 180.24873167532604,
"std_tpot": 0.00046927384983360635,
"std_intvty": 2130.951895901672,
"p99_tpot": 0.00633126788299178,
"p99_intvty": 157.94624686255725,
"mean_itl": 0.25001352380896114,
"median_itl": 0.2541605730075389,
"std_itl": 0.03360556282227612,
"p99_itl": 0.29149442691821603,
"mean_e2el": 5.303967301434568,
"median_e2el": 5.298168066539802,
"std_e2el": 0.5386612306643197,
"p99_e2el": 6.40852697287919
},
{
"hw": "gb200",
"conc": 4,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "mtp",
"disagg": true,
"isl": 1024,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "false",
"decode_num_workers": 4,
"num_prefill_gpu": 4,
"num_decode_gpu": 32,
"tput_per_gpu": 49.02296004009678,
"output_tput_per_gpu": 27.682864390336267,
"input_tput_per_gpu": 219.74372523818087,
"mean_ttft": 0.6219795793585945,
"median_ttft": 0.1346952844178304,
"std_ttft": 1.374741126592172,
"p99_ttft": 4.935023965949658,
"mean_tpot": 0.003636352923824148,
"mean_intvty": 275.00081013818544,
"median_tpot": 0.0036389672215150965,
"median_intvty": 274.80324474691105,
"std_tpot": 0.00020975560073827006,
"std_intvty": 4767.453152527666,
"p99_tpot": 0.004138613512598813,
"p99_intvty": 241.6268146218024,
"mean_itl": 0.1632820755978074,
"median_itl": 0.16497785109095275,
"std_itl": 0.03199877654797429,
"p99_itl": 0.18763630567118517,
"mean_e2el": 3.9947748274280457,
"median_e2el": 3.519940600497648,
"std_e2el": 1.409657149455134,
"p99_e2el": 8.321354570840485
}
]
[
{
"hw": "b200-trt",
"conc": 16,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 309.37028182306994,
"output_tput_per_gpu": 274.88330976582523,
"input_tput_per_gpu": 34.486972057244714,
"mean_ttft": 0.19576276233783574,
"median_ttft": 0.10887511249165982,
"std_ttft": 0.2931485006317912,
"p99_ttft": 1.5453500700599394,
"mean_tpot": 0.014215039857037055,
"mean_intvty": 70.34802646050669,
"median_tpot": 0.014236842968561928,
"median_intvty": 70.24029148935753,
"std_tpot": 0.00026195018532655873,
"std_intvty": 3817.5197270937433,
"p99_tpot": 0.014617866479870954,
"p99_intvty": 68.40943590345532,
"mean_itl": 0.1420455815237089,
"median_itl": 0.14148388998000883,
"std_itl": 0.013345640252464614,
"p99_itl": 0.2095448920445051,
"mean_e2el": 105.0582497928821,
"median_e2el": 105.13963892450556,
"std_e2el": 7.237539913483957,
"p99_e2el": 117.99002345560642
},
{
"hw": "b200",
"conc": 4,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 63.607946531299,
"output_tput_per_gpu": 56.52358204300893,
"input_tput_per_gpu": 7.0843644882900705,
"mean_ttft": 0.1851666160451714,
"median_ttft": 0.15393621241673827,
"std_ttft": 0.13023445846098358,
"p99_ttft": 0.7951863640686497,
"mean_tpot": 0.008667973828066186,
"mean_intvty": 115.36721497267126,
"median_tpot": 0.008685885675571131,
"median_intvty": 115.12930717157363,
"std_tpot": 8.488630149711112e-05,
"std_intvty": 11780.46377758645,
"p99_tpot": 0.008771708115209642,
"p99_intvty": 114.0028814075627,
"mean_itl": 0.2595071424240725,
"median_itl": 0.2595431061927229,
"std_itl": 0.015385182059725703,
"p99_itl": 0.34859713237732654,
"mean_e2el": 63.95255898209289,
"median_e2el": 64.17846440337598,
"std_e2el": 3.971360637632554,
"p99_e2el": 71.24048401063307
},
{
"hw": "h200",
"conc": 32,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 184.47715901258852,
"output_tput_per_gpu": 164.10219345372272,
"input_tput_per_gpu": 20.374965558865796,
"mean_ttft": 0.24040484218858182,
"median_ttft": 0.13828571047633886,
"std_ttft": 0.30589033875156113,
"p99_ttft": 1.1733397391997278,
"mean_tpot": 0.02383611672721512,
"mean_intvty": 41.95314242853326,
"median_tpot": 0.02382154643008247,
"median_intvty": 41.97880280086158,
"std_tpot": 0.0006633211854322042,
"std_intvty": 1507.565297116847,
"p99_tpot": 0.024743203307826315,
"p99_intvty": 40.41513895994616,
"mean_itl": 0.23826572344120503,
"median_itl": 0.23661787249147892,
"std_itl": 0.023685250923894494,
"p99_itl": 0.3380245644599199,
"mean_e2el": 177.01942448309856,
"median_e2el": 177.0863628629595,
"std_e2el": 12.456891288656117,
"p99_e2el": 199.43760170377791
},
{
"hw": "h200",
"conc": 8,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 68.79598929783467,
"output_tput_per_gpu": 61.133804144938004,
"input_tput_per_gpu": 7.662185152896662,
"mean_ttft": 0.290583943692036,
"median_ttft": 0.10607097193133086,
"std_ttft": 0.5264846001969143,
"p99_ttft": 2.006596488284413,
"mean_tpot": 0.01593029432205044,
"mean_intvty": 62.77347924550378,
"median_tpot": 0.015951887813861386,
"median_intvty": 62.68850506402448,
"std_tpot": 0.000284283477845844,
"std_intvty": 3517.6156123369988,
"p99_tpot": 0.016357065250116445,
"p99_intvty": 61.13566123928503,
"mean_itl": 0.1591590861470368,
"median_itl": 0.15957718202844262,
"std_itl": 0.01580000897035974,
"p99_itl": 0.16817948897369206,
"mean_e2el": 117.47345044257818,
"median_e2el": 117.72639550152235,
"std_e2el": 7.300851683117578,
"p99_e2el": 132.1887364871357
},
{
"hw": "mi300x",
"conc": 8,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 47.734654904380726,
"output_tput_per_gpu": 42.41818561859776,
"input_tput_per_gpu": 5.316469285782965,
"mean_ttft": 0.2853303897427395,
"median_ttft": 0.1778844348154962,
"std_ttft": 0.34192840128096946,
"p99_ttft": 1.3904687526449562,
"mean_tpot": 0.0228973936494073,
"mean_intvty": 43.67309289919488,
"median_tpot": 0.022889902466469193,
"median_intvty": 43.68738580100432,
"std_tpot": 0.00014032786740393176,
"std_intvty": 7126.168297858573,
"p99_tpot": 0.023155293810947636,
"p99_intvty": 43.186668593564036,
"mean_itl": 0.022897170630775895,
"median_itl": 0.022848219610750675,
"std_itl": 0.004030150613916717,
"p99_itl": 0.023343065176159142,
"mean_e2el": 168.73981428202242,
"median_e2el": 169.50895645469427,
"std_e2el": 10.430418343596353,
"p99_e2el": 187.94064782828093
},
{
"hw": "b200",
"conc": 128,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 736.2126353139171,
"output_tput_per_gpu": 654.3163368642431,
"input_tput_per_gpu": 81.89629844967396,
"mean_ttft": 7.751795760030218,
"median_ttft": 0.5092517968732864,
"std_ttft": 21.886912351514905,
"p99_ttft": 74.50283730366733,
"mean_tpot": 0.02280778012836505,
"mean_intvty": 43.84468783774109,
"median_tpot": 0.023173134937669343,
"median_intvty": 43.15341893489081,
"std_tpot": 0.0008858272386228715,
"std_intvty": 1128.8882937881028,
"p99_tpot": 0.023632325685824577,
"p99_intvty": 42.31492123518897,
"mean_itl": 0.22801115341995606,
"median_itl": 0.22129074099939317,
"std_itl": 0.08917863797006972,
"p99_itl": 0.32682014284888283,
"mean_e2el": 175.7979766966211,
"median_e2el": 172.4868863238953,
"std_e2el": 22.289744861913263,
"p99_e2el": 247.11046467796902
},
{
"hw": "b200-trt",
"conc": 256,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "true",
"tput_per_gpu": 2036.8434315911913,
"output_tput_per_gpu": 1810.7665895799516,
"input_tput_per_gpu": 226.0768420112397,
"mean_ttft": 2.942117972943868,
"median_ttft": 2.1328106684959494,
"std_ttft": 2.606054413703401,
"p99_ttft": 10.918926249645592,
"mean_tpot": 0.033698380299494755,
"mean_intvty": 29.675016754885192,
"median_tpot": 0.03397455250896672,
"median_intvty": 29.433794594824327,
"std_tpot": 0.0007579723276839534,
"std_intvty": 1319.3093777652568,
"p99_tpot": 0.034443960170872916,
"p99_intvty": 29.032666250892863,
"mean_itl": 0.33678295369482636,
"median_itl": 0.33446766101405956,
"std_itl": 0.029839775448208747,
"p99_itl": 0.418679526153137,
"mean_e2el": 251.53306129767907,
"median_e2el": 251.75429335498484,
"std_e2el": 17.319217540189452,
"p99_e2el": 283.7925188251666
},
{
"hw": "mi325x",
"conc": 8,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 47.01676668779743,
"output_tput_per_gpu": 41.7802525344384,
"input_tput_per_gpu": 5.236514153359032,
"mean_ttft": 0.1670265623310115,
"median_ttft": 0.12457656464539468,
"std_ttft": 0.16683505311653923,
"p99_ttft": 0.9848397461138666,
"mean_tpot": 0.023256512670221052,
"mean_intvty": 42.99870811157583,
"median_tpot": 0.023248755175269184,
"median_intvty": 43.01305564367368,
"std_tpot": 0.0001360813287376628,
"std_intvty": 7348.546705682138,
"p99_tpot": 0.023502960416625272,
"p99_intvty": 42.54783151881713,
"mean_itl": 0.02325760657200843,
"median_itl": 0.02319236844778061,
"std_itl": 0.002761949934959673,
"p99_itl": 0.023864173986949025,
"mean_e2el": 171.27323785570917,
"median_e2el": 172.0249329970684,
"std_e2el": 10.737011085998896,
"p99_e2el": 190.258652907582
},
{
"hw": "h200",
"conc": 16,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 112.33083455859396,
"output_tput_per_gpu": 99.80878386335382,
"input_tput_per_gpu": 12.52205069524014,
"mean_ttft": 0.3561223696509842,
"median_ttft": 0.1272523677907884,
"std_ttft": 0.7041362760431397,
"p99_ttft": 3.0556842187233264,
"mean_tpot": 0.01951660262028362,
"mean_intvty": 51.23842604453601,
"median_tpot": 0.019588329123449932,
"median_intvty": 51.050806513295825,
"std_tpot": 0.00025908132804987305,
"std_intvty": 3859.7918558125516,
"p99_tpot": 0.019837721740184635,
"p99_intvty": 50.40901435643853,
"mean_itl": 0.19502401219910462,
"median_itl": 0.19348819181323051,
"std_itl": 0.017398497228514637,
"p99_itl": 0.2664274647086859,
"mean_e2el": 144.32894245528732,
"median_e2el": 144.1097810762003,
"std_e2el": 9.82024322628906,
"p99_e2el": 161.76840007681398
},
{
"hw": "b200",
"conc": 16,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 337.10913565337756,
"output_tput_per_gpu": 299.5299820481556,
"input_tput_per_gpu": 37.57915360522196,
"mean_ttft": 0.4129892454831861,
"median_ttft": 0.3584206241648644,
"std_ttft": 0.258392262112766,
"p99_ttft": 1.098296815631911,
"mean_tpot": 0.012995899061695659,
"mean_intvty": 76.94735048746396,
"median_tpot": 0.01296977312965501,
"median_intvty": 77.10235098203292,
"std_tpot": 0.00021710279695218818,
"std_intvty": 4606.112929167958,
"p99_tpot": 0.01338266367268677,
"p99_intvty": 74.7235396822339,
"mean_itl": 0.1298989581076945,
"median_itl": 0.1295618494041264,
"std_itl": 0.017965943343071153,
"p99_itl": 0.22406048603355885,
"mean_e2el": 96.29628129364282,
"median_e2el": 96.45365517027676,
"std_e2el": 6.766638509078871,
"p99_e2el": 108.87544606021605
},
{
"hw": "b200-trt",
"conc": 16,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 189.38485008132074,
"output_tput_per_gpu": 168.27322295821114,
"input_tput_per_gpu": 21.1116271231096,
"mean_ttft": 0.1920692515299379,
"median_ttft": 0.10599424599786289,
"std_ttft": 0.2838727389584635,
"p99_ttft": 1.4915215353883102,
"mean_tpot": 0.011586163927247432,
"mean_intvty": 86.30984390340606,
"median_tpot": 0.01159949016138027,
"median_intvty": 86.21068564973945,
"std_tpot": 0.00015362085470138357,
"std_intvty": 6509.532849194555,
"p99_tpot": 0.011801062742627433,
"p99_intvty": 84.73813094712487,
"mean_itl": 0.11578407456345063,
"median_itl": 0.11522252799477428,
"std_itl": 0.01258550124207106,
"p99_itl": 0.18700968641322105,
"mean_e2el": 85.66749113114437,
"median_e2el": 86.00221868451627,
"std_e2el": 5.8595389907682485,
"p99_e2el": 95.49856752559658
},
{
"hw": "b200-trt",
"conc": 32,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 317.18617570373834,
"output_tput_per_gpu": 282.15388530907205,
"input_tput_per_gpu": 35.03229039466629,
"mean_ttft": 0.27693695717243827,
"median_ttft": 0.1094204899854958,
"std_ttft": 0.566020545848356,
"p99_ttft": 2.9231020346429433,
"mean_tpot": 0.013791658906321955,
"mean_intvty": 72.5075936689248,
"median_tpot": 0.013834446782976982,
"median_intvty": 72.28333851632438,
"std_tpot": 0.00022829898711351472,
"std_intvty": 4380.220922762046,
"p99_tpot": 0.014100246563314418,
"p99_intvty": 70.92074564155274,
"mean_itl": 0.13782895491356664,
"median_itl": 0.13664817600511014,
"std_itl": 0.018325485320078647,
"p99_itl": 0.20972779428586363,
"mean_e2el": 102.55334349978075,
"median_e2el": 102.6779627805081,
"std_e2el": 6.711607433610347,
"p99_e2el": 114.2246265227694
},
{
"hw": "mi300x",
"conc": 64,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 216.9495014873866,
"output_tput_per_gpu": 192.86816406486693,
"input_tput_per_gpu": 24.08133742251968,
"mean_ttft": 0.5340816939453361,
"median_ttft": 0.1993749369867146,
"std_ttft": 1.0383541862248338,
"p99_ttft": 4.47203299687244,
"mean_tpot": 0.04021894529819424,
"mean_intvty": 24.863904127413758,
"median_tpot": 0.040444590296304275,
"median_intvty": 24.72518556063547,
"std_tpot": 0.0008976458697565381,
"std_intvty": 1114.0250667796452,
"p99_tpot": 0.041331368560410066,
"p99_intvty": 24.19469847794652,
"mean_itl": 0.04022401601706723,
"median_itl": 0.03929567523300648,
"std_itl": 0.01164454358209665,
"p99_itl": 0.0526023424230516,
"mean_e2el": 297.5951017289306,
"median_e2el": 295.96678702766076,
"std_e2el": 20.689053383253967,
"p99_e2el": 334.01881422945297
},
{
"hw": "mi355x",
"conc": 32,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 169.01123197037268,
"output_tput_per_gpu": 150.25102981562875,
"input_tput_per_gpu": 18.760202154743922,
"mean_ttft": 0.16073018482829865,
"median_ttft": 0.11345360850100406,
"std_ttft": 0.2073422998981248,
"p99_ttft": 1.080670743365772,
"mean_tpot": 0.026020436965687,
"mean_intvty": 38.43133000874252,
"median_tpot": 0.026090370230628365,
"median_intvty": 38.3283177341066,
"std_tpot": 0.0003267735574977553,
"std_intvty": 3060.2231332835718,
"p99_tpot": 0.026257363203634337,
"p99_intvty": 38.08455526340085,
"mean_itl": 0.02602116777915961,
"median_itl": 0.025895418017171323,
"std_itl": 0.0033715220691318427,
"p99_itl": 0.0271719610929722,
"mean_e2el": 192.33136381203826,
"median_e2el": 191.4380899359967,
"std_e2el": 12.669841361274088,
"p99_e2el": 213.99566331586394
},
{
"hw": "b200",
"conc": 4,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 128.9578510105879,
"output_tput_per_gpu": 114.3640191642729,
"input_tput_per_gpu": 14.593831846314984,
"mean_ttft": 0.21393031908664853,
"median_ttft": 0.1660587729420513,
"std_ttft": 0.15485439955137434,
"p99_ttft": 0.7216421717265621,
"mean_tpot": 0.008586395948505661,
"mean_intvty": 116.46329915335848,
"median_tpot": 0.008593602532909993,
"median_intvty": 116.36563317542415,
"std_tpot": 9.562013559652039e-05,
"std_intvty": 10458.048336383974,
"p99_tpot": 0.008706801386305512,
"p99_intvty": 114.85274047629585,
"mean_itl": 0.08582792006901235,
"median_itl": 0.08595622540451586,
"std_itl": 0.007065223774353514,
"p99_itl": 0.0879846709035337,
"mean_e2el": 62.56791406612611,
"median_e2el": 61.883089943788946,
"std_e2el": 4.097433506449553,
"p99_e2el": 70.0528884075256
},
{
"hw": "mi300x",
"conc": 16,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 83.0922652566447,
"output_tput_per_gpu": 73.82957650324362,
"input_tput_per_gpu": 9.262688753401079,
"mean_ttft": 0.2828747011953965,
"median_ttft": 0.15286062099039555,
"std_ttft": 0.4013341044900704,
"p99_ttft": 1.5320151393953712,
"mean_tpot": 0.026303478763998033,
"mean_intvty": 38.017784984726624,
"median_tpot": 0.026271321437002837,
"median_intvty": 38.06432053286487,
"std_tpot": 0.00033414180047623785,
"std_intvty": 2992.741400730897,
"p99_tpot": 0.026939560495414593,
"p99_intvty": 37.120130455365484,
"mean_itl": 0.026304313804208373,
"median_itl": 0.02614685893058777,
"std_itl": 0.004445215755194214,
"p99_itl": 0.02737818192690611,
"mean_e2el": 194.3291398555215,
"median_e2el": 195.04265124630183,
"std_e2el": 13.188711935266607,
"p99_e2el": 217.05931397995911
},
{
"hw": "mi355x",
"conc": 32,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "amd/DeepSeek-R1-0528-MXFP4-Preview",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 232.6415258713034,
"output_tput_per_gpu": 206.81837788254245,
"input_tput_per_gpu": 25.82314798876095,
"mean_ttft": 0.22444424988607353,
"median_ttft": 0.10785224600112997,
"std_ttft": 0.3305098417120055,
"p99_ttft": 1.624766448875307,
"mean_tpot": 0.018909985708503822,
"mean_intvty": 52.882112943654946,
"median_tpot": 0.018938708232642737,
"median_intvty": 52.80191170992334,
"std_tpot": 0.00036627714043486483,
"std_intvty": 2730.17311102938,
"p99_tpot": 0.019381173403400304,
"p99_intvty": 51.59646318548269,
"mean_itl": 0.018911646387593916,
"median_itl": 0.018504648993257433,
"std_itl": 0.00893513175113899,
"p99_itl": 0.02007971500279382,
"mean_e2el": 139.89008492796717,
"median_e2el": 139.6508659680112,
"std_e2el": 9.539971359400782,
"p99_e2el": 157.0483913678347
},
{
"hw": "h200",
"conc": 64,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 279.29087962659355,
"output_tput_per_gpu": 248.28966567952506,
"input_tput_per_gpu": 31.001213947068493,
"mean_ttft": 0.3747970679862192,
"median_ttft": 0.19016766268759966,
"std_ttft": 0.5438957808962932,
"p99_ttft": 2.28452048279345,
"mean_tpot": 0.03139209388518666,
"mean_intvty": 31.855154474798546,
"median_tpot": 0.03177703680251374,
"median_intvty": 31.46926525008444,
"std_tpot": 0.001127206020708881,
"std_intvty": 887.1492714092467,
"p99_tpot": 0.032375819675427936,
"p99_intvty": 30.887248879723764,
"mean_itl": 0.3137480093402642,
"median_itl": 0.30578782595694065,
"std_itl": 0.04571444066788807,
"p99_itl": 0.4724088748171923,
"mean_e2el": 232.23114417821344,
"median_e2el": 231.08432321855798,
"std_e2el": 17.209948744135882,
"p99_e2el": 263.2135284265969
},
{
"hw": "b200-trt",
"conc": 4,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 37.82587128077555,
"output_tput_per_gpu": 33.54521368152112,
"input_tput_per_gpu": 4.2806575992544325,
"mean_ttft": 0.17872503836406395,
"median_ttft": 0.1385208028368652,
"std_ttft": 0.11917942314744918,
"p99_ttft": 0.6087467302381993,
"mean_tpot": 0.014652455655733407,
"mean_intvty": 68.24794583894248,
"median_tpot": 0.014690263703755757,
"median_intvty": 68.07229741861863,
"std_tpot": 0.00012774393930777966,
"std_intvty": 7828.160031848177,
"p99_tpot": 0.01484375324049482,
"p99_intvty": 67.36840634563558,
"mean_itl": 0.14641322959787353,
"median_itl": 0.14637880818918347,
"std_itl": 0.00721251392683979,
"p99_itl": 0.1495003769407049,
"mean_e2el": 106.56989814281697,
"median_e2el": 104.75571159925312,
"std_e2el": 6.70141388680696,
"p99_e2el": 119.20701430584305
},
{
"hw": "b200-trt",
"conc": 128,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 751.4683853781584,
"output_tput_per_gpu": 667.8731344891206,
"input_tput_per_gpu": 83.59525088903774,
"mean_ttft": 0.7804384258474784,
"median_ttft": 0.11994209299155045,
"std_ttft": 2.259486709728604,
"p99_ttft": 11.577278463527621,
"mean_tpot": 0.02325430153448126,
"mean_intvty": 43.002796644621185,
"median_tpot": 0.023506554011959793,
"median_intvty": 42.541326963161616,
"std_tpot": 0.0006963568834303755,
"std_intvty": 1436.0452575320653,
"p99_tpot": 0.023873780901679235,
"p99_intvty": 41.88695557349536,
"mean_itl": 0.2324320633691428,
"median_itl": 0.22792239699629135,
"std_itl": 0.035625907259571626,
"p99_itl": 0.3470631543180207,
"mean_e2el": 172.1015724735223,
"median_e2el": 172.14638177349116,
"std_e2el": 12.812663626188401,
"p99_e2el": 194.05867025838873
},
{
"hw": "b200-trt",
"conc": 4,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 72.64484782102313,
"output_tput_per_gpu": 64.42381524880611,
"input_tput_per_gpu": 8.221032572217027,
"mean_ttft": 0.10644433652050793,
"median_ttft": 0.0882309228181839,
"std_ttft": 0.05639960297597988,
"p99_ttft": 0.3219009250914678,
"mean_tpot": 0.007627512641056799,
"mean_intvty": 131.10433860407855,
"median_tpot": 0.007610877480949423,
"median_intvty": 131.3908944800481,
"std_tpot": 6.739936650885069e-05,
"std_intvty": 14836.934704255462,
"p99_tpot": 0.007722173098800086,
"p99_intvty": 129.49722664924275,
"mean_itl": 0.07622175277115929,
"median_itl": 0.07631527842022479,
"std_itl": 0.00451419485554493,
"p99_itl": 0.07795069878920913,
"mean_e2el": 55.49298079499276,
"median_e2el": 54.82845988380723,
"std_e2el": 3.5388009207021875,
"p99_e2el": 62.19799215829
},
{
"hw": "b200-trt",
"conc": 8,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 72.45497688774395,
"output_tput_per_gpu": 64.38527029831904,
"input_tput_per_gpu": 8.06970658942491,
"mean_ttft": 0.20120269414037467,
"median_ttft": 0.13511406630277634,
"std_ttft": 0.22146729090950226,
"p99_ttft": 1.1757622046535834,
"mean_tpot": 0.015089641150840421,
"mean_intvty": 66.27062830744022,
"median_tpot": 0.01510524169764605,
"median_intvty": 66.20218464666054,
"std_tpot": 0.00017068289355613944,
"std_intvty": 5858.817946926177,
"p99_tpot": 0.015451209555224004,
"p99_intvty": 64.71985228249676,
"mean_itl": 0.15076393650396147,
"median_itl": 0.15017970325425267,
"std_itl": 0.010795978438743538,
"p99_itl": 0.15726398676633835,
"mean_e2el": 111.20303532599937,
"median_e2el": 111.60780905536376,
"std_e2el": 6.778592919350848,
"p99_e2el": 124.65554374639875
},
{
"hw": "mi355x",
"conc": 64,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 291.7735009484479,
"output_tput_per_gpu": 259.315875998219,
"input_tput_per_gpu": 32.45762495022893,
"mean_ttft": 0.20399228964922714,
"median_ttft": 0.12036599399289116,
"std_ttft": 0.36899555811633583,
"p99_ttft": 2.138748829096439,
"mean_tpot": 0.030255065798685355,
"mean_intvty": 33.052316152736715,
"median_tpot": 0.030395375874916527,
"median_intvty": 32.89974120126738,
"std_tpot": 0.00044547642044333964,
"std_intvty": 2244.7877241286906,
"p99_tpot": 0.03068788325686561,
"p99_intvty": 32.58615107564567,
"mean_itl": 0.030258100427245416,
"median_itl": 0.029965230001835153,
"std_itl": 0.004939472080070983,
"p99_itl": 0.0450567158765625,
"mean_e2el": 223.06613037784328,
"median_e2el": 223.78417757399438,
"std_e2el": 15.392685908697898,
"p99_e2el": 249.43531385624635
},
{
"hw": "b200",
"conc": 8,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 105.95685849021423,
"output_tput_per_gpu": 94.14534512670615,
"input_tput_per_gpu": 11.811513363508084,
"mean_ttft": 0.17816311827082246,
"median_ttft": 0.14815939801337663,
"std_ttft": 0.14233852906225866,
"p99_ttft": 0.8099191922819591,
"mean_tpot": 0.010433911874161591,
"mean_intvty": 95.84133085083721,
"median_tpot": 0.010422903449935874,
"median_intvty": 95.94255619878668,
"std_tpot": 0.00010294120964691807,
"std_intvty": 9714.282583524495,
"p99_tpot": 0.010573790000239077,
"p99_intvty": 94.57346892432984,
"mean_itl": 0.31236061532549836,
"median_itl": 0.3122116190061206,
"std_itl": 0.0194266638825908,
"p99_itl": 0.39046552873332985,
"mean_e2el": 77.14772310903936,
"median_e2el": 77.32919654202124,
"std_e2el": 5.136851372653764,
"p99_e2el": 85.59046264491946
},
{
"hw": "h200",
"conc": 64,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 272.0153873524568,
"output_tput_per_gpu": 241.8217511281636,
"input_tput_per_gpu": 30.193636224293186,
"mean_ttft": 0.8576550847108593,
"median_ttft": 0.14478394342586398,
"std_ttft": 2.184414012709969,
"p99_ttft": 9.295266314689071,
"mean_tpot": 0.03209537985061767,
"mean_intvty": 31.157132417635346,
"median_tpot": 0.032143456257679975,
"median_intvty": 31.110531237942777,
"std_tpot": 0.0007564741778968199,
"std_intvty": 1321.922187456868,
"p99_tpot": 0.0334375488993935,
"p99_intvty": 29.906498320459672,
"mean_itl": 0.3207331334579033,
"median_itl": 0.31869961880147457,
"std_itl": 0.04261680408029962,
"p99_itl": 0.3853514727205038,
"mean_e2el": 237.90800582655794,
"median_e2el": 237.55812761606649,
"std_e2el": 16.63787495010176,
"p99_e2el": 267.90412030920385
},
{
"hw": "mi325x",
"conc": 64,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 238.70312176398815,
"output_tput_per_gpu": 212.20713823050394,
"input_tput_per_gpu": 26.495983533484207,
"mean_ttft": 0.4342460937457872,
"median_ttft": 0.1559732384630479,
"std_ttft": 0.849733865297527,
"p99_ttft": 3.7028146919445133,
"mean_tpot": 0.03649294493387885,
"mean_intvty": 27.40255690002242,
"median_tpot": 0.03682609132942983,
"median_intvty": 27.15466029382388,
"std_tpot": 0.0009147719436284604,
"std_intvty": 1093.1686383312992,
"p99_tpot": 0.03726998131806417,
"p99_intvty": 26.83124500294063,
"mean_itl": 0.03649613774649335,
"median_itl": 0.035730197036173195,
"std_itl": 0.007340185428029516,
"p99_itl": 0.06551371487439613,
"mean_e2el": 269.96426765623056,
"median_e2el": 268.896700993937,
"std_e2el": 18.836348292820322,
"p99_e2el": 303.47423627415907
},
{
"hw": "mi300x",
"conc": 32,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 140.53626206645757,
"output_tput_per_gpu": 125.01444074883625,
"input_tput_per_gpu": 15.521821317621317,
"mean_ttft": 0.39575264899613105,
"median_ttft": 0.15791447181254625,
"std_ttft": 0.7256501569275134,
"p99_ttft": 2.6145823072316126,
"mean_tpot": 0.031147340078807418,
"mean_intvty": 32.105470241434766,
"median_tpot": 0.03092490358355661,
"median_intvty": 32.336398310768544,
"std_tpot": 0.0007422230523243446,
"std_intvty": 1347.3038823954626,
"p99_tpot": 0.032631891811052124,
"p99_intvty": 30.644867474747794,
"mean_itl": 0.031149597919825404,
"median_itl": 0.030926089268177748,
"std_itl": 0.00585928299133733,
"p99_itl": 0.03326593008823693,
"mean_e2el": 231.36914498219122,
"median_e2el": 231.3958835969679,
"std_e2el": 15.517551435215182,
"p99_e2el": 262.98786247601265
},
{
"hw": "b200",
"conc": 64,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 370.83956904563394,
"output_tput_per_gpu": 329.58643396084636,
"input_tput_per_gpu": 41.25313508478757,
"mean_ttft": 2.871163252954011,
"median_ttft": 0.48746024699357804,
"std_ttft": 12.779341786229987,
"p99_ttft": 72.1513117080697,
"mean_tpot": 0.023547437888351633,
"mean_intvty": 42.46746523937861,
"median_tpot": 0.02351658630442595,
"median_intvty": 42.523178621881634,
"std_tpot": 0.001249050677635799,
"std_intvty": 800.6080280848158,
"p99_tpot": 0.03199220216260781,
"p99_intvty": 31.25761693169064,
"mean_itl": 0.7051180004642007,
"median_itl": 0.6915085230139084,
"std_itl": 0.5846586432320419,
"p99_itl": 0.8090700112801278,
"mean_e2el": 176.32302990198843,
"median_e2el": 174.74778018600773,
"std_e2el": 18.811742432053016,
"p99_e2el": 251.71612465727873
},
{
"hw": "mi300x",
"conc": 4,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 25.700915762669407,
"output_tput_per_gpu": 22.79240852563497,
"input_tput_per_gpu": 2.9085072370344385,
"mean_ttft": 0.2565527792321518,
"median_ttft": 0.17535990197211504,
"std_ttft": 0.27947587402652735,
"p99_ttft": 1.2382874147873373,
"mean_tpot": 0.021545256345942575,
"mean_intvty": 46.41392907763295,
"median_tpot": 0.02157106163470035,
"median_intvty": 46.35840446495907,
"std_tpot": 9.649547715423698e-05,
"std_intvty": 10363.180011034241,
"p99_tpot": 0.02167752646003532,
"p99_intvty": 46.1307244552832,
"mean_itl": 0.021546178079160386,
"median_itl": 0.021534583065658808,
"std_itl": 0.0029836740196396787,
"p99_itl": 0.021875930381938816,
"mean_e2el": 156.70442869148684,
"median_e2el": 154.92435785662383,
"std_e2el": 9.920359710526341,
"p99_e2el": 175.23799020726233
},
{
"hw": "mi325x",
"conc": 16,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 87.7440299799892,
"output_tput_per_gpu": 77.9627869585909,
"input_tput_per_gpu": 9.781243021398296,
"mean_ttft": 0.22978735354845412,
"median_ttft": 0.12819749279879034,
"std_ttft": 0.33167256512109494,
"p99_ttft": 1.3423768478445708,
"mean_tpot": 0.024889569232093418,
"mean_intvty": 40.177473168582104,
"median_tpot": 0.02496084875591869,
"median_intvty": 40.062740244875734,
"std_tpot": 0.00017725663041048324,
"std_intvty": 5641.537908535457,
"p99_tpot": 0.02509813052862344,
"p99_intvty": 39.843605039010335,
"mean_itl": 0.0248902494754414,
"median_itl": 0.024797521764412522,
"std_itl": 0.003043663008232259,
"p99_itl": 0.02562406552024186,
"mean_e2el": 183.84453522128462,
"median_e2el": 184.61382162827067,
"std_e2el": 12.314037208743997,
"p99_e2el": 204.32187659421936
},
{
"hw": "b200",
"conc": 64,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 481.2734975869175,
"output_tput_per_gpu": 427.85226633978965,
"input_tput_per_gpu": 53.42123124712782,
"mean_ttft": 7.459235192634514,
"median_ttft": 0.4154433479998261,
"std_ttft": 21.39014842696261,
"p99_ttft": 72.73708777010907,
"mean_tpot": 0.017226801274327165,
"mean_intvty": 58.04908201328615,
"median_tpot": 0.01730101234901117,
"median_intvty": 57.8000859040572,
"std_tpot": 0.0006126500293270037,
"std_intvty": 1632.2532475816586,
"p99_tpot": 0.017898040593909398,
"p99_intvty": 55.87203776598285,
"mean_itl": 0.17219072664849122,
"median_itl": 0.16707789502106607,
"std_itl": 0.1102218051359972,
"p99_itl": 0.2644927983544767,
"mean_e2el": 134.706298713079,
"median_e2el": 129.72781919653062,
"std_e2el": 22.09196929448126,
"p99_e2el": 205.77356733962893
},
{
"hw": "b200",
"conc": 16,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 206.20922961941122,
"output_tput_per_gpu": 183.22210914383268,
"input_tput_per_gpu": 22.987120475578536,
"mean_ttft": 0.39921596814983784,
"median_ttft": 0.2932347195164766,
"std_ttft": 0.3592486195682227,
"p99_ttft": 1.4381766899320063,
"mean_tpot": 0.010600415866706668,
"mean_intvty": 94.33592158782723,
"median_tpot": 0.010624815485383663,
"median_intvty": 94.11928154194104,
"std_tpot": 0.0001445589211997582,
"std_intvty": 6917.594512331438,
"p99_tpot": 0.010751147178041041,
"p99_intvty": 93.01332996747323,
"mean_itl": 0.10594558738255881,
"median_itl": 0.10497323349409271,
"std_itl": 0.015537211059806886,
"p99_itl": 0.18451830852980491,
"mean_e2el": 78.60162682465581,
"median_e2el": 78.71777263749391,
"std_e2el": 5.35430738856339,
"p99_e2el": 87.64544803837896
},
{
"hw": "mi355x",
"conc": 16,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "amd/DeepSeek-R1-0528-MXFP4-Preview",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 143.23887983925238,
"output_tput_per_gpu": 127.41856225068777,
"input_tput_per_gpu": 15.820317588564606,
"mean_ttft": 0.21631312640920441,
"median_ttft": 0.1885005980002461,
"std_ttft": 0.25158411202557107,
"p99_ttft": 1.2616269485565135,
"mean_tpot": 0.015454227666734127,
"mean_intvty": 64.7072129105838,
"median_tpot": 0.015465505294356795,
"median_intvty": 64.6600276529529,
"std_tpot": 0.0001366415531226029,
"std_intvty": 7318.417985945613,
"p99_tpot": 0.015655817061583265,
"p99_intvty": 63.87402178157992,
"mean_itl": 0.015455783898267579,
"median_itl": 0.015222205023746938,
"std_itl": 0.006914263025323176,
"p99_itl": 0.015951020401553252,
"mean_e2el": 114.82051588058184,
"median_e2el": 115.03829742001835,
"std_e2el": 7.342498825770811,
"p99_e2el": 127.08193607952067
},
{
"hw": "mi355x",
"conc": 16,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 100.21602825982659,
"output_tput_per_gpu": 89.14745947239773,
"input_tput_per_gpu": 11.068568787428859,
"mean_ttft": 0.12905485950368528,
"median_ttft": 0.10672473450540565,
"std_ttft": 0.1028732064278289,
"p99_ttft": 0.6232913572076358,
"mean_tpot": 0.022109353204534343,
"mean_intvty": 45.2297265663526,
"median_tpot": 0.02205030003037304,
"median_intvty": 45.35085684197297,
"std_tpot": 0.0002551726303133879,
"std_intvty": 3918.915593619344,
"p99_tpot": 0.02253936223665486,
"p99_intvty": 44.36682766355031,
"mean_itl": 0.022110539034913293,
"median_itl": 0.02201301700551994,
"std_itl": 0.002343171721619019,
"p99_itl": 0.022936347106588074,
"mean_e2el": 164.0780797397916,
"median_e2el": 163.90020212550007,
"std_e2el": 10.476667072938682,
"p99_e2el": 182.10763900234073
},
{
"hw": "h200",
"conc": 4,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 40.42402364024439,
"output_tput_per_gpu": 35.849339749856426,
"input_tput_per_gpu": 4.574683890387966,
"mean_ttft": 0.48911951053887603,
"median_ttft": 0.10203479044139385,
"std_ttft": 1.0713694228466197,
"p99_ttft": 3.6982173093780877,
"mean_tpot": 0.013672997024623984,
"mean_intvty": 73.1368549410988,
"median_tpot": 0.01371320670397072,
"median_intvty": 72.92240404357396,
"std_tpot": 0.0002181576119590198,
"std_intvty": 4583.841888532621,
"p99_tpot": 0.014022524432134096,
"p99_intvty": 71.31383545379278,
"mean_itl": 0.13662430664888045,
"median_itl": 0.1365978205576539,
"std_itl": 0.015297202975666524,
"p99_itl": 0.14184080446138977,
"mean_e2el": 99.76717168330215,
"median_e2el": 98.5945138707757,
"std_e2el": 6.533768309742044,
"p99_e2el": 111.34121708292514
},
{
"hw": "b200-trt",
"conc": 64,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 507.8380341825396,
"output_tput_per_gpu": 451.46814638240636,
"input_tput_per_gpu": 56.369887800133256,
"mean_ttft": 0.42506809979290666,
"median_ttft": 0.11144440950010903,
"std_ttft": 1.0692947621697118,
"p99_ttft": 5.498525103374852,
"mean_tpot": 0.017169425628103974,
"mean_intvty": 58.24306658011544,
"median_tpot": 0.017319172106931,
"median_intvty": 57.739480491668985,
"std_tpot": 0.0004246375789036429,
"std_intvty": 2354.949372549329,
"p99_tpot": 0.017521563376453227,
"p99_intvty": 57.0725327708984,
"mean_itl": 0.1715816429166972,
"median_itl": 0.1684428790176753,
"std_itl": 0.02531842391403556,
"p99_itl": 0.24441160891554317,
"mean_e2el": 127.23918353123825,
"median_e2el": 126.79153103700082,
"std_e2el": 8.889800542946007,
"p99_e2el": 142.69544483930866
},
{
"hw": "b200-trt",
"conc": 32,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 493.8308106339527,
"output_tput_per_gpu": 439.28863418008086,
"input_tput_per_gpu": 54.54217645387183,
"mean_ttft": 0.22001645321724936,
"median_ttft": 0.09253562591038644,
"std_ttft": 0.43402926346726955,
"p99_ttft": 2.2601093718828635,
"mean_tpot": 0.017776048086774014,
"mean_intvty": 56.25547338297504,
"median_tpot": 0.01777047299715847,
"median_intvty": 56.27312228323365,
"std_tpot": 0.00042637665707627665,
"std_intvty": 2345.344153821969,
"p99_tpot": 0.01856019175609441,
"p99_intvty": 53.8787536864559,
"mean_itl": 0.17764056015670848,
"median_itl": 0.17828281596302986,
"std_itl": 0.014537443480387704,
"p99_itl": 0.2240007436275482,
"mean_e2el": 132.038749088232,
"median_e2el": 132.0842475811951,
"std_e2el": 8.903412700270477,
"p99_e2el": 148.42990054060704
},
{
"hw": "b200",
"conc": 4,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 71.50904647176564,
"output_tput_per_gpu": 63.41654964802756,
"input_tput_per_gpu": 8.092496823738081,
"mean_ttft": 0.23585716856177896,
"median_ttft": 0.16228120843879879,
"std_ttft": 0.22650684710420838,
"p99_ttft": 0.9727092415117659,
"mean_tpot": 0.0077327762167209045,
"mean_intvty": 129.3196611376983,
"median_tpot": 0.007737867681909417,
"median_intvty": 129.23456966548144,
"std_tpot": 5.775409995985195e-05,
"std_intvty": 17314.788053058655,
"p99_tpot": 0.0077997137425376405,
"p99_intvty": 128.20983346430475,
"mean_itl": 0.07729228148225654,
"median_itl": 0.07709924201481044,
"std_itl": 0.0072439160326869135,
"p99_itl": 0.0789333217474632,
"mean_e2el": 56.38869945516344,
"median_e2el": 56.063278557616286,
"std_e2el": 3.6459918323687903,
"p99_e2el": 62.807044222501574
},
{
"hw": "b200-trt",
"conc": 128,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 1159.5478784445231,
"output_tput_per_gpu": 1030.556828784274,
"input_tput_per_gpu": 128.99104966024902,
"mean_ttft": 0.6193724907367141,
"median_ttft": 0.12335171387530863,
"std_ttft": 1.7086603163308625,
"p99_ttft": 8.785017051231119,
"mean_tpot": 0.030201921126960952,
"mean_intvty": 33.110476508969825,
"median_tpot": 0.030495277779910325,
"median_intvty": 32.79196232338568,
"std_tpot": 0.0007844207925992189,
"std_intvty": 1274.8259727874479,
"p99_tpot": 0.03091470544870734,
"p99_intvty": 32.34706543328278,
"mean_itl": 0.3018580616903422,
"median_itl": 0.3036098852753639,
"std_itl": 0.02974386703319824,
"p99_itl": 0.3768176507204771,
"mean_e2el": 223.11305392481046,
"median_e2el": 223.63744522770867,
"std_e2el": 16.255050400115127,
"p99_e2el": 251.49620704594534
},
{
"hw": "mi355x",
"conc": 8,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 55.24373276524562,
"output_tput_per_gpu": 49.08545196016652,
"input_tput_per_gpu": 6.1582808050790945,
"mean_ttft": 0.11430331721257972,
"median_ttft": 0.10377160750795156,
"std_ttft": 0.05087097783142219,
"p99_ttft": 0.4279247227599262,
"mean_tpot": 0.020003516319770227,
"mean_intvty": 49.991210745865835,
"median_tpot": 0.02001340986552734,
"median_intvty": 49.966497799182044,
"std_tpot": 0.0002023554710369502,
"std_intvty": 4941.798681674387,
"p99_tpot": 0.020399422245109794,
"p99_intvty": 49.02099618236604,
"mean_itl": 0.020003501757422475,
"median_itl": 0.01998833850666415,
"std_itl": 0.0017010912645314903,
"p99_itl": 0.02068295399512863,
"mean_e2el": 147.67963551012562,
"median_e2el": 148.09807548101526,
"std_e2el": 9.88927038367951,
"p99_e2el": 164.30006840537098
},
{
"hw": "b200-trt",
"conc": 8,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 123.67407766442474,
"output_tput_per_gpu": 109.89982001728063,
"input_tput_per_gpu": 13.774257647144111,
"mean_ttft": 0.15845683613733855,
"median_ttft": 0.10889484250219539,
"std_ttft": 0.16117593159370863,
"p99_ttft": 0.8552776318872929,
"mean_tpot": 0.008852888569671074,
"mean_intvty": 112.95748185805469,
"median_tpot": 0.008866689599088002,
"median_intvty": 112.7816631928625,
"std_tpot": 8.357842398557749e-05,
"std_intvty": 11964.810441658514,
"p99_tpot": 0.008960612357056717,
"p99_intvty": 111.59951576439681,
"mean_itl": 0.08846228174596531,
"median_itl": 0.08800375199643895,
"std_itl": 0.008783482365832153,
"p99_itl": 0.09510139000485651,
"mean_e2el": 65.28991736308853,
"median_e2el": 65.36416252348863,
"std_e2el": 4.072927807813436,
"p99_e2el": 73.01343046826136
},
{
"hw": "b200",
"conc": 16,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 167.64031383863824,
"output_tput_per_gpu": 149.12492884993821,
"input_tput_per_gpu": 18.51538498870002,
"mean_ttft": 0.3599706758293905,
"median_ttft": 0.32331540738232434,
"std_ttft": 0.23735571388658588,
"p99_ttft": 1.2882861181348562,
"mean_tpot": 0.013201778526031455,
"mean_intvty": 75.74736979780306,
"median_tpot": 0.013214981195438581,
"median_intvty": 75.67169299833513,
"std_tpot": 0.00018406833967964386,
"std_intvty": 5432.764818438736,
"p99_tpot": 0.013412161026598613,
"p99_intvty": 74.55920026734161,
"mean_itl": 0.39532086919880477,
"median_itl": 0.3929669433273375,
"std_itl": 0.028288749684703105,
"p99_itl": 0.486848467560485,
"mean_e2el": 98.25500689553591,
"median_e2el": 98.45488342223689,
"std_e2el": 6.288217494394471,
"p99_e2el": 108.50540081015788
},
{
"hw": "mi325x",
"conc": 32,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 149.07504678331114,
"output_tput_per_gpu": 132.61014153350186,
"input_tput_per_gpu": 16.464905249809277,
"mean_ttft": 0.3549396021608118,
"median_ttft": 0.1474674174678512,
"std_ttft": 0.6331543550241097,
"p99_ttft": 2.293132298839046,
"mean_tpot": 0.02933238956004298,
"mean_intvty": 34.092005970158496,
"median_tpot": 0.02924244110047285,
"median_intvty": 34.19687147745781,
"std_tpot": 0.0005862761801249004,
"std_intvty": 1705.6807591721702,
"p99_tpot": 0.030131263042999318,
"p99_intvty": 33.18812087541546,
"mean_itl": 0.029336469494843566,
"median_itl": 0.029124516993761063,
"std_itl": 0.0046986278167952715,
"p99_itl": 0.03548776999814438,
"mean_e2el": 217.88403548263085,
"median_e2el": 217.9885662390734,
"std_e2el": 14.474779969531548,
"p99_e2el": 243.96799409227913
},
{
"hw": "h200",
"conc": 32,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 173.5435503859022,
"output_tput_per_gpu": 154.37617009339257,
"input_tput_per_gpu": 19.16738029250962,
"mean_ttft": 0.6128344889613799,
"median_ttft": 0.12730724876746535,
"std_ttft": 1.4479205826595112,
"p99_ttft": 5.85464629419148,
"mean_tpot": 0.025241350749060677,
"mean_intvty": 39.61753116707566,
"median_tpot": 0.025358806835870537,
"median_intvty": 39.434031990238594,
"std_tpot": 0.0005149257964414731,
"std_intvty": 1942.027389015576,
"p99_tpot": 0.026054265867462355,
"p99_intvty": 38.38143070647181,
"mean_itl": 0.25222450501119986,
"median_itl": 0.25118887796998024,
"std_itl": 0.0381622812787615,
"p99_itl": 0.3103050826117396,
"mean_e2el": 187.77681645620032,
"median_e2el": 187.84593547368422,
"std_e2el": 12.344801762110093,
"p99_e2el": 210.61277790937572
},
{
"hw": "b200",
"conc": 8,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 214.3390348417068,
"output_tput_per_gpu": 190.4669256212058,
"input_tput_per_gpu": 23.872109220500988,
"mean_ttft": 0.23286080496618525,
"median_ttft": 0.1716257450170815,
"std_ttft": 0.19020944256014014,
"p99_ttft": 0.7953700893186033,
"mean_tpot": 0.010228530696746962,
"mean_intvty": 97.76575244751778,
"median_tpot": 0.010280154489784091,
"median_intvty": 97.27480272730827,
"std_tpot": 0.00019650053233339216,
"std_intvty": 5089.04473756515,
"p99_tpot": 0.01048336180778975,
"p99_intvty": 95.38924806133673,
"mean_itl": 0.10220962860786612,
"median_itl": 0.1018959719222039,
"std_itl": 0.01217698474079439,
"p99_itl": 0.11139243337325751,
"mean_e2el": 75.47447873156052,
"median_e2el": 75.97487641870975,
"std_e2el": 4.712139781913288,
"p99_e2el": 84.85427310601808
},
{
"hw": "b200",
"conc": 128,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 1169.0206778284037,
"output_tput_per_gpu": 1038.9714278313097,
"input_tput_per_gpu": 130.04924999709397,
"mean_ttft": 5.701686214784146,
"median_ttft": 0.6020273839967558,
"std_ttft": 18.129279350747098,
"p99_ttft": 70.63496936401584,
"mean_tpot": 0.029253464012578403,
"mean_intvty": 34.18398585446223,
"median_tpot": 0.029347979945975503,
"median_intvty": 34.073895438146856,
"std_tpot": 0.001523178051748064,
"std_intvty": 656.5220650680709,
"p99_tpot": 0.03746227798088672,
"p99_intvty": 26.69351822412403,
"mean_itl": 0.2924172466932119,
"median_itl": 0.282413237990113,
"std_itl": 0.4186543280370853,
"p99_itl": 0.38571694021229636,
"mean_e2el": 221.20747669783492,
"median_e2el": 218.44746757950634,
"std_e2el": 23.88923366769935,
"p99_e2el": 295.8263828896516
},
{
"hw": "b200",
"conc": 64,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 745.7092304526851,
"output_tput_per_gpu": 662.932295579202,
"input_tput_per_gpu": 82.77693487348313,
"mean_ttft": 7.6850932168738595,
"median_ttft": 0.5199233749881387,
"std_ttft": 21.77810783509573,
"p99_ttft": 74.14294052171986,
"mean_tpot": 0.022488335723987698,
"mean_intvty": 44.46749694035059,
"median_tpot": 0.022727286641358,
"median_intvty": 43.99997306234741,
"std_tpot": 0.0008384036833017164,
"std_intvty": 1192.742851584217,
"p99_tpot": 0.023485360843916437,
"p99_intvty": 42.57971621751924,
"mean_itl": 0.2247697740112123,
"median_itl": 0.2215120104374364,
"std_itl": 0.11300552574812044,
"p99_itl": 0.3262464260542765,
"mean_e2el": 173.77483073009523,
"median_e2el": 169.7767184630502,
"std_e2el": 22.20421556462337,
"p99_e2el": 243.4324492432736
},
{
"hw": "b200-trt",
"conc": 8,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 213.9092834240448,
"output_tput_per_gpu": 190.08503796661276,
"input_tput_per_gpu": 23.824245457432028,
"mean_ttft": 0.15428784805299073,
"median_ttft": 0.11009138000372332,
"std_ttft": 0.14714097796760167,
"p99_ttft": 0.800635676436068,
"mean_tpot": 0.010256931780432719,
"mean_intvty": 97.49504251434263,
"median_tpot": 0.010232394649784721,
"median_intvty": 97.72883418067137,
"std_tpot": 0.00017737056177883758,
"std_intvty": 5637.91414973864,
"p99_tpot": 0.010534880752731783,
"p99_intvty": 94.92276405128663,
"mean_itl": 0.10247506053731165,
"median_itl": 0.102041463018395,
"std_itl": 0.008811655259432687,
"p99_itl": 0.11037171899806708,
"mean_e2el": 75.60283188757822,
"median_e2el": 75.79459682649758,
"std_e2el": 4.674410612447948,
"p99_e2el": 85.13889234709116
},
{
"hw": "h200",
"conc": 8,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 82.35243706592257,
"output_tput_per_gpu": 73.18039626773567,
"input_tput_per_gpu": 9.172040798186899,
"mean_ttft": 0.15282060297904537,
"median_ttft": 0.11794836039189249,
"std_ttft": 0.10640877923339227,
"p99_ttft": 0.4929962460929528,
"mean_tpot": 0.013307281646051277,
"mean_intvty": 75.14682762401245,
"median_tpot": 0.013302298252553194,
"median_intvty": 75.1749796173803,
"std_tpot": 0.00030476296019311613,
"std_intvty": 3281.238636632024,
"p99_tpot": 0.013834435627893863,
"p99_intvty": 72.28339680035353,
"mean_itl": 0.13298224435266362,
"median_itl": 0.13288164208643138,
"std_itl": 0.010239091859559768,
"p99_itl": 0.14251333486987278,
"mean_e2el": 98.04769963698928,
"median_e2el": 98.53893088561017,
"std_e2el": 6.363193774011107,
"p99_e2el": 110.14182019345228
},
{
"hw": "b200-trt",
"conc": 32,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 210.73739795323533,
"output_tput_per_gpu": 187.46206539583636,
"input_tput_per_gpu": 23.27533255739897,
"mean_ttft": 0.37232352749852,
"median_ttft": 0.139869321603328,
"std_ttft": 0.7705771290635787,
"p99_ttft": 3.8957845793198795,
"mean_tpot": 0.02075428402270186,
"mean_intvty": 48.182823310414385,
"median_tpot": 0.020685438104923237,
"median_intvty": 48.34318687995277,
"std_tpot": 0.0004094031099739064,
"std_intvty": 2442.58037039273,
"p99_tpot": 0.021476557116910704,
"p99_intvty": 46.56239799314002,
"mean_itl": 0.20740986191706975,
"median_itl": 0.20511164935305715,
"std_itl": 0.022423491245457533,
"p99_itl": 0.2903660201281309,
"mean_e2el": 154.28145955922082,
"median_e2el": 154.78393408446573,
"std_e2el": 10.226985375739313,
"p99_e2el": 173.39835903049914
},
{
"hw": "b200-trt",
"conc": 4,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 128.4970462962306,
"output_tput_per_gpu": 113.95536254685287,
"input_tput_per_gpu": 14.541683749377725,
"mean_ttft": 0.13394186227596946,
"median_ttft": 0.11219134999555536,
"std_ttft": 0.06999313235081384,
"p99_ttft": 0.39899931744090283,
"mean_tpot": 0.008627246319232439,
"mean_intvty": 115.91184058008551,
"median_tpot": 0.008640788311393338,
"median_intvty": 115.73018154853382,
"std_tpot": 6.943361723646516e-05,
"std_intvty": 14402.245479943394,
"p99_tpot": 0.008722108638023687,
"p99_intvty": 114.65117456122245,
"mean_itl": 0.08620798469311157,
"median_itl": 0.08627977549622301,
"std_itl": 0.005650660867934915,
"p99_itl": 0.08946500154270325,
"mean_e2el": 62.776973726350846,
"median_e2el": 62.09035408651107,
"std_e2el": 3.9495410656955348,
"p99_e2el": 70.02393900983064
},
{
"hw": "b200",
"conc": 32,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 493.6044989774338,
"output_tput_per_gpu": 439.0873179876716,
"input_tput_per_gpu": 54.51718098976221,
"mean_ttft": 7.432321993267396,
"median_ttft": 0.40874849166721106,
"std_ttft": 21.51742527067161,
"p99_ttft": 73.31201935719932,
"mean_tpot": 0.016823654216435828,
"mean_intvty": 59.44011848644942,
"median_tpot": 0.016811768749865032,
"median_intvty": 59.482141045273906,
"std_tpot": 0.0006464238898084782,
"std_intvty": 1546.9725295831486,
"p99_tpot": 0.01743925210959571,
"p99_intvty": 57.34190857013665,
"mean_itl": 0.16819094971662954,
"median_itl": 0.16730563016608357,
"std_itl": 0.15099456376975748,
"p99_itl": 0.26420700992457563,
"mean_e2el": 132.22002020914223,
"median_e2el": 127.03417874488514,
"std_e2el": 21.55887190076877,
"p99_e2el": 203.53194936962566
},
{
"hw": "h200",
"conc": 4,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 47.33511143164344,
"output_tput_per_gpu": 41.97831732220013,
"input_tput_per_gpu": 5.356794109443314,
"mean_ttft": 0.1473709739279002,
"median_ttft": 0.11576855648308992,
"std_ttft": 0.09319836141002832,
"p99_ttft": 0.4667519399523735,
"mean_tpot": 0.011710144772835798,
"mean_intvty": 85.39604073211079,
"median_tpot": 0.011778905275371274,
"median_intvty": 84.89753305775521,
"std_tpot": 0.00015957424745397487,
"std_intvty": 6266.675331108327,
"p99_tpot": 0.011935370826058457,
"p99_intvty": 83.78457733518454,
"mean_itl": 0.11704383502680363,
"median_itl": 0.11713421251624823,
"std_itl": 0.006826939533740474,
"p99_itl": 0.12065608169883489,
"mean_e2el": 85.17971696318128,
"median_e2el": 84.08774134423584,
"std_e2el": 5.51207817854129,
"p99_e2el": 95.46320932446048
},
{
"hw": "b200-trt",
"conc": 16,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 125.17624761028279,
"output_tput_per_gpu": 111.22225782132352,
"input_tput_per_gpu": 13.953989788959262,
"mean_ttft": 0.25593930239556356,
"median_ttft": 0.1380374883301556,
"std_ttft": 0.39730918595140496,
"p99_ttft": 2.0552143385726955,
"mean_tpot": 0.0174800674256679,
"mean_intvty": 57.2080173175757,
"median_tpot": 0.017553941170248964,
"median_intvty": 56.96726394952463,
"std_tpot": 0.0002754686115410233,
"std_intvty": 3630.1776612798517,
"p99_tpot": 0.01783270721776973,
"p99_intvty": 56.076735169157686,
"mean_itl": 0.17466602872521547,
"median_itl": 0.17322744010016322,
"std_itl": 0.015957713575060248,
"p99_itl": 0.26186478717252604,
"mean_e2el": 129.19985986636894,
"median_e2el": 129.21294432575814,
"std_e2el": 8.752435232499215,
"p99_e2el": 144.72183022453916
},
{
"hw": "mi355x",
"conc": 64,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "amd/DeepSeek-R1-0528-MXFP4-Preview",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 365.70929314114,
"output_tput_per_gpu": 325.02778893938785,
"input_tput_per_gpu": 40.68150420175215,
"mean_ttft": 0.2809388529907703,
"median_ttft": 0.11732633999781683,
"std_ttft": 0.5982810996057047,
"p99_ttft": 3.1583510815049523,
"mean_tpot": 0.0241420347523864,
"mean_intvty": 41.42152930589877,
"median_tpot": 0.024216005153343066,
"median_intvty": 41.295002774722654,
"std_tpot": 0.0004161867474784109,
"std_intvty": 2402.7675221731406,
"p99_tpot": 0.02464013332070945,
"p99_intvty": 40.58419599375802,
"mean_itl": 0.0241465946993526,
"median_itl": 0.0235596009879373,
"std_itl": 0.010480329991294909,
"p99_itl": 0.027628268278785898,
"mean_e2el": 178.1354855488883,
"median_e2el": 178.75642251200043,
"std_e2el": 12.599977100914721,
"p99_e2el": 200.14235153985558
},
{
"hw": "b200-trt",
"conc": 64,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 789.244958656121,
"output_tput_per_gpu": 701.6389764892265,
"input_tput_per_gpu": 87.60598216689448,
"mean_ttft": 0.3545958353795868,
"median_ttft": 0.09967933059670031,
"std_ttft": 0.8654790310348708,
"p99_ttft": 4.458355265744031,
"mean_tpot": 0.022123210895199138,
"mean_intvty": 45.201395255740465,
"median_tpot": 0.022388261645355387,
"median_intvty": 44.66626376985627,
"std_tpot": 0.0006012284469970078,
"std_intvty": 1663.2612861130588,
"p99_tpot": 0.022742752988220066,
"p99_intvty": 43.97005061427543,
"mean_itl": 0.22107004319614063,
"median_itl": 0.22208417719230056,
"std_itl": 0.02048052922756916,
"p99_itl": 0.2738785695843398,
"mean_e2el": 163.74504663605984,
"median_e2el": 163.376719574444,
"std_e2el": 11.367200264363568,
"p99_e2el": 184.22146883308423
},
{
"hw": "mi325x",
"conc": 4,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 25.00926941331968,
"output_tput_per_gpu": 22.179034033643493,
"input_tput_per_gpu": 2.8302353796761857,
"mean_ttft": 0.1945650352019584,
"median_ttft": 0.1354355959338136,
"std_ttft": 0.20913369654259453,
"p99_ttft": 0.9296139728580601,
"mean_tpot": 0.022149951056749294,
"mean_intvty": 45.14682661997534,
"median_tpot": 0.02216399624498401,
"median_intvty": 45.11821735334902,
"std_tpot": 0.00010546706191945704,
"std_intvty": 9481.63323980409,
"p99_tpot": 0.022291518162110377,
"p99_intvty": 44.860111937092405,
"mean_itl": 0.02215067676005398,
"median_itl": 0.02212534553837031,
"std_itl": 0.0019966941783802097,
"p99_itl": 0.02381244889926166,
"mean_e2el": 161.03173614102997,
"median_e2el": 158.98566846200265,
"std_e2el": 10.168315417749467,
"p99_e2el": 180.1862952242163
},
{
"hw": "mi355x",
"conc": 4,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 29.472950292493724,
"output_tput_per_gpu": 26.190386810986894,
"input_tput_per_gpu": 3.2825634815068305,
"mean_ttft": 0.10637625904782907,
"median_ttft": 0.1016932979982812,
"std_ttft": 0.020159473178362277,
"p99_ttft": 0.1935302241658792,
"mean_tpot": 0.018729165945885133,
"mean_intvty": 53.392660564241716,
"median_tpot": 0.018764340528389677,
"median_intvty": 53.29257367116318,
"std_tpot": 0.0001156361102950349,
"std_intvty": 8647.817688164985,
"p99_tpot": 0.01888723413503586,
"p99_intvty": 52.94581476834651,
"mean_itl": 0.018728510230564786,
"median_itl": 0.01873570599127561,
"std_itl": 0.0010659427090178933,
"p99_itl": 0.01914182021981105,
"mean_e2el": 137.89202582143844,
"median_e2el": 138.16276840300998,
"std_e2el": 8.521614932415599,
"p99_e2el": 153.42232726934134
},
{
"hw": "h200",
"conc": 16,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 130.7058574837833,
"output_tput_per_gpu": 116.1354558660242,
"input_tput_per_gpu": 14.570401617759089,
"mean_ttft": 0.18347789350082166,
"median_ttft": 0.13056494342163205,
"std_ttft": 0.15433891308751244,
"p99_ttft": 0.6595582411345094,
"mean_tpot": 0.016801224437803026,
"mean_intvty": 59.51947155410792,
"median_tpot": 0.01686199734125893,
"median_intvty": 59.30495538349665,
"std_tpot": 0.00034037691830393835,
"std_intvty": 2937.919542203075,
"p99_tpot": 0.017401916300491663,
"p99_intvty": 57.46493562733356,
"mean_itl": 0.16793039509483657,
"median_itl": 0.166595462244004,
"std_itl": 0.01593322893515242,
"p99_itl": 0.2687387232482433,
"mean_e2el": 124.1391997367784,
"median_e2el": 124.80164737161249,
"std_e2el": 8.781209943322265,
"p99_e2el": 140.20656632996165
},
{
"hw": "mi355x",
"conc": 4,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "amd/DeepSeek-R1-0528-MXFP4-Preview",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 42.24317623796109,
"output_tput_per_gpu": 37.53832293059118,
"input_tput_per_gpu": 4.704853307369909,
"mean_ttft": 0.24389379356252902,
"median_ttft": 0.19006067801092286,
"std_ttft": 0.2318422606966532,
"p99_ttft": 1.0989260780127312,
"mean_tpot": 0.013045875983248575,
"mean_intvty": 76.65257597757635,
"median_tpot": 0.013060339684729663,
"median_intvty": 76.5676869162304,
"std_tpot": 8.099382354120008e-05,
"std_intvty": 12346.620473983652,
"p99_tpot": 0.01315073452013815,
"p99_intvty": 76.04137992966608,
"mean_itl": 0.013045959113177887,
"median_itl": 0.013003522501094267,
"std_itl": 0.0041081421779227785,
"p99_itl": 0.013432815239939375,
"mean_e2el": 96.22301479596281,
"median_e2el": 96.3880987904995,
"std_e2el": 6.002918542016658,
"p99_e2el": 107.18358405745734
},
{
"hw": "b200",
"conc": 32,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 255.60343253259444,
"output_tput_per_gpu": 227.23151896297733,
"input_tput_per_gpu": 28.371913569617107,
"mean_ttft": 0.49488446963368915,
"median_ttft": 0.3946838341653347,
"std_ttft": 0.4968762937136306,
"p99_ttft": 2.5042326218634843,
"mean_tpot": 0.01723697602012324,
"mean_intvty": 58.01481645229151,
"median_tpot": 0.0172685510009224,
"median_intvty": 57.908738257574996,
"std_tpot": 0.000365233226292827,
"std_intvty": 2737.976525712495,
"p99_tpot": 0.017690061627360622,
"p99_intvty": 56.5289155608895,
"mean_itl": 0.5160950759614041,
"median_itl": 0.5094273081049323,
"std_itl": 0.04580699748870072,
"p99_itl": 0.6133324337750673,
"mean_e2el": 127.79699238625399,
"median_e2el": 127.41039985604584,
"std_e2el": 8.702541429006956,
"p99_e2el": 143.01392535069957
},
{
"hw": "b200-trt",
"conc": 64,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 340.6079895135305,
"output_tput_per_gpu": 302.80059254765985,
"input_tput_per_gpu": 37.80739696587068,
"mean_ttft": 0.5616503246455977,
"median_ttft": 0.14467595354653895,
"std_ttft": 1.4110155581237322,
"p99_ttft": 7.2114844815246775,
"mean_tpot": 0.025579447245190446,
"mean_intvty": 39.09388621319892,
"median_tpot": 0.02570034425588219,
"median_intvty": 38.90998463069708,
"std_tpot": 0.0006600501662202635,
"std_intvty": 1515.0363581096242,
"p99_tpot": 0.026338630899028995,
"p99_intvty": 37.96704558538258,
"mean_itl": 0.2556301489731067,
"median_itl": 0.2522624786943197,
"std_itl": 0.031055477690684903,
"p99_itl": 0.3474832373112439,
"mean_e2el": 189.49509731409051,
"median_e2el": 188.76622377941385,
"std_e2el": 13.380831363675671,
"p99_e2el": 213.90821056747808
},
{
"hw": "b200",
"conc": 8,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 124.91914945916785,
"output_tput_per_gpu": 111.0062213645552,
"input_tput_per_gpu": 13.912928094612653,
"mean_ttft": 0.27092938439454883,
"median_ttft": 0.16794981097336859,
"std_ttft": 0.307751596931458,
"p99_ttft": 1.2043114202679135,
"mean_tpot": 0.008751911721316771,
"mean_intvty": 114.26075031862236,
"median_tpot": 0.008759890339391247,
"median_intvty": 114.15668019304145,
"std_tpot": 9.248156543134721e-05,
"std_intvty": 10812.965755238434,
"p99_tpot": 0.008911993879958707,
"p99_intvty": 112.20833558344337,
"mean_itl": 0.08746984302891872,
"median_itl": 0.0868434295989573,
"std_itl": 0.012977793453391047,
"p99_itl": 0.09126268964726478,
"mean_e2el": 64.66185414382198,
"median_e2el": 64.57978936447762,
"std_e2el": 4.063936262635185,
"p99_e2el": 72.43876064793672
},
{
"hw": "b200-trt",
"conc": 256,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "true",
"tput_per_gpu": 1290.0378209854625,
"output_tput_per_gpu": 1146.8517163884992,
"input_tput_per_gpu": 143.18610459696333,
"mean_ttft": 2.0123381735887165,
"median_ttft": 1.5027231744898017,
"std_ttft": 1.6890939632706135,
"p99_ttft": 8.367195134090723,
"mean_tpot": 0.0265071901250993,
"mean_intvty": 37.725613136683,
"median_tpot": 0.026678426933793305,
"median_intvty": 37.483469414506956,
"std_tpot": 0.0004886807093079002,
"std_intvty": 2046.3259157830514,
"p99_tpot": 0.026979852392687412,
"p99_intvty": 37.064694997035595,
"mean_itl": 0.26490584294551484,
"median_itl": 0.2596212279750034,
"std_itl": 0.03373991027504487,
"p99_itl": 0.35413253950420776,
"mean_e2el": 197.5483433707905,
"median_e2el": 197.47766612800478,
"std_e2el": 13.278730246378927,
"p99_e2el": 221.44565634200288
},
{
"hw": "mi355x",
"conc": 8,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "amd/DeepSeek-R1-0528-MXFP4-Preview",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 80.3206941509057,
"output_tput_per_gpu": 71.3669655688405,
"input_tput_per_gpu": 8.953728582065196,
"mean_ttft": 0.26120445321103036,
"median_ttft": 0.1882245889864862,
"std_ttft": 0.36479455683614453,
"p99_ttft": 1.901500097619719,
"mean_tpot": 0.013729653244716694,
"mean_intvty": 72.83505141579667,
"median_tpot": 0.013756569518621161,
"median_intvty": 72.69254145420342,
"std_tpot": 0.0001116336513328328,
"std_intvty": 8957.872362505876,
"p99_tpot": 0.013915217290199499,
"p99_intvty": 71.86377180788266,
"mean_itl": 0.01373138036098023,
"median_itl": 0.013627767009893432,
"std_itl": 0.005475939114372205,
"p99_itl": 0.014192348168580794,
"mean_e2el": 101.55725391327833,
"median_e2el": 101.52365214299061,
"std_e2el": 6.931600963079937,
"p99_e2el": 112.80110447025334
},
{
"hw": "b200",
"conc": 32,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 8192,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 311.8440235204573,
"output_tput_per_gpu": 277.4017582938236,
"input_tput_per_gpu": 34.44226522663371,
"mean_ttft": 7.2506422571794245,
"median_ttft": 0.3336782940023113,
"std_ttft": 21.164970340579018,
"p99_ttft": 72.05787785650463,
"mean_tpot": 0.013124735347766977,
"mean_intvty": 76.19201252466691,
"median_tpot": 0.013210795163434479,
"median_intvty": 75.6956706715014,
"std_tpot": 0.0005719620287032628,
"std_intvty": 1748.3678108268366,
"p99_tpot": 0.013473507168303757,
"p99_intvty": 74.2197252362389,
"mean_itl": 0.13121283411370652,
"median_itl": 0.12904534899280407,
"std_itl": 0.14721598671593747,
"p99_itl": 0.2183757746068295,
"mean_e2el": 104.60277428515101,
"median_e2el": 99.36814783599402,
"std_e2el": 20.67191180470279,
"p99_e2el": 172.66963572183536
}
]
[
{
"hw": "b200-trt",
"conc": 8,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 490.7455225705645,
"output_tput_per_gpu": 55.168073772806416,
"input_tput_per_gpu": 435.5774487977581,
"mean_ttft": 0.47278218367137015,
"median_ttft": 0.32939656591042876,
"std_ttft": 0.38185631954500926,
"p99_ttft": 2.1659555163606994,
"mean_tpot": 0.01729677032393398,
"mean_intvty": 57.81426134891059,
"median_tpot": 0.01739217080013858,
"median_intvty": 57.49713543475735,
"std_tpot": 0.0006798479662186078,
"std_intvty": 1470.917101601575,
"p99_tpot": 0.018531012508799273,
"p99_intvty": 53.96359208786674,
"mean_itl": 0.1721585565924206,
"median_itl": 0.15638201730325818,
"std_itl": 0.07709320837542873,
"p99_itl": 0.4401542016863823,
"mean_e2el": 16.489983734406998,
"median_e2el": 16.359812069917098,
"std_e2el": 1.5009343391601329,
"p99_e2el": 20.033575170855034
},
{
"hw": "mi355x",
"conc": 16,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "amd/DeepSeek-R1-0528-MXFP4-Preview",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 802.8428820551818,
"output_tput_per_gpu": 88.90200202168872,
"input_tput_per_gpu": 713.940880033493,
"mean_ttft": 0.5166970259918889,
"median_ttft": 0.3985831239988329,
"std_ttft": 0.5898435563333398,
"p99_ttft": 4.558906032857485,
"mean_tpot": 0.02184615116630585,
"mean_intvty": 45.77465350245942,
"median_tpot": 0.021682647480500764,
"median_intvty": 46.11982927358393,
"std_tpot": 0.001226003000489979,
"std_intvty": 815.658688926817,
"p99_tpot": 0.025910351617784596,
"p99_intvty": 38.594613255406784,
"mean_itl": 0.02185711061204737,
"median_itl": 0.016006026999093592,
"std_itl": 0.03782937626545686,
"p99_itl": 0.19949003738583915,
"mean_e2el": 20.620621201734977,
"median_e2el": 20.475688248989172,
"std_e2el": 1.9291925013978874,
"p99_e2el": 27.273808844043522
},
{
"hw": "b200-trt",
"conc": 32,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1784.342020452637,
"output_tput_per_gpu": 199.53908362944154,
"input_tput_per_gpu": 1584.8029368231953,
"mean_ttft": 0.5289549870751216,
"median_ttft": 0.2255440845619887,
"std_ttft": 0.9671436790256961,
"p99_ttft": 5.107263615098782,
"mean_tpot": 0.019014835322597123,
"mean_intvty": 52.59051593318853,
"median_tpot": 0.019295236422416623,
"median_intvty": 51.82626313084354,
"std_tpot": 0.0014936002681296275,
"std_intvty": 669.5231792186659,
"p99_tpot": 0.021456263733912625,
"p99_intvty": 46.606436814973215,
"mean_itl": 0.18929225326811402,
"median_itl": 0.14069776004180312,
"std_itl": 0.11496910796421657,
"p99_itl": 0.6102185765095055,
"mean_e2el": 18.131359762921058,
"median_e2el": 18.065733368042856,
"std_e2el": 2.1365388543707295,
"p99_e2el": 23.75729007682763
},
{
"hw": "mi355x",
"conc": 64,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1312.0150189933338,
"output_tput_per_gpu": 145.92981337573372,
"input_tput_per_gpu": 1166.0852056176,
"mean_ttft": 0.7064205789287734,
"median_ttft": 0.3870421100000385,
"std_ttft": 2.0107288322323558,
"p99_ttft": 16.949745480288403,
"mean_tpot": 0.05381940959614204,
"mean_intvty": 18.58065719233909,
"median_tpot": 0.05419778648235825,
"median_intvty": 18.450938034628166,
"std_tpot": 0.002597678608348323,
"std_intvty": 384.9590926245599,
"p99_tpot": 0.05686826103903246,
"p99_intvty": 17.584501121172558,
"mean_itl": 0.05383615815912863,
"median_itl": 0.03308859499520622,
"std_itl": 0.07244341641030112,
"p99_itl": 0.30686196365801144,
"mean_e2el": 50.31938420279839,
"median_e2el": 50.308018951516715,
"std_e2el": 4.278553279683933,
"p99_e2el": 59.3132445957916
},
{
"hw": "h200",
"conc": 16,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 773.4306025073362,
"output_tput_per_gpu": 85.70105852985051,
"input_tput_per_gpu": 687.7295439774857,
"mean_ttft": 0.671161383902654,
"median_ttft": 0.3430789574049413,
"std_ttft": 0.9927491191070767,
"p99_ttft": 4.90843079039827,
"mean_tpot": 0.021999170360614634,
"mean_intvty": 45.45625965015078,
"median_tpot": 0.02218140413253213,
"median_intvty": 45.08280873587079,
"std_tpot": 0.0014620476139260005,
"std_intvty": 683.9722526646889,
"p99_tpot": 0.025306446806529253,
"p99_intvty": 39.51562254650434,
"mean_itl": 0.21915107519481225,
"median_itl": 0.1779776830226183,
"std_itl": 0.13229147963327545,
"p99_itl": 0.7472931426204745,
"mean_e2el": 20.80429652701132,
"median_e2el": 20.793710296507925,
"std_e2el": 2.225848314125775,
"p99_e2el": 26.14124676471576
},
{
"hw": "b200-trt",
"conc": 4,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 275.06097653368056,
"output_tput_per_gpu": 30.598356504577875,
"input_tput_per_gpu": 244.46262002910268,
"mean_ttft": 0.37592672517421305,
"median_ttft": 0.33001660399895627,
"std_ttft": 0.17418341437838386,
"p99_ttft": 1.0815924787681432,
"mean_tpot": 0.015555125263081565,
"mean_intvty": 64.28749258441484,
"median_tpot": 0.015618892197557573,
"median_intvty": 64.02502734197606,
"std_tpot": 0.0003064976982034653,
"std_intvty": 3262.667243054336,
"p99_tpot": 0.016007779930479155,
"p99_intvty": 62.46962441656126,
"mean_itl": 0.15456007267786598,
"median_itl": 0.14807478299189825,
"std_itl": 0.04551982244229457,
"p99_itl": 0.40876249720575225,
"mean_e2el": 14.672733181676449,
"median_e2el": 14.5178090959962,
"std_e2el": 1.012110406754354,
"p99_e2el": 16.564466049517506
},
{
"hw": "h200",
"conc": 8,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 543.7960327448303,
"output_tput_per_gpu": 61.131845879475165,
"input_tput_per_gpu": 482.66418686535513,
"mean_ttft": 0.5419170456007123,
"median_ttft": 0.35557929053902626,
"std_ttft": 0.5255622406846244,
"p99_ttft": 2.502335314434022,
"mean_tpot": 0.015515354079148053,
"mean_intvty": 64.45228351855376,
"median_tpot": 0.015664260003038825,
"median_intvty": 63.839594069940276,
"std_tpot": 0.0007023671531935608,
"std_intvty": 1423.7567851132362,
"p99_tpot": 0.01635325241736486,
"p99_intvty": 61.14991528768554,
"mean_itl": 0.15461412260099675,
"median_itl": 0.13652569521218538,
"std_itl": 0.07912890869799177,
"p99_itl": 0.4479849353805181,
"mean_e2el": 14.911366847297177,
"median_e2el": 14.904470334760845,
"std_e2el": 1.4380462756115002,
"p99_e2el": 18.253084476822984
},
{
"hw": "gb200",
"conc": 24,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "false",
"decode_num_workers": 3,
"num_prefill_gpu": 4,
"num_decode_gpu": 24,
"tput_per_gpu": 761.4860058693941,
"output_tput_per_gpu": 98.70133764900027,
"input_tput_per_gpu": 4738.194015191758,
"mean_ttft": 0.594742333044754,
"median_ttft": 0.3296173345297575,
"std_ttft": 0.7710360856366048,
"p99_ttft": 3.9206191188166857,
"mean_tpot": 0.009148877402988682,
"mean_intvty": 109.30302767783598,
"median_tpot": 0.009146032545963089,
"median_intvty": 109.33702618862687,
"std_tpot": 0.00017189360352056764,
"std_intvty": 5817.552134104551,
"p99_tpot": 0.009625820167651117,
"p99_intvty": 103.8872514324168,
"mean_itl": 0.1808691778678854,
"median_itl": 0.18200021819211543,
"std_itl": 0.019418232540398195,
"p99_itl": 0.19711027250625193,
"mean_e2el": 9.002332920823392,
"median_e2el": 8.84191721456591,
"std_e2el": 0.9525785078960318,
"p99_e2el": 12.555758907049896
},
{
"hw": "gb200",
"conc": 12,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "false",
"decode_num_workers": 3,
"num_prefill_gpu": 4,
"num_decode_gpu": 24,
"tput_per_gpu": 440.86576142822025,
"output_tput_per_gpu": 57.16921063911058,
"input_tput_per_gpu": 2743.0450661628784,
"mean_ttft": 0.37804063807804295,
"median_ttft": 0.2237494420260191,
"std_ttft": 0.3853108396343019,
"p99_ttft": 1.9958291072049172,
"mean_tpot": 0.008029833455048341,
"mean_intvty": 124.53558415602032,
"median_tpot": 0.008054892344783428,
"median_intvty": 124.1481521038115,
"std_tpot": 8.591974983132742e-05,
"std_intvty": 11638.76759375046,
"p99_tpot": 0.008133869363492526,
"p99_intvty": 122.94271708964592,
"mean_itl": 0.15870654870776307,
"median_itl": 0.1602090890519321,
"std_itl": 0.01394172095451682,
"p99_itl": 0.1739094460220076,
"mean_e2el": 7.721524789594696,
"median_e2el": 7.67230872658547,
"std_e2el": 0.6401566325935075,
"p99_e2el": 9.856081473652738
},
{
"hw": "gb200",
"conc": 6,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "false",
"decode_num_workers": 3,
"num_prefill_gpu": 4,
"num_decode_gpu": 24,
"tput_per_gpu": 242.63256165655244,
"output_tput_per_gpu": 31.706826918526062,
"input_tput_per_gpu": 1508.1869700847105,
"mean_ttft": 0.3628668673482025,
"median_ttft": 0.22828348551411182,
"std_ttft": 0.2673821037091886,
"p99_ttft": 1.2782599582197147,
"mean_tpot": 0.007296252724131734,
"mean_intvty": 137.0566560410641,
"median_tpot": 0.007219382767591185,
"median_intvty": 138.5159967537861,
"std_tpot": 0.00025250965381365376,
"std_intvty": 3960.2446278666903,
"p99_tpot": 0.008207775472551828,
"p99_intvty": 121.83569145430049,
"mean_itl": 0.14405713339773826,
"median_itl": 0.1445109189953655,
"std_itl": 0.01985317571162597,
"p99_itl": 0.23120440877974008,
"mean_e2el": 7.088534150185296,
"median_e2el": 7.1553651300491765,
"std_e2el": 0.5301320004769424,
"p99_e2el": 8.354706758961548
},
{
"hw": "gb200",
"conc": 3,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "false",
"decode_num_workers": 3,
"num_prefill_gpu": 4,
"num_decode_gpu": 24,
"tput_per_gpu": 136.43294579598606,
"output_tput_per_gpu": 18.01104155751347,
"input_tput_per_gpu": 846.9643712268215,
"mean_ttft": 0.23349892749683931,
"median_ttft": 0.20825126150157303,
"std_ttft": 0.07642163845054613,
"p99_ttft": 0.5119086358556523,
"mean_tpot": 0.006625854144684956,
"mean_intvty": 150.92393798045907,
"median_tpot": 0.006624274571530683,
"median_intvty": 150.9599261325498,
"std_tpot": 5.263836789946911e-05,
"std_intvty": 18997.549504381303,
"p99_tpot": 0.006743796422528986,
"p99_intvty": 148.28442873205694,
"mean_itl": 0.13090612075011115,
"median_itl": 0.13212866010144353,
"std_itl": 0.011258962567458015,
"p99_itl": 0.138612048481591,
"mean_e2el": 6.4460851094433265,
"median_e2el": 6.588208189001307,
"std_e2el": 0.45506480666563215,
"p99_e2el": 7.056479731758591
},
{
"hw": "gb200",
"conc": 102,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "false",
"decode_num_workers": 3,
"num_prefill_gpu": 4,
"num_decode_gpu": 24,
"tput_per_gpu": 1577.6322622904477,
"output_tput_per_gpu": 203.77719448223317,
"input_tput_per_gpu": 9820.762669139734,
"mean_ttft": 7.010432495785519,
"median_ttft": 6.6427035295637324,
"std_ttft": 2.063952149136987,
"p99_ttft": 16.75019682894927,
"mean_tpot": 0.012250833148424662,
"mean_intvty": 81.62710142930892,
"median_tpot": 0.012402664649735537,
"median_intvty": 80.62783508552923,
"std_tpot": 0.0005069500479492191,
"std_intvty": 1972.5809358246072,
"p99_tpot": 0.012536323704789921,
"p99_intvty": 79.76820187069009,
"mean_itl": 0.2423397053689313,
"median_itl": 0.24737928295508027,
"std_itl": 0.023138919552489555,
"p99_itl": 0.2619257148867473,
"mean_e2el": 18.27863471651354,
"median_e2el": 18.081961779971607,
"std_e2el": 2.325368755434448,
"p99_e2el": 27.8889432685217
},
{
"hw": "gb200",
"conc": 48,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "false",
"decode_num_workers": 3,
"num_prefill_gpu": 4,
"num_decode_gpu": 24,
"tput_per_gpu": 1257.7501783176956,
"output_tput_per_gpu": 164.20630950595387,
"input_tput_per_gpu": 7819.013391188146,
"mean_ttft": 0.9185124930663733,
"median_ttft": 0.4320086045190692,
"std_ttft": 1.5513275647369027,
"p99_ttft": 7.753300683994315,
"mean_tpot": 0.010638501103430278,
"mean_intvty": 93.99820428439492,
"median_tpot": 0.010704722419802744,
"median_intvty": 93.41671467819593,
"std_tpot": 0.00031234579290101113,
"std_intvty": 3201.5798602958,
"p99_tpot": 0.01104477181651935,
"p99_intvty": 90.5405758138279,
"mean_itl": 0.21051679319774125,
"median_itl": 0.21281073952559382,
"std_itl": 0.01934866488296067,
"p99_itl": 0.23799534296616912,
"mean_e2el": 10.777715545666675,
"median_e2el": 10.47987555898726,
"std_e2el": 1.6945846099077986,
"p99_e2el": 17.529918673820802
},
{
"hw": "h200",
"conc": 32,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 946.6983809880413,
"output_tput_per_gpu": 105.86721898075923,
"input_tput_per_gpu": 840.831162007282,
"mean_ttft": 0.9950035052024759,
"median_ttft": 0.400475081987679,
"std_ttft": 1.9245196543264513,
"p99_ttft": 10.120028141587973,
"mean_tpot": 0.03590065801533644,
"mean_intvty": 27.85464265231041,
"median_tpot": 0.0363796425978401,
"median_intvty": 27.487900611188824,
"std_tpot": 0.002986718785089788,
"std_intvty": 334.8155859172853,
"p99_tpot": 0.04280596001275761,
"p99_intvty": 23.36123286808581,
"mean_itl": 0.35741948770541876,
"median_itl": 0.265468068420887,
"std_itl": 0.2233565105083265,
"p99_itl": 1.1635619221627713,
"mean_e2el": 34.23166489775176,
"median_e2el": 34.100104039534926,
"std_e2el": 4.271861819664215,
"p99_e2el": 46.03864844253287
},
{
"hw": "mi355x",
"conc": 32,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "amd/DeepSeek-R1-0528-MXFP4-Preview",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1102.293180239758,
"output_tput_per_gpu": 122.15008417254694,
"input_tput_per_gpu": 980.143096067211,
"mean_ttft": 0.6389724074593142,
"median_ttft": 0.4089398225041805,
"std_ttft": 1.2060549337569364,
"p99_ttft": 8.970479531380406,
"mean_tpot": 0.03191911367044807,
"mean_intvty": 31.32919072642791,
"median_tpot": 0.031730023001261196,
"median_intvty": 31.515892691292795,
"std_tpot": 0.00215010263195497,
"std_intvty": 465.09407743515715,
"p99_tpot": 0.037819377295942994,
"p99_intvty": 26.44147184589613,
"mean_itl": 0.031942659746625045,
"median_itl": 0.0205818570102565,
"std_itl": 0.05527674820121271,
"p99_itl": 0.21809745996433777,
"mean_e2el": 29.987768115863656,
"median_e2el": 29.790100703496137,
"std_e2el": 3.106982903099679,
"p99_e2el": 39.85459824586752
},
{
"hw": "b200",
"conc": 8,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 902.5200930080148,
"output_tput_per_gpu": 101.4584805821573,
"input_tput_per_gpu": 801.0616124258574,
"mean_ttft": 6.504806140399159,
"median_ttft": 0.2960052300040843,
"std_ttft": 19.913883015608675,
"p99_ttft": 71.13014924332441,
"mean_tpot": 0.012459474521871856,
"mean_intvty": 80.2602066599647,
"median_tpot": 0.011560908525177526,
"median_intvty": 86.498392217375,
"std_tpot": 0.008174706292436482,
"std_intvty": 122.32855398428619,
"p99_tpot": 0.02765882277460141,
"p99_intvty": 36.15482871954629,
"mean_itl": 0.12442615911055785,
"median_itl": 0.10253587400075048,
"std_itl": 0.8209611799087925,
"p99_itl": 0.2929841907584341,
"mean_e2el": 18.068662138199215,
"median_e2el": 11.066165534997708,
"std_e2el": 21.27009450976325,
"p99_e2el": 83.11342277751245
},
{
"hw": "b200",
"conc": 8,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 583.8128537944978,
"output_tput_per_gpu": 65.21649389053113,
"input_tput_per_gpu": 518.5963599039667,
"mean_ttft": 1.8585667663766072,
"median_ttft": 0.45114935794845223,
"std_ttft": 10.234254490944684,
"p99_ttft": 78.29275120790116,
"mean_tpot": 0.013298849007126324,
"mean_intvty": 75.19447731635572,
"median_tpot": 0.01310156846391122,
"median_intvty": 76.32673925679501,
"std_tpot": 0.004423266421801681,
"std_intvty": 226.07727065029036,
"p99_tpot": 0.014278045632900516,
"p99_intvty": 70.03759658084626,
"mean_itl": 0.3924580679842618,
"median_itl": 0.31462065083906054,
"std_itl": 0.7087298403103781,
"p99_itl": 0.9959333436563611,
"mean_e2el": 14.148390990492189,
"median_e2el": 12.60518203722313,
"std_e2el": 10.954200990130499,
"p99_e2el": 89.64725519032216
},
{
"hw": "b200-trt",
"conc": 32,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 1166.4328839773923,
"output_tput_per_gpu": 130.4396500874052,
"input_tput_per_gpu": 1035.993233889987,
"mean_ttft": 0.8573595175974333,
"median_ttft": 0.3606759265094297,
"std_ttft": 1.5422136990942272,
"p99_ttft": 8.179303199195711,
"mean_tpot": 0.02895988266072291,
"mean_intvty": 34.53052665010479,
"median_tpot": 0.02936962215651187,
"median_intvty": 34.04878669092032,
"std_tpot": 0.0023642746691267522,
"std_intvty": 422.9627010170317,
"p99_tpot": 0.03385805052004762,
"p99_intvty": 29.53507318467412,
"mean_itl": 0.28834165544620755,
"median_itl": 0.21187163499416783,
"std_itl": 0.18450892510538436,
"p99_itl": 0.9587106351717375,
"mean_e2el": 27.6704300711458,
"median_e2el": 27.499588727994706,
"std_e2el": 3.3372098120782607,
"p99_e2el": 36.714599195705084
},
{
"hw": "mi325x",
"conc": 32,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 761.0409406634109,
"output_tput_per_gpu": 85.10555160604429,
"input_tput_per_gpu": 675.9353890573666,
"mean_ttft": 1.694753202056745,
"median_ttft": 0.5177300707437098,
"std_ttft": 3.442364259003156,
"p99_ttft": 13.745477307308466,
"mean_tpot": 0.04398266381726651,
"mean_intvty": 22.73623089667035,
"median_tpot": 0.045158661082335655,
"median_intvty": 22.14414635050289,
"std_tpot": 0.0038567964969464822,
"std_intvty": 259.28254207649377,
"p99_tpot": 0.05014301314433729,
"p99_intvty": 19.94295789767335,
"mean_itl": 0.0440480005900246,
"median_itl": 0.031085669994354248,
"std_itl": 0.06951553693506261,
"p99_itl": 0.33191738606429705,
"mean_e2el": 42.414789404482875,
"median_e2el": 42.3201302443631,
"std_e2el": 5.339530465156818,
"p99_e2el": 57.51015056228731
},
{
"hw": "b200-trt",
"conc": 256,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "true",
"tput_per_gpu": 5787.942692824528,
"output_tput_per_gpu": 643.2896992853289,
"input_tput_per_gpu": 5144.652993539199,
"mean_ttft": 3.602405921043078,
"median_ttft": 2.7360842505004257,
"std_ttft": 3.4631284782643474,
"p99_ttft": 18.72607067550794,
"mean_tpot": 0.04457944244592752,
"mean_intvty": 22.431864221113724,
"median_tpot": 0.04587442699516998,
"median_intvty": 21.798637399989495,
"std_tpot": 0.004766511511860745,
"std_intvty": 209.7970386752766,
"p99_tpot": 0.052887799498808806,
"p99_intvty": 18.907952485762298,
"mean_itl": 0.4440711351974903,
"median_itl": 0.262256239017006,
"std_itl": 0.40187682044699613,
"p99_itl": 1.8469759417627925,
"mean_e2el": 44.74923921341838,
"median_e2el": 44.7149408635014,
"std_e2el": 6.074768371262613,
"p99_e2el": 61.0329452951773
},
{
"hw": "h200",
"conc": 4,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 284.54994515746006,
"output_tput_per_gpu": 31.653929157849564,
"input_tput_per_gpu": 252.89601599961048,
"mean_ttft": 0.46475899759680034,
"median_ttft": 0.38723520981147885,
"std_ttft": 0.24064916931338368,
"p99_ttft": 1.4157039189524947,
"mean_tpot": 0.014953846514159796,
"mean_intvty": 66.87242637224476,
"median_tpot": 0.015034258991089029,
"median_intvty": 66.51475144818983,
"std_tpot": 0.00040882579411661516,
"std_intvty": 2446.0296155255696,
"p99_tpot": 0.015685303135512755,
"p99_intvty": 63.753947970308694,
"mean_itl": 0.14857945320297133,
"median_itl": 0.13972298707813025,
"std_itl": 0.05807973580672603,
"p99_itl": 0.466889410233125,
"mean_e2el": 14.208358070859685,
"median_e2el": 14.039410853292793,
"std_e2el": 1.0162798560619406,
"p99_e2el": 16.393919616816564
},
{
"hw": "mi325x",
"conc": 64,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1007.1262607499572,
"output_tput_per_gpu": 111.73186908314874,
"input_tput_per_gpu": 895.3943916668085,
"mean_ttft": 2.636473454924271,
"median_ttft": 0.5606257759500295,
"std_ttft": 6.0379214507763335,
"p99_ttft": 26.459134709569625,
"mean_tpot": 0.06756731595899554,
"mean_intvty": 14.800055112546847,
"median_tpot": 0.0698039396881523,
"median_intvty": 14.325838977964281,
"std_tpot": 0.007546197590353992,
"std_intvty": 132.51707075338985,
"p99_tpot": 0.0789253607034334,
"p99_intvty": 12.670198667289691,
"mean_itl": 0.06767759124070488,
"median_itl": 0.041047726292163134,
"std_itl": 0.1330114276936734,
"p99_itl": 0.5373343603219837,
"mean_e2el": 64.9513555161102,
"median_e2el": 65.30839010188356,
"std_e2el": 9.470472555362436,
"p99_e2el": 91.18379649377428
},
{
"hw": "gb200",
"conc": 2150,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "mtp",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "true",
"prefill_num_workers": 6,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 24,
"num_decode_gpu": 8,
"tput_per_gpu": 11586.899389893282,
"output_tput_per_gpu": 5153.757764258276,
"input_tput_per_gpu": 13731.279931771618,
"mean_ttft": 5.6080046435918245,
"median_ttft": 2.340515922056511,
"std_ttft": 8.545496080518607,
"p99_ttft": 41.48888620681361,
"mean_tpot": 0.04399792630381974,
"mean_intvty": 22.728343901816658,
"median_tpot": 0.045152946400419225,
"median_intvty": 22.146948974977974,
"std_tpot": 0.00429923235964204,
"std_intvty": 232.59966346253992,
"p99_tpot": 0.049259746347427,
"p99_intvty": 20.300551142651862,
"mean_itl": 1.552566914864844,
"median_itl": 1.6202171394834295,
"std_itl": 0.2802925605283183,
"p99_itl": 2.1835986783821144,
"mean_e2el": 46.1444434842955,
"median_e2el": 44.3630206604721,
"std_e2el": 9.08788930704751,
"p99_e2el": 82.37627887833158
},
{
"hw": "b200-trt",
"conc": 4,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 792.4734792352098,
"output_tput_per_gpu": 88.15640205906483,
"input_tput_per_gpu": 704.317077176145,
"mean_ttft": 0.2993771802634001,
"median_ttft": 0.2601106141228229,
"std_ttft": 0.13843330790320135,
"p99_ttft": 0.8651273151254281,
"mean_tpot": 0.0107843884835238,
"mean_intvty": 92.72662993620662,
"median_tpot": 0.010858123682905365,
"median_intvty": 92.0969431923458,
"std_tpot": 0.0003155094460644044,
"std_intvty": 3169.477213673887,
"p99_tpot": 0.011184152033164266,
"p99_intvty": 89.41223232970269,
"mean_itl": 0.10716487726007864,
"median_itl": 0.10180806531570852,
"std_itl": 0.03696297334400772,
"p99_itl": 0.31483862059656514,
"mean_e2el": 10.212128136551474,
"median_e2el": 10.162146556423977,
"std_e2el": 0.7333384061127683,
"p99_e2el": 11.606803273446857
},
{
"hw": "b200-trt",
"conc": 128,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "true",
"tput_per_gpu": 4002.0655557134196,
"output_tput_per_gpu": 443.29079247943395,
"input_tput_per_gpu": 3558.7747632339856,
"mean_ttft": 2.6895396037034516,
"median_ttft": 2.2323225885047577,
"std_ttft": 1.9130492481971086,
"p99_ttft": 9.487109020580247,
"mean_tpot": 0.03210708087049574,
"mean_intvty": 31.14577759446618,
"median_tpot": 0.03286953115370787,
"median_intvty": 30.42331195184067,
"std_tpot": 0.0024601546634111246,
"std_intvty": 406.47850920618595,
"p99_tpot": 0.03518312436142935,
"p99_intvty": 28.422717372317358,
"mean_itl": 0.31946906588998664,
"median_itl": 0.22735516999091487,
"std_itl": 0.24158929514621547,
"p99_itl": 0.8595785125729162,
"mean_e2el": 32.22645127788521,
"median_e2el": 32.143837570009055,
"std_e2el": 3.49088992030142,
"p99_e2el": 40.36630205641501
},
{
"hw": "mi325x",
"conc": 16,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 526.8705266523322,
"output_tput_per_gpu": 58.380624836804785,
"input_tput_per_gpu": 468.4899018155274,
"mean_ttft": 1.166195775414235,
"median_ttft": 0.4937654985114932,
"std_ttft": 2.0136986772616745,
"p99_ttft": 7.420154744726606,
"mean_tpot": 0.03187865361902512,
"mean_intvty": 31.36895340533459,
"median_tpot": 0.0323879934339943,
"median_intvty": 30.875639209880912,
"std_tpot": 0.00197378449272947,
"std_intvty": 506.64092441882485,
"p99_tpot": 0.03460040259689892,
"p99_intvty": 28.901397814649286,
"mean_itl": 0.031921237410962124,
"median_itl": 0.025696489959955215,
"std_itl": 0.043156494953875524,
"p99_itl": 0.2847564077237621,
"mean_e2el": 30.347393684479176,
"median_e2el": 30.27519692061469,
"std_e2el": 2.976471461442712,
"p99_e2el": 36.9449519599555
},
{
"hw": "gb200",
"conc": 512,
"image": "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1-rc0.pre1",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 6,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 48,
"num_decode_gpu": 24,
"tput_per_gpu": 1332.876115660653,
"output_tput_per_gpu": 433.3331654053297,
"input_tput_per_gpu": 1782.6475907883148,
"mean_ttft": 9.953075398912032,
"median_ttft": 8.665439720032737,
"std_ttft": 9.52626791510207,
"p99_ttft": 37.07860949732829,
"mean_tpot": 0.031791355783370084,
"mean_intvty": 31.455091340366664,
"median_tpot": 0.03177086336509092,
"median_intvty": 31.475380083588682,
"std_tpot": 0.0005459615016026859,
"std_intvty": 1831.6309795919144,
"p99_tpot": 0.032933163321950225,
"p99_intvty": 30.36452922010962,
"mean_itl": 1.504844924391102,
"median_itl": 1.537836599512957,
"std_itl": 0.2013939781693081,
"p99_itl": 1.6456630894518458,
"mean_e2el": 38.41019501230004,
"median_e2el": 36.95558860595338,
"std_e2el": 9.624402903919666,
"p99_e2el": 65.33690612451173
},
{
"hw": "gb200",
"conc": 256,
"image": "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1-rc0.pre1",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 6,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 48,
"num_decode_gpu": 24,
"tput_per_gpu": 867.7836641810992,
"output_tput_per_gpu": 281.1370997081605,
"input_tput_per_gpu": 1161.1069464175687,
"mean_ttft": 2.521223300898055,
"median_ttft": 0.9612152059562504,
"std_ttft": 4.155598492219148,
"p99_ttft": 21.420394152011717,
"mean_tpot": 0.03193347738286641,
"mean_intvty": 31.315098822796546,
"median_tpot": 0.03190404793363996,
"median_intvty": 31.343985004034224,
"std_tpot": 0.0006270813776449388,
"std_intvty": 1594.6893587489253,
"p99_tpot": 0.033213573505175516,
"p99_intvty": 30.108172486895295,
"mean_itl": 1.5113475017039775,
"median_itl": 1.5498161260038614,
"std_itl": 0.20223573818443075,
"p99_itl": 1.749023714121431,
"mean_e2el": 31.037420247295387,
"median_e2el": 30.070463735144585,
"std_e2el": 4.606653027739007,
"p99_e2el": 49.68362816823179
},
{
"hw": "gb200",
"conc": 2048,
"image": "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1-rc0.pre1",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 6,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 48,
"num_decode_gpu": 24,
"tput_per_gpu": 1983.0095541617811,
"output_tput_per_gpu": 644.7490283221649,
"input_tput_per_gpu": 2652.139817081589,
"mean_ttft": 64.89443928080613,
"median_ttft": 64.21661266044248,
"std_ttft": 25.719279257431694,
"p99_ttft": 125.17794490979266,
"mean_tpot": 0.03208382991427337,
"mean_intvty": 31.168348749882966,
"median_tpot": 0.03207638092454229,
"median_intvty": 31.175586870365404,
"std_tpot": 0.0005936252795421362,
"std_intvty": 1684.5643783420091,
"p99_tpot": 0.03339366994665898,
"p99_intvty": 29.94579516409365,
"mean_itl": 1.5185880041283804,
"median_itl": 1.54076954908669,
"std_itl": 0.20537279727179938,
"p99_itl": 1.6982681187707926,
"mean_e2el": 93.60401499152496,
"median_e2el": 92.96038837602828,
"std_e2el": 25.839062239422645,
"p99_e2el": 154.61504848154493
},
{
"hw": "gb200",
"conc": 1024,
"image": "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1-rc0.pre1",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 6,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 48,
"num_decode_gpu": 24,
"tput_per_gpu": 1717.5547860881363,
"output_tput_per_gpu": 558.6358482850104,
"input_tput_per_gpu": 2297.014254989699,
"mean_ttft": 26.54257281240197,
"median_ttft": 23.726894593099132,
"std_ttft": 13.73065002955058,
"p99_ttft": 62.08127356006765,
"mean_tpot": 0.03194973955454611,
"mean_intvty": 31.299159678367726,
"median_tpot": 0.03192979677890738,
"median_intvty": 31.318708569438613,
"std_tpot": 0.0005566696520757155,
"std_intvty": 1796.3975515302293,
"p99_tpot": 0.03310386954947478,
"p99_intvty": 30.20794890776948,
"mean_itl": 1.5123055060638444,
"median_itl": 1.5449564179871231,
"std_itl": 0.20564354161582168,
"p99_itl": 1.6685769137367605,
"mean_e2el": 55.154658890648435,
"median_e2el": 52.40749449306168,
"std_e2el": 13.88340356818276,
"p99_e2el": 90.72524178851398
},
{
"hw": "gb200",
"conc": 576,
"image": "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1-rc0.pre1",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 6,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 48,
"num_decode_gpu": 24,
"tput_per_gpu": 1421.3460214133402,
"output_tput_per_gpu": 461.95692469912177,
"input_tput_per_gpu": 1901.0405697704493,
"mean_ttft": 10.533482075181809,
"median_ttft": 9.638215720886365,
"std_ttft": 9.112935147123567,
"p99_ttft": 38.30993102346547,
"mean_tpot": 0.0319982706593111,
"mean_intvty": 31.251688900537893,
"median_tpot": 0.03200607933104038,
"median_intvty": 31.244064280943412,
"std_tpot": 0.0005681861637047689,
"std_intvty": 1759.9865394110564,
"p99_tpot": 0.03316614225671057,
"p99_intvty": 30.15123050066723,
"mean_itl": 1.5145775326450877,
"median_itl": 1.5524821674916893,
"std_itl": 0.20503224231303058,
"p99_itl": 1.6526376004260965,
"mean_e2el": 39.16307608043521,
"median_e2el": 37.79507863894105,
"std_e2el": 9.369511360362129,
"p99_e2el": 66.66888370253143
},
{
"hw": "gb200",
"conc": 384,
"image": "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1-rc0.pre1",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 6,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 48,
"num_decode_gpu": 24,
"tput_per_gpu": 1162.8688480064793,
"output_tput_per_gpu": 377.67353278352033,
"input_tput_per_gpu": 1555.466505617959,
"mean_ttft": 6.014340951348999,
"median_ttft": 1.4540226301178336,
"std_ttft": 7.698878062300592,
"p99_ttft": 32.45374891515821,
"mean_tpot": 0.031623398653318584,
"mean_intvty": 31.62215456228514,
"median_tpot": 0.031615410366664744,
"median_intvty": 31.63014455299935,
"std_tpot": 0.00053515125114295,
"std_intvty": 1868.6305934336299,
"p99_tpot": 0.03270536676747725,
"p99_intvty": 30.576021578037043,
"mean_itl": 1.496807710664101,
"median_itl": 1.5261819730512798,
"std_itl": 0.1987972535465383,
"p99_itl": 1.6362171104433945,
"mean_e2el": 34.29839658493779,
"median_e2el": 31.366031406912953,
"std_e2el": 7.8358978561484465,
"p99_e2el": 61.64612164883873
},
{
"hw": "gb200",
"conc": 4096,
"image": "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1-rc0.pre1",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 6,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 48,
"num_decode_gpu": 24,
"tput_per_gpu": 2168.162548550103,
"output_tput_per_gpu": 705.6406792729853,
"input_tput_per_gpu": 2899.423483188662,
"mean_ttft": 150.22241129540342,
"median_ttft": 156.10788435989525,
"std_ttft": 46.59709920985325,
"p99_ttft": 248.22315430693106,
"mean_tpot": 0.03200886914012845,
"mean_intvty": 31.241341130241096,
"median_tpot": 0.03200315403455764,
"median_intvty": 31.24692019168424,
"std_tpot": 0.0005915525088231782,
"std_intvty": 1690.4670085659486,
"p99_tpot": 0.03325934703737439,
"p99_intvty": 30.066735792385643,
"mean_itl": 1.5151371847117296,
"median_itl": 1.5467814865987748,
"std_itl": 0.20930433757606617,
"p99_itl": 1.682421277486719,
"mean_e2el": 178.89063952145983,
"median_e2el": 184.58937938360032,
"std_e2el": 46.66393347731728,
"p99_e2el": 277.06916467992596
},
{
"hw": "gb200",
"conc": 128,
"image": "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1-rc0.pre1",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 6,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 48,
"num_decode_gpu": 24,
"tput_per_gpu": 415.7561597470263,
"output_tput_per_gpu": 134.85625251759976,
"input_tput_per_gpu": 556.2061133617395,
"mean_ttft": 3.243285306732806,
"median_ttft": 0.8468559158500284,
"std_ttft": 4.362480535262369,
"p99_ttft": 19.05465576951385,
"mean_tpot": 0.03236366685479545,
"mean_intvty": 30.898847293375418,
"median_tpot": 0.032354479697778035,
"median_intvty": 30.90762111895979,
"std_tpot": 0.0005554307375362904,
"std_intvty": 1800.4045012627028,
"p99_tpot": 0.0334191193032958,
"p99_intvty": 29.922990816259475,
"mean_itl": 1.5317677273793802,
"median_itl": 1.566724831936881,
"std_itl": 0.2053775693729625,
"p99_itl": 1.654787944376003,
"mean_e2el": 32.1839179824569,
"median_e2el": 30.99567767395638,
"std_e2el": 4.717033935459132,
"p99_e2el": 47.189649052661835
},
{
"hw": "gb200",
"conc": 448,
"image": "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1-rc0.pre1",
"model": "deepseek-ai/DeepSeek-R1",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 1,
"prefill_ep": 1,
"prefill_dp_attention": "true",
"prefill_num_workers": 6,
"decode_tp": 1,
"decode_ep": 1,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 48,
"num_decode_gpu": 24,
"tput_per_gpu": 1177.34614542953,
"output_tput_per_gpu": 383.00978315020285,
"input_tput_per_gpu": 1574.5143265691938,
"mean_ttft": 7.86869031612678,
"median_ttft": 4.594009778928012,
"std_ttft": 8.561461137041196,
"p99_ttft": 37.736418539739674,
"mean_tpot": 0.03177480204675891,
"mean_intvty": 31.471478517110132,
"median_tpot": 0.03175177196098958,
"median_intvty": 31.494305301405102,
"std_tpot": 0.0005627267160252085,
"std_intvty": 1777.0615318630134,
"p99_tpot": 0.032955840535432795,
"p99_intvty": 30.343635111501406,
"mean_itl": 1.5041272519323332,
"median_itl": 1.5345245760399848,
"std_itl": 0.20305583232993943,
"p99_itl": 1.6554095538146794,
"mean_e2el": 36.33897268572286,
"median_e2el": 32.994612449081615,
"std_e2el": 8.755094490664163,
"p99_e2el": 65.54059818935342
},
{
"hw": "h200",
"conc": 8,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 455.9254377635803,
"output_tput_per_gpu": 51.253708956302425,
"input_tput_per_gpu": 404.6717288072779,
"mean_ttft": 0.5594684223760851,
"median_ttft": 0.391155275166966,
"std_ttft": 0.480253736308439,
"p99_ttft": 2.6844802987040013,
"mean_tpot": 0.018602590764619394,
"mean_intvty": 53.75595327839594,
"median_tpot": 0.01873435791437878,
"median_intvty": 53.37786352594937,
"std_tpot": 0.0007553932863165987,
"std_intvty": 1323.8137247368682,
"p99_tpot": 0.01988876235290793,
"p99_intvty": 50.279649495323696,
"mean_itl": 0.18515248294366918,
"median_itl": 0.16496686707250774,
"std_itl": 0.0937382655081174,
"p99_itl": 0.5121084999898449,
"mean_e2el": 17.78559239502938,
"median_e2el": 17.684128840453923,
"std_e2el": 1.652266931668318,
"p99_e2el": 21.81177438230952
},
{
"hw": "b200-trt",
"conc": 128,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "true",
"tput_per_gpu": 5986.4712751791685,
"output_tput_per_gpu": 663.0944842822487,
"input_tput_per_gpu": 5323.37679089692,
"mean_ttft": 2.7622101650124704,
"median_ttft": 1.9597175660019275,
"std_ttft": 2.853256070319707,
"p99_ttft": 14.819793385498926,
"mean_tpot": 0.04389634019420483,
"mean_intvty": 22.780942456155362,
"median_tpot": 0.04488137079634876,
"median_intvty": 22.280959388195715,
"std_tpot": 0.0038341752469974225,
"std_intvty": 260.8122831065453,
"p99_tpot": 0.049970237952878666,
"p99_intvty": 20.011911909304654,
"mean_itl": 0.43694390293316604,
"median_itl": 0.29546207598468754,
"std_itl": 0.3165747844827714,
"p99_itl": 1.6882043505387263,
"mean_e2el": 43.160404740924264,
"median_e2el": 43.0627205514902,
"std_e2el": 5.319060045165148,
"p99_e2el": 57.57996989592997
},
{
"hw": "mi300x",
"conc": 16,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 494.5765770173539,
"output_tput_per_gpu": 54.80224862715505,
"input_tput_per_gpu": 439.77432839019883,
"mean_ttft": 1.3497309541038702,
"median_ttft": 0.5528363259509206,
"std_ttft": 2.37650759669871,
"p99_ttft": 8.732787766931578,
"mean_tpot": 0.033919200915555746,
"mean_intvty": 29.48182660580863,
"median_tpot": 0.034493818873072055,
"median_intvty": 28.990701310276204,
"std_tpot": 0.0022922837411663375,
"std_intvty": 436.2461688495813,
"p99_tpot": 0.03720066229125055,
"p99_intvty": 26.881241849159128,
"mean_itl": 0.03397400456768897,
"median_itl": 0.027208724757656455,
"std_itl": 0.04796106825324097,
"p99_itl": 0.30270305374171586,
"mean_e2el": 32.40154604961572,
"median_e2el": 32.334916272666305,
"std_e2el": 3.2918736000907423,
"p99_e2el": 39.78148355065379
},
{
"hw": "b200",
"conc": 64,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 3322.5741890528784,
"output_tput_per_gpu": 368.61060899540354,
"input_tput_per_gpu": 2953.9635800574747,
"mean_ttft": 8.955083923936035,
"median_ttft": 1.1607401908840984,
"std_ttft": 23.556375568788926,
"p99_ttft": 84.5466351998737,
"mean_tpot": 0.03305349728460488,
"mean_intvty": 30.25398466581519,
"median_tpot": 0.03374215299836984,
"median_intvty": 29.636520231779883,
"std_tpot": 0.004567360532983714,
"std_intvty": 218.94483537666585,
"p99_tpot": 0.038990885485413546,
"p99_intvty": 25.647019490596055,
"mean_itl": 0.32950496165007903,
"median_itl": 0.2197466529905796,
"std_itl": 0.4849983165974052,
"p99_itl": 1.5713326438888906,
"mean_e2el": 39.44098572878793,
"median_e2el": 32.352710059145465,
"std_e2el": 24.06024976498184,
"p99_e2el": 116.87322018155828
},
{
"hw": "b200",
"conc": 4,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 429.381182315142,
"output_tput_per_gpu": 47.655137090509506,
"input_tput_per_gpu": 381.72604522463246,
"mean_ttft": 0.505238144332543,
"median_ttft": 0.4387638543266803,
"std_ttft": 0.2757396379796443,
"p99_ttft": 2.1169873032486066,
"mean_tpot": 0.00986684716657373,
"mean_intvty": 101.3494972728204,
"median_tpot": 0.009865564874438487,
"median_intvty": 101.36267033132418,
"std_tpot": 0.0003466097911216333,
"std_intvty": 2885.088724020139,
"p99_tpot": 0.010855434389631205,
"p99_intvty": 92.11975901721362,
"mean_itl": 0.29131438420783023,
"median_itl": 0.2625766422133893,
"std_itl": 0.10921299840093948,
"p99_itl": 0.6500991531740872,
"mean_e2el": 9.565115327266977,
"median_e2el": 9.564479554072022,
"std_e2el": 0.8077112600599302,
"p99_e2el": 12.545057391291484
},
{
"hw": "b200-trt",
"conc": 8,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 836.4575286485567,
"output_tput_per_gpu": 94.03193412053514,
"input_tput_per_gpu": 742.4255945280215,
"mean_ttft": 0.2991049903386738,
"median_ttft": 0.21255931234918535,
"std_ttft": 0.23741315614060687,
"p99_ttft": 1.346079852469265,
"mean_tpot": 0.010138902600354526,
"mean_intvty": 98.6300036026614,
"median_tpot": 0.010198530735057098,
"median_intvty": 98.05333983674085,
"std_tpot": 0.0003740565226395999,
"std_intvty": 2673.392761455709,
"p99_tpot": 0.010735585391772166,
"p99_intvty": 93.14815759990206,
"mean_itl": 0.10090198039362387,
"median_itl": 0.09032239764928818,
"std_itl": 0.04863076167210806,
"p99_itl": 0.2678565432224423,
"mean_e2el": 9.686772615503287,
"median_e2el": 9.727134476415813,
"std_e2el": 0.8694516643215193,
"p99_e2el": 11.640783026874995
},
{
"hw": "gb200",
"conc": 3,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "mtp",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "false",
"decode_num_workers": 3,
"num_prefill_gpu": 4,
"num_decode_gpu": 24,
"tput_per_gpu": 236.44917145422332,
"output_tput_per_gpu": 31.214570853508153,
"input_tput_per_gpu": 1467.8567750585144,
"mean_ttft": 0.2555504057478781,
"median_ttft": 0.23104987503029406,
"std_ttft": 0.08413202500719894,
"p99_ttft": 0.5524090574122964,
"mean_tpot": 0.003590601898003369,
"mean_intvty": 278.50483802063144,
"median_tpot": 0.0037511487972202847,
"median_intvty": 266.5849994383135,
"std_tpot": 0.00055126632821161,
"std_intvty": 1814.0052254672414,
"p99_tpot": 0.004260729662644132,
"p99_intvty": 234.7015838079288,
"mean_itl": 0.1712183131110239,
"median_itl": 0.17537554004229605,
"std_itl": 0.018716414773259294,
"p99_itl": 0.17713599645067007,
"mean_e2el": 3.622843738296069,
"median_e2el": 3.6881058050785214,
"std_e2el": 0.5866481972703227,
"p99_e2el": 4.588203874870669
},
{
"hw": "gb200",
"conc": 12,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "mtp",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "false",
"decode_num_workers": 3,
"num_prefill_gpu": 4,
"num_decode_gpu": 24,
"tput_per_gpu": 703.502053037639,
"output_tput_per_gpu": 91.2265378124714,
"input_tput_per_gpu": 4377.155144388646,
"mean_ttft": 0.43944471699069254,
"median_ttft": 0.2933722954476252,
"std_ttft": 0.4118700176945853,
"p99_ttft": 2.185169899708125,
"mean_tpot": 0.004730206147402193,
"mean_intvty": 211.4072767313102,
"median_tpot": 0.004815105150559249,
"median_intvty": 207.67978449730327,
"std_tpot": 0.0005610756752586114,
"std_intvty": 1782.2907748390257,
"p99_tpot": 0.005472014359725834,
"p99_intvty": 182.74805844078656,
"mean_itl": 0.2227848983604473,
"median_itl": 0.22745734406635165,
"std_itl": 0.031256383053872704,
"p99_itl": 0.2594779595080763,
"mean_e2el": 4.7628640371161355,
"median_e2el": 4.719590820954181,
"std_e2el": 0.7424885383484437,
"p99_e2el": 7.246178079408127
},
{
"hw": "gb200",
"conc": 24,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "mtp",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "false",
"decode_num_workers": 3,
"num_prefill_gpu": 4,
"num_decode_gpu": 24,
"tput_per_gpu": 1157.9408655913353,
"output_tput_per_gpu": 150.0885262124022,
"input_tput_per_gpu": 7205.0549018649335,
"mean_ttft": 0.7533605824197972,
"median_ttft": 0.518273382098414,
"std_ttft": 0.842065355367776,
"p99_ttft": 4.398706275497098,
"mean_tpot": 0.005605901279615122,
"mean_intvty": 178.38344810607438,
"median_tpot": 0.005766838773044422,
"median_intvty": 173.40522933886038,
"std_tpot": 0.0007144994025041151,
"std_intvty": 1399.5812963527856,
"p99_tpot": 0.006576600971072247,
"p99_intvty": 152.05423050578668,
"mean_itl": 0.26553995860796886,
"median_itl": 0.27304362202994525,
"std_itl": 0.03661265603461066,
"p99_itl": 0.31525636158417913,
"mean_e2el": 5.896814159167358,
"median_e2el": 5.767766778473742,
"std_e2el": 1.126134914850507,
"p99_e2el": 9.557400259713178
},
{
"hw": "gb200",
"conc": 54,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "mtp",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "false",
"decode_num_workers": 3,
"num_prefill_gpu": 4,
"num_decode_gpu": 24,
"tput_per_gpu": 1425.610919700181,
"output_tput_per_gpu": 185.60426321195712,
"input_tput_per_gpu": 8865.650858629526,
"mean_ttft": 4.371231129903377,
"median_ttft": 4.2330462815007195,
"std_ttft": 1.2056774197270312,
"p99_ttft": 9.673425318642984,
"mean_tpot": 0.006808302405983695,
"mean_intvty": 146.87949218018252,
"median_tpot": 0.007017827227059756,
"median_intvty": 142.49424610285936,
"std_tpot": 0.0009958585862091275,
"std_intvty": 1004.1586364251148,
"p99_tpot": 0.008408631619698527,
"p99_intvty": 118.92541441075198,
"mean_itl": 0.3225559330078379,
"median_itl": 0.3333500580629334,
"std_itl": 0.050368993440871666,
"p99_itl": 0.40406737157842165,
"mean_e2el": 10.667044979062027,
"median_e2el": 10.728371675591916,
"std_e2el": 1.6182414808153056,
"p99_e2el": 15.892013002103194
},
{
"hw": "gb200",
"conc": 6,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "mtp",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "false",
"prefill_num_workers": 1,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "false",
"decode_num_workers": 3,
"num_prefill_gpu": 4,
"num_decode_gpu": 24,
"tput_per_gpu": 427.3629169969921,
"output_tput_per_gpu": 55.84708807468552,
"input_tput_per_gpu": 2656.4578905308317,
"mean_ttft": 0.3416456107176297,
"median_ttft": 0.2483571219490841,
"std_ttft": 0.20785176031329594,
"p99_ttft": 1.135576991871931,
"mean_tpot": 0.004001975462280119,
"mean_intvty": 249.8765945531939,
"median_tpot": 0.004055770960198265,
"median_intvty": 246.5622466883868,
"std_tpot": 0.0005291560303101108,
"std_intvty": 1889.8017649235746,
"p99_tpot": 0.004814203127326059,
"p99_intvty": 207.71869685428658,
"mean_itl": 0.19120934253435815,
"median_itl": 0.1961024854099378,
"std_itl": 0.020975258993079084,
"p99_itl": 0.2194919937173836,
"mean_e2el": 4.030392380450697,
"median_e2el": 3.99119458894711,
"std_e2el": 0.5709705260375629,
"p99_e2el": 5.160084098486696
},
{
"hw": "mi325x",
"conc": 8,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 320.97338771041916,
"output_tput_per_gpu": 36.08282239553155,
"input_tput_per_gpu": 284.8905653148876,
"mean_ttft": 0.7936841032293159,
"median_ttft": 0.4643264540936798,
"std_ttft": 1.0108986624543375,
"p99_ttft": 4.047802044227719,
"mean_tpot": 0.02637186280278715,
"mean_intvty": 37.91920227547648,
"median_tpot": 0.026512040364479773,
"median_intvty": 37.71871143270351,
"std_tpot": 0.000957194540374403,
"std_intvty": 1044.7197072486997,
"p99_tpot": 0.028001624309348543,
"p99_intvty": 35.71221401131872,
"mean_itl": 0.02639836082858806,
"median_itl": 0.023583958856761456,
"std_itl": 0.02756395823251591,
"p99_itl": 0.17242617987096298,
"mean_e2el": 25.20853780109901,
"median_e2el": 25.308424927527085,
"std_e2el": 2.2458551616848212,
"p99_e2el": 30.630026207705956
},
{
"hw": "b200-trt",
"conc": 4,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 518.7487602096421,
"output_tput_per_gpu": 57.70669362565438,
"input_tput_per_gpu": 461.0420665839877,
"mean_ttft": 0.23643233116017653,
"median_ttft": 0.20578252244740725,
"std_ttft": 0.10761433939509357,
"p99_ttft": 0.6774951850110664,
"mean_tpot": 0.008207519368524412,
"mean_intvty": 121.83949316464238,
"median_tpot": 0.008237395984556624,
"median_intvty": 121.39758752338587,
"std_tpot": 0.0001845104843645844,
"std_intvty": 5419.746218995583,
"p99_tpot": 0.008547105417867646,
"p99_intvty": 116.9986739498391,
"mean_itl": 0.08156068560428213,
"median_itl": 0.0771625058259815,
"std_itl": 0.028274467160219374,
"p99_itl": 0.24078627801500252,
"mean_e2el": 7.780795569694601,
"median_e2el": 7.688073921948671,
"std_e2el": 0.5598020451970261,
"p99_e2el": 8.913872493896633
},
{
"hw": "h200",
"conc": 64,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1307.588808988474,
"output_tput_per_gpu": 145.06556656729177,
"input_tput_per_gpu": 1162.5232424211822,
"mean_ttft": 2.2544231202075027,
"median_ttft": 0.7238704985938966,
"std_ttft": 4.868124451588704,
"p99_ttft": 19.707702013114474,
"mean_tpot": 0.05136453982321187,
"mean_intvty": 19.46868410467284,
"median_tpot": 0.05271500683762245,
"median_intvty": 18.969930196163887,
"std_tpot": 0.005908945080237794,
"std_intvty": 169.2349457341304,
"p99_tpot": 0.06219787032757995,
"p99_intvty": 16.07772090480367,
"mean_itl": 0.5121838594842107,
"median_itl": 0.3395709656178951,
"std_itl": 0.47641326263989225,
"p99_itl": 1.7470698659494517,
"mean_e2el": 49.64183358902956,
"median_e2el": 49.71469224570319,
"std_e2el": 8.192874369531914,
"p99_e2el": 70.6121707696747
},
{
"hw": "b200-trt",
"conc": 16,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 782.6722911502122,
"output_tput_per_gpu": 86.72509675219415,
"input_tput_per_gpu": 695.947194398018,
"mean_ttft": 0.6030681762262248,
"median_ttft": 0.33572118799202144,
"std_ttft": 0.8113796507822445,
"p99_ttft": 4.357851996324024,
"mean_tpot": 0.021695220714028927,
"mean_intvty": 46.093101018942995,
"median_tpot": 0.021852316971485555,
"median_intvty": 45.76173781960377,
"std_tpot": 0.001137275532261217,
"std_intvty": 879.2943940433895,
"p99_tpot": 0.024036563956884166,
"p99_intvty": 41.6032841380224,
"mean_itl": 0.21588798292606534,
"median_itl": 0.1781200363766402,
"std_itl": 0.12052091476378808,
"p99_itl": 0.6915661568380892,
"mean_e2el": 20.453968046803492,
"median_e2el": 20.448228939436376,
"std_e2el": 1.898436937457109,
"p99_e2el": 24.549941731705328
},
{
"hw": "mi355x",
"conc": 32,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 945.1093693070641,
"output_tput_per_gpu": 104.73183639584012,
"input_tput_per_gpu": 840.377532911224,
"mean_ttft": 0.5515366957212245,
"median_ttft": 0.3677878775051795,
"std_ttft": 1.1431339703432626,
"p99_ttft": 8.553269391110517,
"mean_tpot": 0.037409004758092425,
"mean_intvty": 26.731531792052746,
"median_tpot": 0.03762266212183883,
"median_intvty": 26.579724655356856,
"std_tpot": 0.0014555134089877728,
"std_intvty": 687.0427945390372,
"p99_tpot": 0.03919898863190599,
"p99_intvty": 25.510862267146624,
"mean_itl": 0.037420096225251644,
"median_itl": 0.02713847899576649,
"std_itl": 0.0450007887475665,
"p99_itl": 0.23833390001062071,
"mean_e2el": 34.93924837546177,
"median_e2el": 34.92177514899231,
"std_e2el": 2.731934553179755,
"p99_e2el": 40.6411862616759
},
{
"hw": "b200",
"conc": 64,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1484.9900590632233,
"output_tput_per_gpu": 165.1690865171544,
"input_tput_per_gpu": 1319.8209725460688,
"mean_ttft": 3.113281391416167,
"median_ttft": 1.3873272670898587,
"std_ttft": 11.534490574394775,
"p99_ttft": 83.22423392005695,
"mean_tpot": 0.0449131498543964,
"mean_intvty": 22.26519411891378,
"median_tpot": 0.045231892329865374,
"median_intvty": 22.108294579126586,
"std_tpot": 0.00313684308775457,
"std_intvty": 318.79184645981917,
"p99_tpot": 0.04866689353971899,
"p99_intvty": 20.547849416027763,
"mean_itl": 1.3274252895567766,
"median_itl": 1.2945494204759598,
"std_itl": 0.6235061393514252,
"p99_itl": 2.8226483183749953,
"mean_e2el": 44.51650563988565,
"median_e2el": 43.08752920245752,
"std_e2el": 11.939198962477846,
"p99_e2el": 122.34316254628821
},
{
"hw": "h200",
"conc": 4,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 329.91581135822315,
"output_tput_per_gpu": 36.70052269737318,
"input_tput_per_gpu": 293.21528866085,
"mean_ttft": 0.4158400625921786,
"median_ttft": 0.3429237497039139,
"std_ttft": 0.22196970446768557,
"p99_ttft": 1.1694452627561986,
"mean_tpot": 0.01287355791281678,
"mean_intvty": 77.6786034422085,
"median_tpot": 0.012955562138903724,
"median_intvty": 77.18692475698458,
"std_tpot": 0.0003600893225605913,
"std_intvty": 2777.088731454214,
"p99_tpot": 0.013390182534044854,
"p99_intvty": 74.68158088640513,
"mean_itl": 0.1281300840405702,
"median_itl": 0.1203902424313128,
"std_itl": 0.047838409360930575,
"p99_itl": 0.4056673784460875,
"mean_e2el": 12.248653000942431,
"median_e2el": 12.127100224606693,
"std_e2el": 0.8834281165187826,
"p99_e2el": 14.108213157039135
},
{
"hw": "h200",
"conc": 64,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "true",
"tput_per_gpu": 1332.589164231567,
"output_tput_per_gpu": 147.8391377945711,
"input_tput_per_gpu": 1184.750026436996,
"mean_ttft": 4.734195847989758,
"median_ttft": 3.3858342925086617,
"std_ttft": 3.2264324921032195,
"p99_ttft": 16.36355366995562,
"mean_tpot": 0.047791538833310734,
"mean_intvty": 20.92420592456419,
"median_tpot": 0.049093611746217035,
"median_intvty": 20.36924895991292,
"std_tpot": 0.004359653538055402,
"std_intvty": 229.3760252439794,
"p99_tpot": 0.05248331161507765,
"p99_intvty": 19.05367571570532,
"mean_itl": 0.4753679620224636,
"median_itl": 0.32158594485372305,
"std_itl": 0.44275179346914384,
"p99_itl": 2.217458990626037,
"mean_e2el": 48.7599538123075,
"median_e2el": 49.13928065914661,
"std_e2el": 5.359277569037358,
"p99_e2el": 60.21960402125494
},
{
"hw": "b200",
"conc": 128,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 4429.4879622328335,
"output_tput_per_gpu": 490.6344490667089,
"input_tput_per_gpu": 3938.8535131661247,
"mean_ttft": 10.010234784092244,
"median_ttft": 1.7318915205541998,
"std_ttft": 24.866348715015917,
"p99_ttft": 93.6476610076963,
"mean_tpot": 0.05350510467253977,
"mean_intvty": 18.689805507720582,
"median_tpot": 0.05489342889627706,
"median_intvty": 18.217116695142746,
"std_tpot": 0.006733666195353514,
"std_intvty": 148.50751002329727,
"p99_tpot": 0.0652370240208414,
"p99_intvty": 15.328718852664524,
"mean_itl": 0.5335150494133725,
"median_itl": 0.3093388064298779,
"std_itl": 0.7440075934028438,
"p99_itl": 3.1679701294982805,
"mean_e2el": 59.27868219536758,
"median_e2el": 52.14545012556482,
"std_e2el": 26.072336125572637,
"p99_e2el": 146.60957308425336
},
{
"hw": "b200",
"conc": 4,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 263.8244827997229,
"output_tput_per_gpu": 29.3483855146327,
"input_tput_per_gpu": 234.4760972850902,
"mean_ttft": 5.960618521232391,
"median_ttft": 0.22662992554251105,
"std_ttft": 20.045879531801326,
"p99_ttft": 76.45129931208677,
"mean_tpot": 0.010584799438321284,
"mean_intvty": 94.47510137788656,
"median_tpot": 0.008318587044054833,
"median_intvty": 120.21272299058103,
"std_tpot": 0.014202467696871848,
"std_intvty": 70.41029920597913,
"p99_tpot": 0.0640450879209429,
"p99_intvty": 15.613999956318235,
"mean_itl": 0.10309301729656405,
"median_itl": 0.07841639150865376,
"std_itl": 1.2432662199244033,
"p99_itl": 0.17089507020078604,
"mean_e2el": 15.481258475623326,
"median_e2el": 7.944353825878352,
"std_e2el": 22.98834399822549,
"p99_e2el": 85.21141587565653
},
{
"hw": "b200-trt",
"conc": 32,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 2801.9483216640765,
"output_tput_per_gpu": 313.3357809620345,
"input_tput_per_gpu": 2488.6125407020418,
"mean_ttft": 0.6781681046966697,
"median_ttft": 0.29109422164037824,
"std_ttft": 1.2336388407426517,
"p99_ttft": 6.549932913845406,
"mean_tpot": 0.024210675142248696,
"mean_intvty": 41.30409392239359,
"median_tpot": 0.024633202942316898,
"median_intvty": 40.59561407185581,
"std_tpot": 0.001936149575724676,
"std_intvty": 516.4890215807386,
"p99_tpot": 0.028023305288502727,
"p99_intvty": 35.68458430241901,
"mean_itl": 0.2410305415991057,
"median_itl": 0.17934838589280844,
"std_itl": 0.14775194874554615,
"p99_itl": 0.7620435172691941,
"mean_e2el": 23.09174864282686,
"median_e2el": 23.08489513769746,
"std_e2el": 2.704779265396502,
"p99_e2el": 30.304054012876005
},
{
"hw": "gb200",
"conc": 538,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "mtp",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "true",
"prefill_num_workers": 8,
"decode_tp": 32,
"decode_ep": 32,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 32,
"num_decode_gpu": 32,
"tput_per_gpu": 6080.273066309381,
"output_tput_per_gpu": 1350.7000212855316,
"input_tput_per_gpu": 10809.84611133323,
"mean_ttft": 1.3141834003561728,
"median_ttft": 0.7437754850834608,
"std_ttft": 1.769158793838297,
"p99_ttft": 8.875164489299062,
"mean_tpot": 0.010162166661163867,
"mean_intvty": 98.40421175354652,
"median_tpot": 0.010524230614615666,
"median_intvty": 95.01882243165943,
"std_tpot": 0.0015277926754869125,
"std_intvty": 654.5390719858614,
"p99_tpot": 0.012676291434816099,
"p99_intvty": 78.8874258013229,
"mean_itl": 0.4948975589563519,
"median_itl": 0.5069642390590161,
"std_itl": 0.07087057489117143,
"p99_itl": 0.5619539882382378,
"mean_e2el": 10.664228591188351,
"median_e2el": 10.488985555013642,
"std_e2el": 2.317289836930958,
"p99_e2el": 18.798320397834765
},
{
"hw": "b200-trt",
"conc": 64,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "true",
"tput_per_gpu": 4039.4399914044375,
"output_tput_per_gpu": 448.14061342492425,
"input_tput_per_gpu": 3591.2993779795133,
"mean_ttft": 1.9403320959565462,
"median_ttft": 1.5385274032596499,
"std_ttft": 1.5327867870146954,
"p99_ttft": 7.444564528316262,
"mean_tpot": 0.03266003612145395,
"mean_intvty": 30.618459706574335,
"median_tpot": 0.033269112276465396,
"median_intvty": 30.057910523431705,
"std_tpot": 0.002055926279944184,
"std_intvty": 486.3987632996008,
"p99_tpot": 0.03514039065612715,
"p99_intvty": 28.457281815267407,
"mean_itl": 0.32492609184558285,
"median_itl": 0.2523008822463453,
"std_itl": 0.19926499189122818,
"p99_itl": 1.1554625461995598,
"mean_e2el": 32.03305733000743,
"median_e2el": 32.15045985812321,
"std_e2el": 3.0687930399387624,
"p99_e2el": 39.25228957726156
},
{
"hw": "gb200",
"conc": 1075,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "mtp",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "true",
"prefill_num_workers": 8,
"decode_tp": 16,
"decode_ep": 16,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 32,
"num_decode_gpu": 16,
"tput_per_gpu": 9233.401708226402,
"output_tput_per_gpu": 3077.373733817216,
"input_tput_per_gpu": 12311.415695430995,
"mean_ttft": 2.1064310233866212,
"median_ttft": 0.9636198990046978,
"std_ttft": 3.341042126262255,
"p99_ttft": 16.420732684521937,
"mean_tpot": 0.01804605028004771,
"mean_intvty": 55.41378775308145,
"median_tpot": 0.01845270738375185,
"median_intvty": 54.19258969448187,
"std_tpot": 0.002189632207097127,
"std_intvty": 456.6977032758097,
"p99_tpot": 0.021900060452733645,
"p99_intvty": 45.66197441136179,
"mean_itl": 0.8227767366879936,
"median_itl": 0.8399487570859492,
"std_itl": 0.13846358345664037,
"p99_itl": 1.1489782650489369,
"mean_e2el": 18.717432174078258,
"median_e2el": 18.16419094009325,
"std_e2el": 3.869985421649112,
"p99_e2el": 33.5523273987323
},
{
"hw": "b200-trt",
"conc": 64,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "true",
"tput_per_gpu": 2500.7800853320427,
"output_tput_per_gpu": 277.43972527535664,
"input_tput_per_gpu": 2223.340360056686,
"mean_ttft": 2.1233679967459467,
"median_ttft": 1.7135784078855067,
"std_ttft": 1.2630110451786642,
"p99_ttft": 6.32179072667379,
"mean_tpot": 0.02573747158813277,
"mean_intvty": 38.85385542148933,
"median_tpot": 0.026117813198508855,
"median_intvty": 38.28804473021857,
"std_tpot": 0.0015310586502664206,
"std_intvty": 653.1428432385586,
"p99_tpot": 0.02763274285673375,
"p99_intvty": 36.188951823735174,
"mean_itl": 0.25595965327166226,
"median_itl": 0.20479795802384615,
"std_itl": 0.1421031263967647,
"p99_itl": 0.8091693890281021,
"mean_e2el": 25.82883118392856,
"median_e2el": 25.83732792409137,
"std_e2el": 2.3185534271187658,
"p99_e2el": 30.973468162892388
},
{
"hw": "mi300x",
"conc": 8,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 317.6846163622444,
"output_tput_per_gpu": 35.71310902676229,
"input_tput_per_gpu": 281.9715073354821,
"mean_ttft": 0.9506415269104764,
"median_ttft": 0.5519005744718015,
"std_ttft": 1.1815322924205323,
"p99_ttft": 4.751317957779393,
"mean_tpot": 0.02649693454764344,
"mean_intvty": 37.74021474831084,
"median_tpot": 0.026639848906687438,
"median_intvty": 37.537750439304055,
"std_tpot": 0.0010727579549086575,
"std_intvty": 932.1767276805208,
"p99_tpot": 0.028324563560570074,
"p99_intvty": 35.30504531381643,
"mean_itl": 0.026673323400366718,
"median_itl": 0.02348742913454771,
"std_itl": 0.03197309771766286,
"p99_itl": 0.17916419852525697,
"mean_e2el": 25.48509730824735,
"median_e2el": 25.54096972150728,
"std_e2el": 2.391355606991649,
"p99_e2el": 31.552809298196802
},
{
"hw": "b200",
"conc": 32,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 2301.5310422927796,
"output_tput_per_gpu": 257.3752060198175,
"input_tput_per_gpu": 2044.155836272962,
"mean_ttft": 7.8412173505910685,
"median_ttft": 0.8220704429986654,
"std_ttft": 21.501532889908496,
"p99_ttft": 75.51703269016609,
"mean_tpot": 0.022066732878212047,
"mean_intvty": 45.31708456884284,
"median_tpot": 0.02219183364498633,
"median_intvty": 45.061621134940516,
"std_tpot": 0.004806934600220462,
"std_intvty": 208.03278662333716,
"p99_tpot": 0.0255744576012413,
"p99_intvty": 39.10151353323181,
"mean_itl": 0.21974702463619594,
"median_itl": 0.16498008198686875,
"std_itl": 0.46547852681888313,
"p99_itl": 0.8721297541027888,
"mean_e2el": 28.26258290100668,
"median_e2el": 21.43162561149802,
"std_e2el": 22.01860737499612,
"p99_e2el": 97.5850756896814
},
{
"hw": "mi355x",
"conc": 4,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 219.2058305887769,
"output_tput_per_gpu": 24.328695196707926,
"input_tput_per_gpu": 194.87713539206896,
"mean_ttft": 0.38048380316467956,
"median_ttft": 0.357451891497476,
"std_ttft": 0.12048107975009759,
"p99_ttft": 1.2580000954505521,
"mean_tpot": 0.019982858454882294,
"mean_intvty": 50.042890623372045,
"median_tpot": 0.020016966119899497,
"median_intvty": 49.95762065090716,
"std_tpot": 0.000263354860764197,
"std_intvty": 3797.1579377658845,
"p99_tpot": 0.020516017260348324,
"p99_intvty": 48.74240391348851,
"mean_itl": 0.019986112828061633,
"median_itl": 0.019038597005419433,
"std_itl": 0.012951535641965682,
"p99_itl": 0.01951627938484308,
"mean_e2el": 18.723737870464685,
"median_e2el": 18.61661264199938,
"std_e2el": 1.2901025796994765,
"p99_e2el": 20.99557187214785
},
{
"hw": "b200",
"conc": 16,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 837.2632566036258,
"output_tput_per_gpu": 92.77412495247107,
"input_tput_per_gpu": 744.4891316511547,
"mean_ttft": 7.346093868892058,
"median_ttft": 0.44564974238164723,
"std_ttft": 21.462074860963853,
"p99_ttft": 74.86600647589192,
"mean_tpot": 0.013076046499885401,
"mean_intvty": 76.47571458305568,
"median_tpot": 0.01270758546492407,
"median_intvty": 78.69315557705558,
"std_tpot": 0.005868841054208257,
"std_intvty": 170.39139257025016,
"p99_tpot": 0.014308813699425209,
"p99_intvty": 69.88699559629954,
"mean_itl": 0.13069313384604697,
"median_itl": 0.10502965981140733,
"std_itl": 0.6153074056532669,
"p99_itl": 0.38934738277457653,
"mean_e2el": 19.35270854209084,
"median_e2el": 12.195858581922948,
"std_e2el": 22.06278619611083,
"p99_e2el": 86.99084956155625
},
{
"hw": "gb200",
"conc": 1075,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "true",
"prefill_num_workers": 6,
"decode_tp": 16,
"decode_ep": 16,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 24,
"num_decode_gpu": 16,
"tput_per_gpu": 8538.301826104751,
"output_tput_per_gpu": 2371.421588106617,
"input_tput_per_gpu": 12649.555318103507,
"mean_ttft": 2.518141946902683,
"median_ttft": 1.1860549856210127,
"std_ttft": 4.169317173673007,
"p99_ttft": 20.62872563678306,
"mean_tpot": 0.02401919273797879,
"mean_intvty": 41.63337256621514,
"median_tpot": 0.023991736853464916,
"median_intvty": 41.681017348086606,
"std_tpot": 0.0005261478464042074,
"std_intvty": 1900.6064679997962,
"p99_tpot": 0.024990018246416678,
"p99_intvty": 40.01597718494624,
"mean_itl": 0.47488815467116247,
"median_itl": 0.4754273899598047,
"std_itl": 0.12395453475440078,
"p99_itl": 0.8820144593180157,
"mean_e2el": 24.636554613000804,
"median_e2el": 23.618903727037832,
"std_e2el": 4.35261833660063,
"p99_e2el": 42.57424763933988
},
{
"hw": "b200",
"conc": 32,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1177.538105007635,
"output_tput_per_gpu": 130.48831401803506,
"input_tput_per_gpu": 1047.0497909896,
"mean_ttft": 2.515538578672695,
"median_ttft": 1.0159940570010804,
"std_ttft": 10.23701937122468,
"p99_ttft": 75.20961700127926,
"mean_tpot": 0.027810195781825676,
"mean_intvty": 35.958035241647345,
"median_tpot": 0.027947115062819966,
"median_intvty": 35.781868638397356,
"std_tpot": 0.002472462754287595,
"std_intvty": 404.4550310276103,
"p99_tpot": 0.030464658785577622,
"p99_intvty": 32.824920411497054,
"mean_itl": 0.8220260250175352,
"median_itl": 0.7229501069814432,
"std_itl": 0.48101436383631047,
"p99_itl": 1.8961993484816049,
"mean_e2el": 28.089281865671236,
"median_e2el": 26.6855946410069,
"std_e2el": 10.68417967478143,
"p99_e2el": 101.54249752095551
},
{
"hw": "mi325x",
"conc": 4,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 180.95542048998547,
"output_tput_per_gpu": 20.129858249486677,
"input_tput_per_gpu": 160.82556224049878,
"mean_ttft": 0.6486394380044658,
"median_ttft": 0.5053747810306959,
"std_ttft": 0.509804821274982,
"p99_ttft": 2.437411218713969,
"mean_tpot": 0.02352722856969743,
"mean_intvty": 42.50394376190908,
"median_tpot": 0.023604686689607093,
"median_intvty": 42.36446825791973,
"std_tpot": 0.0005239884386049177,
"std_intvty": 1908.4390538509394,
"p99_tpot": 0.02491918904729561,
"p99_intvty": 40.12971682593846,
"mean_itl": 0.023593759353734683,
"median_itl": 0.022209363989531994,
"std_itl": 0.018567310800323842,
"p99_itl": 0.025867494596168397,
"mean_e2el": 22.27762796409079,
"median_e2el": 22.03107780049322,
"std_e2el": 1.622475479607264,
"p99_e2el": 25.969667697238037
},
{
"hw": "mi355x",
"conc": 8,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "amd/DeepSeek-R1-0528-MXFP4-Preview",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 512.677783401324,
"output_tput_per_gpu": 57.27014626637979,
"input_tput_per_gpu": 455.40763713494414,
"mean_ttft": 0.4901245410692354,
"median_ttft": 0.40002261751214974,
"std_ttft": 0.3389478769870244,
"p99_ttft": 2.717280547679111,
"mean_tpot": 0.01687030671907655,
"mean_intvty": 59.275745050279575,
"median_tpot": 0.016747889876596336,
"median_intvty": 59.70901453068483,
"std_tpot": 0.0007384861082538481,
"std_intvty": 1354.1216128824713,
"p99_tpot": 0.019766344978857913,
"p99_intvty": 50.59104255589995,
"mean_itl": 0.01687253422115094,
"median_itl": 0.013944172009360045,
"std_itl": 0.027909073423361717,
"p99_itl": 0.18017668889136984,
"mean_e2el": 16.095699895760916,
"median_e2el": 16.006529317994136,
"std_e2el": 1.269716810040318,
"p99_e2el": 20.155385893338817
},
{
"hw": "b200",
"conc": 16,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 1471.3151439223545,
"output_tput_per_gpu": 163.03113020918914,
"input_tput_per_gpu": 1308.2840137131655,
"mean_ttft": 7.316252358042402,
"median_ttft": 0.5668224240653217,
"std_ttft": 21.028670178086657,
"p99_ttft": 73.65484245303087,
"mean_tpot": 0.016041545342873213,
"mean_intvty": 62.33813380356591,
"median_tpot": 0.015663022736634155,
"median_intvty": 63.844636939784664,
"std_tpot": 0.005818694276593332,
"std_intvty": 171.85986279132533,
"p99_tpot": 0.017514719829078873,
"p99_intvty": 57.09483278971705,
"mean_itl": 0.16023431971287266,
"median_itl": 0.12932000495493412,
"std_itl": 0.6095403496942516,
"p99_itl": 0.4950766371935606,
"mean_e2el": 22.036778840312035,
"median_e2el": 15.004365941509604,
"std_e2el": 21.546524161009085,
"p99_e2el": 88.35054903371724
},
{
"hw": "mi300x",
"conc": 4,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 184.1127685340833,
"output_tput_per_gpu": 20.481088228670973,
"input_tput_per_gpu": 163.63168030541232,
"mean_ttft": 0.7100321100326255,
"median_ttft": 0.5409304718486965,
"std_ttft": 0.5939824596346324,
"p99_ttft": 2.793958247099072,
"mean_tpot": 0.02305216752571914,
"mean_intvty": 43.37986867761164,
"median_tpot": 0.023146613523796756,
"median_intvty": 43.20286416723172,
"std_tpot": 0.000585147604627715,
"std_intvty": 1708.9705094772185,
"p99_tpot": 0.024646108072951544,
"p99_intvty": 40.57435750261412,
"mean_itl": 0.023125471454715853,
"median_itl": 0.02166404016315937,
"std_itl": 0.02051715484663357,
"p99_itl": 0.0223434337601066,
"mean_e2el": 21.902791794692167,
"median_e2el": 21.598174387589097,
"std_e2el": 1.6272597230227128,
"p99_e2el": 25.82287181522697
},
{
"hw": "b200",
"conc": 8,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 482.68066497663915,
"output_tput_per_gpu": 54.26144775535788,
"input_tput_per_gpu": 428.41921722128126,
"mean_ttft": 5.0630103153846,
"median_ttft": 0.27130677935201675,
"std_ttft": 18.160536638694353,
"p99_ttft": 75.55408308801707,
"mean_tpot": 0.012677122067630133,
"mean_intvty": 78.88225692433838,
"median_tpot": 0.00978575107851566,
"median_intvty": 102.18939680526637,
"std_tpot": 0.014579133599425843,
"std_intvty": 68.5911815801854,
"p99_tpot": 0.08744661528986787,
"p99_intvty": 11.435548382120931,
"mean_itl": 0.12734670016805627,
"median_itl": 0.08708262001164258,
"std_itl": 1.4876815830522647,
"p99_itl": 0.2522495645005256,
"mean_e2el": 16.898294053721475,
"median_e2el": 9.477932222071104,
"std_e2el": 22.603062237039193,
"p99_e2el": 85.8774994056183
},
{
"hw": "gb200",
"conc": 256,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "true",
"prefill_num_workers": 4,
"decode_tp": 32,
"decode_ep": 32,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 16,
"num_decode_gpu": 32,
"tput_per_gpu": 2752.174615353429,
"output_tput_per_gpu": 457.8603816165605,
"input_tput_per_gpu": 7340.803082827166,
"mean_ttft": 1.1520327786048483,
"median_ttft": 0.6343708775239065,
"std_ttft": 1.620643642137298,
"p99_ttft": 7.9829079095111215,
"mean_tpot": 0.015427445740011703,
"mean_intvty": 64.81954413273091,
"median_tpot": 0.015426176520175089,
"median_intvty": 64.82487729167057,
"std_tpot": 9.902554535516517e-05,
"std_intvty": 10098.404370441976,
"p99_tpot": 0.01558766636266665,
"p99_intvty": 64.15328483005365,
"mean_itl": 0.30498817544223916,
"median_itl": 0.30559271411038935,
"std_itl": 0.028345733880089315,
"p99_itl": 0.332102476619184,
"mean_e2el": 15.338301514183513,
"median_e2el": 15.02746313449461,
"std_e2el": 1.8940207558440065,
"p99_e2el": 22.777227097956928
},
{
"hw": "gb200",
"conc": 538,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "true",
"prefill_num_workers": 4,
"decode_tp": 32,
"decode_ep": 32,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 16,
"num_decode_gpu": 32,
"tput_per_gpu": 4967.573349304615,
"output_tput_per_gpu": 827.6398142981578,
"input_tput_per_gpu": 13247.440419317527,
"mean_ttft": 1.8831174852832098,
"median_ttft": 0.9110365383094177,
"std_ttft": 3.073881532556376,
"p99_ttft": 15.310683140403132,
"mean_tpot": 0.017018650252751122,
"mean_intvty": 58.75906638591075,
"median_tpot": 0.01701614015714534,
"median_intvty": 58.76773409039444,
"std_tpot": 0.00014186863504673597,
"std_intvty": 7048.774379696884,
"p99_tpot": 0.017256080491768154,
"p99_intvty": 57.95058735829613,
"mean_itl": 0.33648312383923984,
"median_itl": 0.33768398710526526,
"std_itl": 0.03738728984477481,
"p99_itl": 0.3905374991241839,
"mean_e2el": 17.547954732914672,
"median_e2el": 16.77644322544802,
"std_e2el": 3.218724068873671,
"p99_e2el": 30.81878778469982
},
{
"hw": "h200",
"conc": 16,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 676.7679759239769,
"output_tput_per_gpu": 74.99022113653574,
"input_tput_per_gpu": 601.7777547874412,
"mean_ttft": 0.750748869497329,
"median_ttft": 0.4022683734074235,
"std_ttft": 1.040822630811197,
"p99_ttft": 5.500322663653641,
"mean_tpot": 0.025165212229251787,
"mean_intvty": 39.737395849879235,
"median_tpot": 0.025311677190891217,
"median_intvty": 39.507457070441184,
"std_tpot": 0.001582628770010665,
"std_intvty": 631.860117134899,
"p99_tpot": 0.028575305277262358,
"p99_intvty": 34.995251679628055,
"mean_itl": 0.2504382571738513,
"median_itl": 0.20403360202908516,
"std_itl": 0.15358003360593284,
"p99_itl": 0.843635570704937,
"mean_e2el": 23.778546434972668,
"median_e2el": 23.74889154592529,
"std_e2el": 2.4637958252572965,
"p99_e2el": 29.61175758401863
},
{
"hw": "b200",
"conc": 4,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 507.72942304803763,
"output_tput_per_gpu": 56.48087958556777,
"input_tput_per_gpu": 451.24854346246985,
"mean_ttft": 5.808408542454709,
"median_ttft": 0.2693438365822658,
"std_ttft": 19.33409899880523,
"p99_ttft": 73.78939125297126,
"mean_tpot": 0.011402564118629386,
"mean_intvty": 87.69957262210968,
"median_tpot": 0.009224447141051363,
"median_intvty": 108.4075809323814,
"std_tpot": 0.013737619209780939,
"std_intvty": 72.79281691605034,
"p99_tpot": 0.06308414925297698,
"p99_intvty": 15.851842528459072,
"mean_itl": 0.11127880160588646,
"median_itl": 0.0860871869372204,
"std_itl": 1.2052233809847699,
"p99_itl": 0.1870790808927268,
"mean_e2el": 16.08500567471492,
"median_e2el": 8.893703158479184,
"std_e2el": 22.14568341030708,
"p99_e2el": 83.36551153261681
},
{
"hw": "b200-trt",
"conc": 16,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1208.1222212312405,
"output_tput_per_gpu": 133.86767068344633,
"input_tput_per_gpu": 1074.2545505477942,
"mean_ttft": 0.37645336196255813,
"median_ttft": 0.22251114051323384,
"std_ttft": 0.48190334738313195,
"p99_ttft": 2.6340389266418054,
"mean_tpot": 0.014139317733057849,
"mean_intvty": 70.72477037997322,
"median_tpot": 0.014257741648732471,
"median_intvty": 70.13733483443369,
"std_tpot": 0.0007879228903661586,
"std_intvty": 1269.1597264489756,
"p99_tpot": 0.015624106447406659,
"p99_intvty": 64.00366020073956,
"mean_itl": 0.14069491302045709,
"median_itl": 0.11752789799356833,
"std_itl": 0.0750810301005546,
"p99_itl": 0.4357020774792181,
"mean_e2el": 13.313350450949292,
"median_e2el": 13.300510581990238,
"std_e2el": 1.2602089084386083,
"p99_e2el": 15.897245508546183
},
{
"hw": "b200-trt",
"conc": 8,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 1224.8636238931128,
"output_tput_per_gpu": 137.69533017969786,
"input_tput_per_gpu": 1087.168293713415,
"mean_ttft": 0.3790427656866086,
"median_ttft": 0.27511315849551465,
"std_ttft": 0.30480832483486997,
"p99_ttft": 1.7339300233489479,
"mean_tpot": 0.013897737961849973,
"mean_intvty": 71.95415561475205,
"median_tpot": 0.013965920087978298,
"median_intvty": 71.60287282903677,
"std_tpot": 0.0005279469958129181,
"std_intvty": 1894.1295393872408,
"p99_tpot": 0.014688034906271992,
"p99_intvty": 68.08262687154878,
"mean_itl": 0.13830877348273976,
"median_itl": 0.12473180200322531,
"std_itl": 0.06189889661543733,
"p99_itl": 0.3536590679641813,
"mean_e2el": 13.24694507352433,
"median_e2el": 13.212132127504447,
"std_e2el": 1.1800964361262238,
"p99_e2el": 15.836200448816815
},
{
"hw": "gb200",
"conc": 2150,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "true",
"prefill_num_workers": 8,
"decode_tp": 16,
"decode_ep": 16,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 32,
"num_decode_gpu": 16,
"tput_per_gpu": 10299.107021437163,
"output_tput_per_gpu": 3435.7187149840856,
"input_tput_per_gpu": 13730.801174663702,
"mean_ttft": 3.8990909919013044,
"median_ttft": 1.8011172800324857,
"std_ttft": 6.164669511903034,
"p99_ttft": 30.892803065592886,
"mean_tpot": 0.03295490309860667,
"mean_intvty": 30.344498268067426,
"median_tpot": 0.033077127553692645,
"median_intvty": 30.23237124737461,
"std_tpot": 0.0017914080948650624,
"std_intvty": 558.2200967308485,
"p99_tpot": 0.03567550727398745,
"p99_intvty": 28.03043534377836,
"mean_itl": 0.6515661857164307,
"median_itl": 0.645504271844402,
"std_itl": 0.2473214988027601,
"p99_itl": 1.5211918519157916,
"mean_e2el": 34.266204246108416,
"median_e2el": 32.934241177048534,
"std_e2el": 6.120593926078832,
"p99_e2el": 59.86751466851914
},
{
"hw": "b200-trt",
"conc": 64,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 1637.749464713586,
"output_tput_per_gpu": 181.69400989118543,
"input_tput_per_gpu": 1456.0554548224004,
"mean_ttft": 1.3771952071758278,
"median_ttft": 0.42999805556610227,
"std_ttft": 3.1501125877184,
"p99_ttft": 16.38412256734446,
"mean_tpot": 0.04176386595160758,
"mean_intvty": 23.944143512928495,
"median_tpot": 0.04281118052408689,
"median_intvty": 23.358384136063925,
"std_tpot": 0.0042352467637212015,
"std_intvty": 236.11375104891718,
"p99_tpot": 0.049281304597262376,
"p99_intvty": 20.291670607590834,
"mean_itl": 0.41589849655396766,
"median_itl": 0.2563164639286697,
"std_itl": 0.3111211880234757,
"p99_itl": 1.5731531271152197,
"mean_e2el": 39.89524440008972,
"median_e2el": 40.1207730746828,
"std_e2el": 5.834619075011206,
"p99_e2el": 56.98730580197648
},
{
"hw": "mi300x",
"conc": 64,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 920.7052630820897,
"output_tput_per_gpu": 102.14421361850924,
"input_tput_per_gpu": 818.5610494635805,
"mean_ttft": 3.0407275320059854,
"median_ttft": 0.7711951732635498,
"std_ttft": 6.842779948595821,
"p99_ttft": 30.29616045454517,
"mean_tpot": 0.0738018306797155,
"mean_intvty": 13.549799385597773,
"median_tpot": 0.07622416787458494,
"median_intvty": 13.119198646357743,
"std_tpot": 0.00819397056888694,
"std_intvty": 122.04095579706713,
"p99_tpot": 0.08629243553755318,
"p99_intvty": 11.588501283693805,
"mean_itl": 0.07409451895143006,
"median_itl": 0.04545384645462036,
"std_itl": 0.16170734161961228,
"p99_itl": 0.7676080838590859,
"mean_e2el": 71.10696224763815,
"median_e2el": 71.53962715994567,
"std_e2el": 10.598528465184879,
"p99_e2el": 101.54585940668359
},
{
"hw": "gb200",
"conc": 2150,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "true",
"prefill_num_workers": 5,
"decode_tp": 8,
"decode_ep": 8,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 20,
"num_decode_gpu": 8,
"tput_per_gpu": 11444.708879814294,
"output_tput_per_gpu": 4454.198518191032,
"input_tput_per_gpu": 14240.913024463598,
"mean_ttft": 6.247361848174911,
"median_ttft": 2.5925040814327076,
"std_ttft": 9.650712860191684,
"p99_ttft": 47.792469649808304,
"mean_tpot": 0.05095421694159596,
"mean_intvty": 19.62546105155941,
"median_tpot": 0.052116931600052284,
"median_intvty": 19.187622319633967,
"std_tpot": 0.0032481921557769673,
"std_intvty": 307.86355980248345,
"p99_tpot": 0.052986197743270035,
"p99_intvty": 18.872839391971915,
"mean_itl": 1.0074163484158434,
"median_itl": 1.0346973119303584,
"std_itl": 0.19531822468300844,
"p99_itl": 1.5723083199281238,
"mean_e2el": 53.19934775322382,
"median_e2el": 51.34595257346518,
"std_e2el": 9.65662227443855,
"p99_e2el": 94.25640178131634
},
{
"hw": "mi355x",
"conc": 16,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 634.2345764034405,
"output_tput_per_gpu": 70.2313302564325,
"input_tput_per_gpu": 564.003246147008,
"mean_ttft": 0.4475805068438785,
"median_ttft": 0.34923084401816595,
"std_ttft": 0.6135175404200965,
"p99_ttft": 4.863295131706109,
"mean_tpot": 0.02786531021292415,
"mean_intvty": 35.886914316001125,
"median_tpot": 0.027963086582467496,
"median_intvty": 35.76143130876645,
"std_tpot": 0.0008308632773424949,
"std_intvty": 1203.5674548025356,
"p99_tpot": 0.02914472677817151,
"p99_intvty": 34.31152426342074,
"mean_itl": 0.027869627368424806,
"median_itl": 0.023072493000654504,
"std_itl": 0.02911671379150621,
"p99_itl": 0.20989779373456263,
"mean_e2el": 26.08415374837965,
"median_e2el": 26.11947019948275,
"std_e2el": 1.8270338749387993,
"p99_e2el": 29.821774125800292
},
{
"hw": "b200",
"conc": 16,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 865.1921611970172,
"output_tput_per_gpu": 95.80618696772564,
"input_tput_per_gpu": 769.3859742292916,
"mean_ttft": 2.1472244265405243,
"median_ttft": 0.7224529055092717,
"std_ttft": 9.883300575485723,
"p99_ttft": 73.46784980355238,
"mean_tpot": 0.018476181929085674,
"mean_intvty": 54.12373637790255,
"median_tpot": 0.018465151396356086,
"median_intvty": 54.15606828966158,
"std_tpot": 0.0028951210010518323,
"std_intvty": 345.4087064536121,
"p99_tpot": 0.020099600272428984,
"p99_intvty": 49.7522332009617,
"mean_itl": 0.5459599184381234,
"median_itl": 0.5457387059868779,
"std_itl": 0.5054164995183363,
"p99_itl": 1.1290990548813715,
"mean_e2el": 19.149098619363066,
"median_e2el": 17.78626521100523,
"std_e2el": 10.255626146885264,
"p99_e2el": 90.3278308190996
},
{
"hw": "mi355x",
"conc": 8,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 382.29189740846493,
"output_tput_per_gpu": 42.705015879137676,
"input_tput_per_gpu": 339.58688152932723,
"mean_ttft": 0.39702025984661304,
"median_ttft": 0.3544773744943086,
"std_ttft": 0.2831864619367688,
"p99_ttft": 2.492587306512578,
"mean_tpot": 0.02289633409096701,
"mean_intvty": 43.675113929898366,
"median_tpot": 0.022927307991622355,
"median_intvty": 43.616110551025024,
"std_tpot": 0.00039671269464354084,
"std_intvty": 2520.71590726012,
"p99_tpot": 0.023670400013219642,
"p99_intvty": 42.246856810257185,
"mean_itl": 0.022899045672645696,
"median_itl": 0.020637310997699387,
"std_itl": 0.019960626489843078,
"p99_itl": 0.1376733028300805,
"mean_e2el": 21.576576341488835,
"median_e2el": 21.536142249504337,
"std_e2el": 1.3642916594387298,
"p99_e2el": 24.45491708109504
},
{
"hw": "h200",
"conc": 32,
"image": "lmsysorg/sglang:v0.5.5-cu129-amd64",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 990.8747679014797,
"output_tput_per_gpu": 110.80736815716604,
"input_tput_per_gpu": 880.0673997443137,
"mean_ttft": 1.0185304569567961,
"median_ttft": 0.3712034404743463,
"std_ttft": 2.0388219174690403,
"p99_ttft": 10.207971199164168,
"mean_tpot": 0.03427877982180514,
"mean_intvty": 29.172566969956385,
"median_tpot": 0.03487023366150224,
"median_intvty": 28.67775449133366,
"std_tpot": 0.0028907192372113204,
"std_intvty": 345.9346681363289,
"p99_tpot": 0.040728756268101476,
"p99_intvty": 24.552677067214894,
"mean_itl": 0.3414085586368336,
"median_itl": 0.258241094998084,
"std_itl": 0.23986646540861153,
"p99_itl": 1.095553965724539,
"mean_e2el": 32.74605443468899,
"median_e2el": 32.57929826248437,
"std_e2el": 4.069978464541147,
"p99_e2el": 44.22934253697517
},
{
"hw": "gb200",
"conc": 269,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "mtp",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "true",
"prefill_num_workers": 5,
"decode_tp": 32,
"decode_ep": 32,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 20,
"num_decode_gpu": 32,
"tput_per_gpu": 4304.411851176555,
"output_tput_per_gpu": 776.8841508348211,
"input_tput_per_gpu": 9948.456171723328,
"mean_ttft": 1.1125674064146298,
"median_ttft": 0.6998088390100747,
"std_ttft": 1.322866480490014,
"p99_ttft": 6.907846250981092,
"mean_tpot": 0.008848149833086611,
"mean_intvty": 113.01797764099994,
"median_tpot": 0.009158832200532789,
"median_intvty": 109.18422546728478,
"std_tpot": 0.0012763482174741436,
"std_intvty": 783.4852482334102,
"p99_tpot": 0.010923131222227126,
"p99_intvty": 91.54884068087841,
"mean_itl": 0.4266009805769804,
"median_itl": 0.4366688629379496,
"std_itl": 0.06475242308682255,
"p99_itl": 0.46653126130346206,
"mean_e2el": 9.25882227058898,
"median_e2el": 9.20318876800593,
"std_e2el": 1.796646641481754,
"p99_e2el": 15.491295086904422
},
{
"hw": "gb200",
"conc": 128,
"image": "nvcr.io#nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1-rc0.pre3",
"model": "/mnt/lustre01/models/deepseek-r1-0528-fp4-v2",
"infmax_model_prefix": "dsr1",
"framework": "dynamo-trt",
"precision": "fp4",
"spec_decoding": "mtp",
"disagg": true,
"isl": 8192,
"osl": 1024,
"is_multinode": true,
"prefill_tp": 4,
"prefill_ep": 4,
"prefill_dp_attention": "true",
"prefill_num_workers": 5,
"decode_tp": 32,
"decode_ep": 32,
"decode_dp_attention": "true",
"decode_num_workers": 1,
"num_prefill_gpu": 20,
"num_decode_gpu": 32,
"tput_per_gpu": 2390.09381945457,
"output_tput_per_gpu": 430.77098638871456,
"input_tput_per_gpu": 5525.010352359939,
"mean_ttft": 0.8001663754903348,
"median_ttft": 0.5341741489246488,
"std_ttft": 0.7896431137215334,
"p99_ttft": 3.921679448762443,
"mean_tpot": 0.0077841085935147466,
"mean_intvty": 128.46686142497293,
"median_tpot": 0.008022537140634556,
"median_intvty": 124.64884642726669,
"std_tpot": 0.0010641101178008232,
"std_intvty": 939.7523651656293,
"p99_tpot": 0.00946293918235997,
"p99_intvty": 105.67541233532575,
"mean_itl": 0.3754476652718847,
"median_itl": 0.3812358870636672,
"std_itl": 0.05131371984107209,
"p99_itl": 0.5296694240672355,
"mean_e2el": 7.958237334162732,
"median_e2el": 7.88502879452426,
"std_e2el": 1.3768396669937686,
"p99_e2el": 12.001887772262561
},
{
"hw": "b200-trt",
"conc": 16,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 1885.8915678605954,
"output_tput_per_gpu": 208.96885010008305,
"input_tput_per_gpu": 1676.9227177605123,
"mean_ttft": 0.47780037114571317,
"median_ttft": 0.2729991546366364,
"std_ttft": 0.6147553210875283,
"p99_ttft": 3.360503045502118,
"mean_tpot": 0.018121634889792813,
"mean_intvty": 55.18265907472067,
"median_tpot": 0.018359712874874173,
"median_intvty": 54.46708272701424,
"std_tpot": 0.0010004861454797102,
"std_intvty": 999.5140907428786,
"p99_tpot": 0.019790916950085113,
"p99_intvty": 50.52822981987701,
"mean_itl": 0.18033075214456945,
"median_itl": 0.15051730605773628,
"std_itl": 0.09711162105629116,
"p99_itl": 0.56170340961311,
"mean_e2el": 17.059212865747394,
"median_e2el": 17.080937705002725,
"std_e2el": 1.5902643480506222,
"p99_e2el": 20.14071827623993
},
{
"hw": "b200-trt",
"conc": 256,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2",
"model": "nvidia/DeepSeek-R1-0528-FP4-V2",
"infmax_model_prefix": "dsr1",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "true",
"tput_per_gpu": 8379.782783857037,
"output_tput_per_gpu": 931.3547547363726,
"input_tput_per_gpu": 7448.428029120665,
"mean_ttft": 3.8291982908263207,
"median_ttft": 2.2553267095063347,
"std_ttft": 5.574688864101993,
"p99_ttft": 29.678113492431873,
"mean_tpot": 0.06317870377702225,
"mean_intvty": 15.828118340783282,
"median_tpot": 0.06497323533097561,
"median_intvty": 15.390952827052093,
"std_tpot": 0.007214136687010129,
"std_intvty": 138.6167248259398,
"p99_tpot": 0.07472527315815405,
"p99_intvty": 13.382353221827996,
"mean_itl": 0.6294555099968256,
"median_itl": 0.338651594996918,
"std_itl": 0.515379541459243,
"p99_itl": 2.5154046444897533,
"mean_e2el": 62.15341463905437,
"median_e2el": 62.25008131399227,
"std_e2el": 9.586553725730353,
"p99_e2el": 90.42316819486354
},
{
"hw": "mi300x",
"conc": 32,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi30x-20250915",
"model": "deepseek-ai/DeepSeek-R1-0528",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp8",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 687.0857354165967,
"output_tput_per_gpu": 76.83530200399032,
"input_tput_per_gpu": 610.2504334126063,
"mean_ttft": 1.9010153681330848,
"median_ttft": 0.5940489638596773,
"std_ttft": 3.7868296860502357,
"p99_ttft": 16.153756891563535,
"mean_tpot": 0.04879907537911617,
"mean_intvty": 20.492191547300408,
"median_tpot": 0.04995595725133899,
"median_intvty": 20.01763263125534,
"std_tpot": 0.004245908908480345,
"std_intvty": 235.52083230111273,
"p99_tpot": 0.05595392082125036,
"p99_intvty": 17.871848573303495,
"mean_itl": 0.0489550924028865,
"median_itl": 0.03497420949861407,
"std_itl": 0.08612674640146575,
"p99_itl": 0.3490032650809735,
"mean_e2el": 47.0776920182805,
"median_e2el": 46.90478653507307,
"std_e2el": 6.00695425638055,
"p99_e2el": 64.31732356538997
},
{
"hw": "mi355x",
"conc": 4,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "amd/DeepSeek-R1-0528-MXFP4-Preview",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 295.1992531679218,
"output_tput_per_gpu": 32.76287238039319,
"input_tput_per_gpu": 262.4363807875286,
"mean_ttft": 0.5220288508494559,
"median_ttft": 0.46718842900008895,
"std_ttft": 0.30453560215908726,
"p99_ttft": 2.5146418720716612,
"mean_tpot": 0.014581581639174201,
"mean_intvty": 68.57966609832272,
"median_tpot": 0.014516709754308783,
"median_intvty": 68.88613307868779,
"std_tpot": 0.0005343806210948062,
"std_intvty": 1871.3253447538225,
"p99_tpot": 0.015997171651346485,
"p99_intvty": 62.51105019028972,
"mean_itl": 0.014588097895820695,
"median_itl": 0.013227311996161006,
"std_itl": 0.020423593759189076,
"p99_itl": 0.013742044462705964,
"mean_e2el": 13.910984840083838,
"median_e2el": 13.878547286003595,
"std_e2el": 1.163281147885685,
"p99_e2el": 17.072104147082413
},
{
"hw": "mi355x",
"conc": 64,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_sgl-dev-v0.5.2-rocm7.0-mi35x-20250915",
"model": "amd/DeepSeek-R1-0528-MXFP4-Preview",
"infmax_model_prefix": "dsr1",
"framework": "sglang",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 8192,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1465.1646234114025,
"output_tput_per_gpu": 162.9639881891013,
"input_tput_per_gpu": 1302.2006352223013,
"mean_ttft": 0.6637385652683497,
"median_ttft": 0.39956586598418653,
"std_ttft": 1.5682399987315447,
"p99_ttft": 9.975892807123932,
"mean_tpot": 0.04820004845471568,
"mean_intvty": 20.746867110299853,
"median_tpot": 0.04773250758358218,
"median_intvty": 20.950083090626368,
"std_tpot": 0.0038419758996953318,
"std_intvty": 260.2827363074557,
"p99_tpot": 0.0612119526577719,
"p99_intvty": 16.33667864821876,
"mean_itl": 0.04822319382959917,
"median_itl": 0.025938321981811896,
"std_itl": 0.08638930465146427,
"p99_itl": 0.34428852839424584,
"mean_e2el": 45.10309922204672,
"median_e2el": 44.73597106550005,
"std_e2el": 5.2234969472269395,
"p99_e2el": 63.690085355966865
}
]
[
{
"hw": "b200-trt",
"conc": 32,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 3689.846262597503,
"output_tput_per_gpu": 1847.845740951926,
"input_tput_per_gpu": 1842.000521645577,
"mean_ttft": 0.060899777625854766,
"median_ttft": 0.039984301009099,
"std_ttft": 0.04556034378724078,
"p99_ttft": 0.19892437519971282,
"mean_tpot": 0.004167873679333885,
"mean_intvty": 239.930496204439,
"median_tpot": 0.004237487628479157,
"median_intvty": 235.98888956730764,
"std_tpot": 0.0002422465162983656,
"std_intvty": 4128.026339781658,
"p99_tpot": 0.004454150494983589,
"p99_intvty": 224.50970193446153,
"mean_itl": 0.08247153964403803,
"median_itl": 0.0723747450101655,
"std_itl": 0.022182491200051082,
"p99_itl": 0.160371093894355,
"mean_e2el": 3.9089700732386516,
"median_e2el": 3.9136326605075737,
"std_e2el": 0.3459265792149717,
"p99_e2el": 4.54109464191366
},
{
"hw": "b200",
"conc": 32,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 1,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 7602.883631658102,
"output_tput_per_gpu": 3807.4638176991793,
"input_tput_per_gpu": 3795.4198139589225,
"mean_ttft": 0.07669692267663777,
"median_ttft": 0.05558649997692555,
"std_ttft": 0.060559491539779256,
"p99_ttft": 0.3106096883444116,
"mean_tpot": 0.008145879867904455,
"mean_intvty": 122.76144704025106,
"median_tpot": 0.008208736043844445,
"median_intvty": 121.82143446430813,
"std_tpot": 0.00033458969045434636,
"std_intvty": 2988.7352435817106,
"p99_tpot": 0.008590196043330793,
"p99_intvty": 116.41177860851897,
"mean_itl": 0.008148782537130694,
"median_itl": 0.007484257919713855,
"std_itl": 0.004111452607141441,
"p99_itl": 0.033296516723930844,
"mean_e2el": 7.595502033735829,
"median_e2el": 7.618681324413046,
"std_e2el": 0.588960810266648,
"p99_e2el": 8.626527023629752
},
{
"hw": "b200",
"conc": 8,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1166.310584096283,
"output_tput_per_gpu": 585.316978545609,
"input_tput_per_gpu": 580.993605550674,
"mean_ttft": 0.031120946774899495,
"median_ttft": 0.0254346625006292,
"std_ttft": 0.01718574408544217,
"p99_ttft": 0.09367322003119624,
"mean_tpot": 0.0033112877013097104,
"mean_intvty": 301.99731651359406,
"median_tpot": 0.0033165577251894037,
"median_intvty": 301.517441534322,
"std_tpot": 5.359211991660067e-05,
"std_intvty": 18659.459666014078,
"p99_tpot": 0.0034078644027575396,
"p99_intvty": 293.4389053715959,
"mean_itl": 0.00331219939275741,
"median_itl": 0.003253094997489825,
"std_itl": 0.0007900252088725939,
"p99_itl": 0.003858205090509727,
"mean_e2el": 3.104800459110629,
"median_e2el": 3.125818211992737,
"std_e2el": 0.20872906961377008,
"p99_e2el": 3.433981048737769
},
{
"hw": "mi325x",
"conc": 16,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 1,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 2540.252134770055,
"output_tput_per_gpu": 1263.4013989180514,
"input_tput_per_gpu": 1276.8507358520035,
"mean_ttft": 0.2402334263730154,
"median_ttft": 0.09123171848477796,
"std_ttft": 0.4914258493844526,
"p99_ttft": 2.526364121992374,
"mean_tpot": 0.012187509718794058,
"mean_intvty": 82.05121662039987,
"median_tpot": 0.012149165976526296,
"median_intvty": 82.31017684112018,
"std_tpot": 0.0006208774497487629,
"std_intvty": 1610.6238041092465,
"p99_tpot": 0.013437396524229752,
"p99_intvty": 74.41917771771055,
"mean_itl": 0.01219004952440334,
"median_itl": 0.011359554948285222,
"std_itl": 0.015603720224536013,
"p99_itl": 0.055523696537129574,
"mean_e2el": 11.392833212441474,
"median_e2el": 11.488652286527213,
"std_e2el": 1.0830521522383176,
"p99_e2el": 14.501852763823699
},
{
"hw": "h200",
"conc": 32,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 2123.980488408774,
"output_tput_per_gpu": 1063.672581471796,
"input_tput_per_gpu": 1060.3079069369778,
"mean_ttft": 0.0768819730321411,
"median_ttft": 0.04673737566918135,
"std_ttft": 0.09237556114311272,
"p99_ttft": 0.4888501213863492,
"mean_tpot": 0.00727211221662478,
"mean_intvty": 137.5116293879376,
"median_tpot": 0.007328341245213391,
"median_intvty": 136.45652768328222,
"std_tpot": 0.00027790309280715484,
"std_intvty": 3598.3766495680184,
"p99_tpot": 0.00764870908918309,
"p99_intvty": 130.74101633885041,
"mean_itl": 0.007274734661533487,
"median_itl": 0.006732508540153503,
"std_itl": 0.0036359498702856923,
"p99_itl": 0.02691005505621435,
"mean_e2el": 6.789211330557009,
"median_e2el": 6.798134110867977,
"std_e2el": 0.5292759948497667,
"p99_e2el": 7.811202338859439
},
{
"hw": "mi355x",
"conc": 4,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 492.447950446753,
"output_tput_per_gpu": 245.02584562871704,
"input_tput_per_gpu": 247.42210481803596,
"mean_ttft": 0.07249697877487052,
"median_ttft": 0.03217247300199233,
"std_ttft": 0.13130156082436156,
"p99_ttft": 0.5266118632673169,
"mean_tpot": 0.003878217161249459,
"mean_intvty": 257.85043962773517,
"median_tpot": 0.00386198190745019,
"median_intvty": 258.93440828163625,
"std_tpot": 0.00012358555384844449,
"std_intvty": 8091.560614166286,
"p99_tpot": 0.004366742227619835,
"p99_intvty": 229.0036709002323,
"mean_itl": 0.0038952170425121746,
"median_itl": 0.003805090003879741,
"std_itl": 0.0031545050053189733,
"p99_itl": 0.007535323558840901,
"mean_e2el": 3.627466609372641,
"median_e2el": 3.6375431714986917,
"std_e2el": 0.31743406697730314,
"p99_e2el": 4.569994975948939
},
{
"hw": "b200",
"conc": 16,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 1,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 5183.700668771386,
"output_tput_per_gpu": 2578.1278113524877,
"input_tput_per_gpu": 2605.572857418898,
"mean_ttft": 0.06101573557534721,
"median_ttft": 0.04918936057947576,
"std_ttft": 0.033589058814507466,
"p99_ttft": 0.1743406999274157,
"mean_tpot": 0.006021070978289163,
"mean_intvty": 166.0834100122403,
"median_tpot": 0.006014321928544655,
"median_intvty": 166.26978267556422,
"std_tpot": 0.00016412283022470227,
"std_intvty": 6092.997535022334,
"p99_tpot": 0.006324550579058726,
"p99_intvty": 158.11400154045862,
"mean_itl": 0.006024019155859474,
"median_itl": 0.005687859025783837,
"std_itl": 0.003049899054377615,
"p99_itl": 0.030966749044600856,
"mean_e2el": 5.571788319824554,
"median_e2el": 5.571907978388481,
"std_e2el": 0.42756853610269646,
"p99_e2el": 6.378625052149873
},
{
"hw": "mi355x",
"conc": 4,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 262.61246339646743,
"output_tput_per_gpu": 130.66729358500513,
"input_tput_per_gpu": 131.9451698114623,
"mean_ttft": 0.06873257897532312,
"median_ttft": 0.029190083994762972,
"std_ttft": 0.12546465165974446,
"p99_ttft": 0.5014870273807901,
"mean_tpot": 0.003637088200197662,
"mean_intvty": 274.9452157760853,
"median_tpot": 0.0036246977960844857,
"median_intvty": 275.88506856495235,
"std_tpot": 0.00010110284099756954,
"std_intvty": 9890.91889142897,
"p99_tpot": 0.00403327897572603,
"p99_intvty": 247.93722577050107,
"mean_itl": 0.004019715530289638,
"median_itl": 0.003828354994766414,
"std_itl": 0.004157339937469058,
"p99_itl": 0.010717550253029912,
"mean_e2el": 3.4045938289244075,
"median_e2el": 3.413628262002021,
"std_e2el": 0.2949518244162844,
"p99_e2el": 4.268707684431575
},
{
"hw": "h200",
"conc": 64,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 4174.003480897985,
"output_tput_per_gpu": 2086.531355677279,
"input_tput_per_gpu": 2087.4721252207064,
"mean_ttft": 0.16882001755875536,
"median_ttft": 0.07186553627252579,
"std_ttft": 0.2883316048010474,
"p99_ttft": 1.5043822941649727,
"mean_tpot": 0.014833687370758353,
"mean_intvty": 67.41412131762328,
"median_tpot": 0.01506714837910143,
"median_intvty": 66.36955944411014,
"std_tpot": 0.0007541673030519913,
"std_intvty": 1325.9657319445753,
"p99_tpot": 0.015604811174017704,
"p99_intvty": 64.08280041638814,
"mean_itl": 0.014842403010263237,
"median_itl": 0.012997192330658436,
"std_itl": 0.008552544520103846,
"p99_itl": 0.04618335440754891,
"mean_e2el": 13.835913319252722,
"median_e2el": 13.91339274449274,
"std_e2el": 1.175626801937161,
"p99_e2el": 15.94192433629185
},
{
"hw": "h200",
"conc": 4,
"image": "nvcr.io#nvidia/tensorrt-llm/release:gpt-oss-dev",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 2,
"dp_attention": "false",
"tput_per_gpu": 651.1339086699575,
"output_tput_per_gpu": 323.9827406016166,
"input_tput_per_gpu": 327.15116806834084,
"mean_ttft": 0.10503446066286415,
"median_ttft": 0.08345543593168259,
"std_ttft": 0.05166740048867742,
"p99_ttft": 0.25765551391057673,
"mean_tpot": 0.005901350217419935,
"mean_intvty": 169.45274609328288,
"median_tpot": 0.005885945776481759,
"median_intvty": 169.8962304402566,
"std_tpot": 0.00018433193722023337,
"std_intvty": 5424.995880150898,
"p99_tpot": 0.006192490271329341,
"p99_intvty": 161.48592184793696,
"mean_itl": 0.11653867604007356,
"median_itl": 0.11480593029409647,
"std_itl": 0.019518275250031785,
"p99_itl": 0.19544631127268078,
"mean_e2el": 5.515342274680734,
"median_e2el": 5.530894096940756,
"std_e2el": 0.4198278143516758,
"p99_e2el": 6.349189847251401
},
{
"hw": "h200",
"conc": 16,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1371.9515733482624,
"output_tput_per_gpu": 682.3438954310407,
"input_tput_per_gpu": 689.6076779172217,
"mean_ttft": 0.05959075075224973,
"median_ttft": 0.04171137744560838,
"std_ttft": 0.05238350020319549,
"p99_ttft": 0.2676847441960126,
"mean_tpot": 0.00567819893213398,
"mean_intvty": 176.1121813363767,
"median_tpot": 0.005714002244228618,
"median_intvty": 175.00868170117397,
"std_tpot": 0.00018616628483544877,
"std_intvty": 5371.541903432696,
"p99_tpot": 0.005929618231505807,
"p99_intvty": 168.64492130146013,
"mean_itl": 0.005680408997206748,
"median_itl": 0.005421892739832401,
"std_itl": 0.002494848458149278,
"p99_itl": 0.02410193383693695,
"mean_e2el": 5.256561317475279,
"median_e2el": 5.273034567013383,
"std_e2el": 0.4020339715456364,
"p99_e2el": 6.028402329972014
},
{
"hw": "h100",
"conc": 4,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 268.3149255037443,
"output_tput_per_gpu": 133.50465050512963,
"input_tput_per_gpu": 134.81027499861466,
"mean_ttft": 0.03349371212534606,
"median_ttft": 0.029977641068398952,
"std_ttft": 0.01234626336403335,
"p99_ttft": 0.07700940977782011,
"mean_tpot": 0.003600794839641741,
"mean_intvty": 277.71646109654347,
"median_tpot": 0.0036124863608013947,
"median_intvty": 276.8176541372906,
"std_tpot": 5.3336237236399103e-05,
"std_intvty": 18748.979152161748,
"p99_tpot": 0.0036632057768453126,
"p99_intvty": 272.9849374886011,
"mean_itl": 0.0036007176844215253,
"median_itl": 0.003573404625058174,
"std_itl": 0.0007225766712512269,
"p99_itl": 0.0037983634322881696,
"mean_e2el": 3.3343614934477954,
"median_e2el": 3.3723433148115873,
"std_e2el": 0.2282133637891943,
"p99_e2el": 3.71164050180465
},
{
"hw": "mi325x",
"conc": 8,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 390.3316192830276,
"output_tput_per_gpu": 195.88926581386136,
"input_tput_per_gpu": 194.44235346916622,
"mean_ttft": 0.10713937270920723,
"median_ttft": 0.04173803247977048,
"std_ttft": 0.18408748573589903,
"p99_ttft": 0.6820602420298383,
"mean_tpot": 0.0048807055367880315,
"mean_intvty": 204.88841059198484,
"median_tpot": 0.004830019916492834,
"median_intvty": 207.0384837514538,
"std_tpot": 0.00017538033949722584,
"std_intvty": 5701.89339846624,
"p99_tpot": 0.005455788062510682,
"p99_intvty": 183.29157741142407,
"mean_itl": 0.0048822005585692895,
"median_itl": 0.004685960535425693,
"std_itl": 0.00517833065749836,
"p99_itl": 0.006427322066156198,
"mean_e2el": 4.638187316100812,
"median_e2el": 4.630835668591317,
"std_e2el": 0.4104963536576695,
"p99_e2el": 5.546853242476935
},
{
"hw": "b200-trt",
"conc": 8,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 745.4323787746541,
"output_tput_per_gpu": 374.0978034530357,
"input_tput_per_gpu": 371.33457532161844,
"mean_ttft": 0.04381445527615142,
"median_ttft": 0.03806255898962263,
"std_ttft": 0.015589188003182056,
"p99_ttft": 0.0912609337156755,
"mean_tpot": 0.002573231984790653,
"mean_intvty": 388.61634159321846,
"median_tpot": 0.0025841249207389848,
"median_intvty": 386.97819597437615,
"std_tpot": 5.635773146899805e-05,
"std_intvty": 17743.7943993557,
"p99_tpot": 0.0026580803950445224,
"p99_intvty": 376.2113447976618,
"mean_itl": 0.050869726555483365,
"median_itl": 0.04864301098859869,
"std_itl": 0.009911396489327434,
"p99_itl": 0.08515014345292067,
"mean_e2el": 2.4327837946744695,
"median_e2el": 2.4468470125138992,
"std_e2el": 0.17229500232417488,
"p99_e2el": 2.7360667555787943
},
{
"hw": "b200",
"conc": 8,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 594.290732719538,
"output_tput_per_gpu": 298.2468484778322,
"input_tput_per_gpu": 296.0438842417058,
"mean_ttft": 0.029439074450419867,
"median_ttft": 0.024961234506918117,
"std_ttft": 0.015250177951727796,
"p99_ttft": 0.09428537746280198,
"mean_tpot": 0.0032494225567895386,
"mean_intvty": 307.7469865870599,
"median_tpot": 0.003267303210034475,
"median_intvty": 306.0628095148379,
"std_tpot": 4.605842108879209e-05,
"std_intvty": 21711.556244452793,
"p99_tpot": 0.003311175485672793,
"p99_intvty": 302.00755119350356,
"mean_itl": 0.003250720829498557,
"median_itl": 0.0032049329893197864,
"std_itl": 0.000678147054021881,
"p99_itl": 0.0036991537199355658,
"mean_e2el": 3.0454170899996824,
"median_e2el": 3.0725042970007053,
"std_e2el": 0.19896303298183382,
"p99_e2el": 3.3656229261145927
},
{
"hw": "h200",
"conc": 16,
"image": "nvcr.io#nvidia/tensorrt-llm/release:gpt-oss-dev",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 1,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 2734.450998293654,
"output_tput_per_gpu": 1359.986738808424,
"input_tput_per_gpu": 1374.4642594852298,
"mean_ttft": 0.10729892583040054,
"median_ttft": 0.06818056956399232,
"std_ttft": 0.11054903545901344,
"p99_ttft": 0.4553413995518349,
"mean_tpot": 0.011435734086395557,
"mean_intvty": 87.44519524895591,
"median_tpot": 0.011667055029314085,
"median_intvty": 85.71143253266979,
"std_tpot": 0.0005059158189239366,
"std_intvty": 1976.6134257809165,
"p99_tpot": 0.011983248664460903,
"p99_intvty": 83.44982466780743,
"mean_itl": 0.22621637187020302,
"median_itl": 0.22301738779060543,
"std_itl": 0.03312341465782407,
"p99_itl": 0.3214122236194088,
"mean_e2el": 10.574047542730113,
"median_e2el": 10.582488624495454,
"std_e2el": 0.8625386093296042,
"p99_e2el": 12.093909448911436
},
{
"hw": "b200-trt",
"conc": 16,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 3755.568534843961,
"output_tput_per_gpu": 1867.8423593113043,
"input_tput_per_gpu": 1887.7261755326567,
"mean_ttft": 0.050820614462099914,
"median_ttft": 0.038197883492102847,
"std_ttft": 0.03025555917268575,
"p99_ttft": 0.14446418865496524,
"mean_tpot": 0.004144876224499848,
"mean_intvty": 241.26172793511284,
"median_tpot": 0.004165408210272758,
"median_intvty": 240.07250898814505,
"std_tpot": 0.00014185824324039375,
"std_intvty": 7049.2907367067455,
"p99_tpot": 0.0043600367564603166,
"p99_intvty": 229.35586460785876,
"mean_itl": 0.08199597808513263,
"median_itl": 0.0782023879874032,
"std_itl": 0.014235161427339028,
"p99_itl": 0.13573608392907752,
"mean_e2el": 3.8446718498922565,
"median_e2el": 3.8331850554823177,
"std_e2el": 0.3069525748455887,
"p99_e2el": 4.404732652925013
},
{
"hw": "mi355x",
"conc": 8,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 1,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 2906.594973331196,
"output_tput_per_gpu": 1458.684685575677,
"input_tput_per_gpu": 1447.910287755519,
"mean_ttft": 0.09498267080016376,
"median_ttft": 0.04236132249934599,
"std_ttft": 0.15555567173792414,
"p99_ttft": 0.58666622710909,
"mean_tpot": 0.0052827910900757075,
"mean_intvty": 189.2938757087343,
"median_tpot": 0.005278381744459358,
"median_intvty": 189.45200412033208,
"std_tpot": 0.0001450475564647596,
"std_intvty": 6894.290564922117,
"p99_tpot": 0.005531494676286196,
"p99_intvty": 180.7829634704435,
"mean_itl": 0.005881744195618827,
"median_itl": 0.005324293990270235,
"std_itl": 0.0025529779958032533,
"p99_itl": 0.010722235595749221,
"mean_e2el": 4.998886723273972,
"median_e2el": 5.014979108498665,
"std_e2el": 0.39538563402536975,
"p99_e2el": 5.745054367018164
},
{
"hw": "h200",
"conc": 4,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 779.8826441028924,
"output_tput_per_gpu": 388.04386166926116,
"input_tput_per_gpu": 391.83878243363125,
"mean_ttft": 0.06222212784923613,
"median_ttft": 0.055868140421807766,
"std_ttft": 0.022474881456288762,
"p99_ttft": 0.12926251533441246,
"mean_tpot": 0.0049509072607844306,
"mean_intvty": 201.98318153137012,
"median_tpot": 0.004984565487617009,
"median_intvty": 200.61929218991443,
"std_tpot": 0.00012511376645198052,
"std_intvty": 7992.725567764011,
"p99_tpot": 0.0050959074954152,
"p99_intvty": 196.23590124029963,
"mean_itl": 0.0049592140286394,
"median_itl": 0.004892061930149794,
"std_itl": 0.0018589299020171726,
"p99_itl": 0.006766978418454528,
"mean_e2el": 4.600894627883099,
"median_e2el": 4.615590891800821,
"std_e2el": 0.331609658128154,
"p99_e2el": 5.169714205097407
},
{
"hw": "mi355x",
"conc": 4,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 1,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1822.74148502003,
"output_tput_per_gpu": 906.9359986662549,
"input_tput_per_gpu": 915.8054863537751,
"mean_ttft": 0.08504681540071032,
"median_ttft": 0.037817421994986944,
"std_ttft": 0.1413801979563726,
"p99_ttft": 0.5095247747108806,
"mean_tpot": 0.004193076280103525,
"mean_intvty": 238.48838733153468,
"median_tpot": 0.004224202581108221,
"median_intvty": 236.73107072853728,
"std_tpot": 9.673636699836318e-05,
"std_intvty": 10337.373947658387,
"p99_tpot": 0.004297221474811613,
"p99_intvty": 232.7085084772921,
"mean_itl": 0.004390007795598054,
"median_itl": 0.004129028515308164,
"std_itl": 0.0016197526158648968,
"p99_itl": 0.008463582906988449,
"mean_e2el": 3.9289374537227557,
"median_e2el": 3.9331917779927608,
"std_e2el": 0.29063928592256494,
"p99_e2el": 4.612413105488522
},
{
"hw": "b200-trt",
"conc": 16,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 2444.789095771863,
"output_tput_per_gpu": 1215.9225934229416,
"input_tput_per_gpu": 1228.8665023489211,
"mean_ttft": 0.03977690251194872,
"median_ttft": 0.03018027893267572,
"std_ttft": 0.022782798955277662,
"p99_ttft": 0.10823455539066344,
"mean_tpot": 0.0031783083812044707,
"mean_intvty": 314.6327794727817,
"median_tpot": 0.003207685053480974,
"median_intvty": 311.75130454743424,
"std_tpot": 0.00010423519806943097,
"std_intvty": 9593.688298398982,
"p99_tpot": 0.0032980024291687734,
"p99_intvty": 303.21384579817897,
"mean_itl": 0.0628775671128484,
"median_itl": 0.059378013014793396,
"std_itl": 0.011127572076417049,
"p99_itl": 0.10532031035050746,
"mean_e2el": 2.9490431909041943,
"median_e2el": 2.9566558836959302,
"std_e2el": 0.23151366966817244,
"p99_e2el": 3.3711315788002683
},
{
"hw": "b200-trt",
"conc": 4,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 816.3139966254022,
"output_tput_per_gpu": 406.17090017379246,
"input_tput_per_gpu": 410.1430964516097,
"mean_ttft": 0.031773178721778095,
"median_ttft": 0.028870083624497056,
"std_ttft": 0.009246667535557094,
"p99_ttft": 0.05990656932815909,
"mean_tpot": 0.0023570638669122798,
"mean_intvty": 424.25664150967015,
"median_tpot": 0.002373110415648697,
"median_intvty": 421.38789388215076,
"std_tpot": 4.833348662093769e-05,
"std_intvty": 20689.58955604928,
"p99_tpot": 0.0024176173404446845,
"p99_intvty": 413.6303885941127,
"mean_itl": 0.04653401700147881,
"median_itl": 0.04646766185760498,
"std_itl": 0.005915178383415054,
"p99_itl": 0.06395917939022183,
"mean_e2el": 2.1921147522400135,
"median_e2el": 2.1950971118640155,
"std_e2el": 0.14866745176852622,
"p99_e2el": 2.472352759055793
},
{
"hw": "h200",
"conc": 64,
"image": "nvcr.io#nvidia/tensorrt-llm/release:gpt-oss-dev",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 1,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 6063.791583570661,
"output_tput_per_gpu": 3031.212439403663,
"input_tput_per_gpu": 3032.5791441669976,
"mean_ttft": 0.21192497616684705,
"median_ttft": 0.09614164498634636,
"std_ttft": 0.34052643883013384,
"p99_ttft": 1.6337603713967839,
"mean_tpot": 0.020456969723249334,
"mean_intvty": 48.88309527405228,
"median_tpot": 0.020642605736106538,
"median_intvty": 48.44349656162221,
"std_tpot": 0.0009717959422369442,
"std_intvty": 1029.0226132228272,
"p99_tpot": 0.021648070442847514,
"p99_intvty": 46.19349344044648,
"mean_itl": 0.40481893669225133,
"median_itl": 0.3738277470692992,
"std_itl": 0.08157486761460342,
"p99_itl": 0.6622232122765854,
"mean_e2el": 19.058143936249326,
"median_e2el": 19.15571855357848,
"std_e2el": 1.5903458004234998,
"p99_e2el": 21.958249536426738
},
{
"hw": "mi300x",
"conc": 64,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 1,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 4303.703392102702,
"output_tput_per_gpu": 2151.366694889545,
"input_tput_per_gpu": 2152.336697213157,
"mean_ttft": 0.2900934357094229,
"median_ttft": 0.15946087799966335,
"std_ttft": 0.426828504321982,
"p99_ttft": 2.1523251597583295,
"mean_tpot": 0.028879410469293054,
"mean_intvty": 34.62674562083882,
"median_tpot": 0.0293128312909919,
"median_intvty": 34.11475302651194,
"std_tpot": 0.0015534647074687828,
"std_intvty": 643.7223808125008,
"p99_tpot": 0.03064007476261126,
"p99_intvty": 32.636996082668055,
"mean_itl": 0.02889231430244719,
"median_itl": 0.025754238478839397,
"std_itl": 0.014622115816816696,
"p99_itl": 0.0862069394439459,
"mean_e2el": 26.894542418440686,
"median_e2el": 27.13115595234558,
"std_e2el": 2.286433584332014,
"p99_e2el": 30.759613677710295
},
{
"hw": "mi300x",
"conc": 64,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 2235.205364318532,
"output_tput_per_gpu": 1117.350788127591,
"input_tput_per_gpu": 1117.8545761909409,
"mean_ttft": 0.17665632148564328,
"median_ttft": 0.11829727562144399,
"std_ttft": 0.21919032732422147,
"p99_ttft": 1.187189528280869,
"mean_tpot": 0.013778511688510114,
"mean_intvty": 72.57677916214267,
"median_tpot": 0.014017333487684996,
"median_intvty": 71.34024462488213,
"std_tpot": 0.0007730535389494044,
"std_intvty": 1293.5714664200625,
"p99_tpot": 0.01461733395816613,
"p99_intvty": 68.41192811643599,
"mean_itl": 0.013788983247692657,
"median_itl": 0.011828137561678886,
"std_itl": 0.010047106474183727,
"p99_itl": 0.07539001777768137,
"mean_e2el": 12.873745681506989,
"median_e2el": 12.966230806428939,
"std_e2el": 1.1568944515686155,
"p99_e2el": 14.950940526369958
},
{
"hw": "mi300x",
"conc": 8,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 378.44199989734636,
"output_tput_per_gpu": 189.92241942681895,
"input_tput_per_gpu": 188.51958047052742,
"mean_ttft": 0.05756633165874518,
"median_ttft": 0.0412320161703974,
"std_ttft": 0.04685617921427083,
"p99_ttft": 0.21018253975547846,
"mean_tpot": 0.005091825454788144,
"mean_intvty": 196.39322063949402,
"median_tpot": 0.0051055262606925144,
"median_intvty": 195.8661945780218,
"std_tpot": 9.119111687782789e-05,
"std_intvty": 10965.980396311374,
"p99_tpot": 0.005243164657316769,
"p99_intvty": 190.7245080706197,
"mean_itl": 0.005092564446553724,
"median_itl": 0.004963968647643924,
"std_itl": 0.001885437482566783,
"p99_itl": 0.005453545949421823,
"mean_e2el": 4.783847927715397,
"median_e2el": 4.848951022140682,
"std_e2el": 0.32350690593180875,
"p99_e2el": 5.3514591984357684
},
{
"hw": "h100",
"conc": 64,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 3108.3733395711342,
"output_tput_per_gpu": 1553.836375040863,
"input_tput_per_gpu": 1554.5369645302712,
"mean_ttft": 0.11967748561874032,
"median_ttft": 0.06063264515250921,
"std_ttft": 0.18782449364880877,
"p99_ttft": 0.993330803848803,
"mean_tpot": 0.00993009534922432,
"mean_intvty": 100.70396756845986,
"median_tpot": 0.01008288420004205,
"median_intvty": 99.1779713185469,
"std_tpot": 0.00048003802935114544,
"std_intvty": 2083.168288461798,
"p99_tpot": 0.010409006692914375,
"p99_intvty": 96.07064626836302,
"mean_itl": 0.009937018495864992,
"median_itl": 0.008775560185313225,
"std_itl": 0.005397391892889081,
"p99_itl": 0.029771881774067886,
"mean_e2el": 9.26886103359575,
"median_e2el": 9.328511987812817,
"std_e2el": 0.7858766399850974,
"p99_e2el": 10.684018194898965
},
{
"hw": "b200-trt",
"conc": 4,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 442.6563488001003,
"output_tput_per_gpu": 220.2511881494623,
"input_tput_per_gpu": 222.405160650638,
"mean_ttft": 0.03356184936128557,
"median_ttft": 0.03003144497051835,
"std_ttft": 0.009494864790928474,
"p99_ttft": 0.06250933577306568,
"mean_tpot": 0.0021659146734921983,
"mean_intvty": 461.69870504993463,
"median_tpot": 0.0021716386619433977,
"median_intvty": 460.4817631608662,
"std_tpot": 2.903948083058939e-05,
"std_intvty": 34435.87734346227,
"p99_tpot": 0.0022096080199233525,
"p99_intvty": 452.56895837782497,
"mean_itl": 0.04277207048115354,
"median_itl": 0.04226875165477395,
"std_itl": 0.005793974330468145,
"p99_itl": 0.06152370734140277,
"mean_e2el": 2.0192550661507993,
"median_e2el": 2.031422194559127,
"std_e2el": 0.1408628185065445,
"p99_e2el": 2.2704821038618683
},
{
"hw": "b200-trt",
"conc": 64,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 5774.795309920265,
"output_tput_per_gpu": 2886.746870698456,
"input_tput_per_gpu": 2888.048439221809,
"mean_ttft": 0.09707754490282242,
"median_ttft": 0.057446176491794176,
"std_ttft": 0.13063448362804925,
"p99_ttft": 0.5468677142515662,
"mean_tpot": 0.0053051318634365794,
"mean_intvty": 188.4967284021883,
"median_tpot": 0.005434214756501498,
"median_intvty": 184.01922721283682,
"std_tpot": 0.00039619553218305523,
"std_intvty": 2524.006251382884,
"p99_tpot": 0.005796456938107605,
"p99_intvty": 172.5191803678739,
"mean_itl": 0.10502884128767027,
"median_itl": 0.08440469499328174,
"std_itl": 0.03595157253536747,
"p99_itl": 0.21671091812837404,
"mean_e2el": 4.986662261672336,
"median_e2el": 5.039414613493136,
"std_e2el": 0.49534409098414867,
"p99_e2el": 5.88478660508612
},
{
"hw": "h100",
"conc": 16,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1324.4580445761196,
"output_tput_per_gpu": 658.7228580273219,
"input_tput_per_gpu": 665.7351865487977,
"mean_ttft": 0.06297958627110348,
"median_ttft": 0.042762892320752144,
"std_ttft": 0.06063905315549098,
"p99_ttft": 0.2995830394141376,
"mean_tpot": 0.005878028857498404,
"mean_intvty": 170.12505794767128,
"median_tpot": 0.005924824110043353,
"median_intvty": 168.78138176369978,
"std_tpot": 0.0001711143993457388,
"std_intvty": 5844.043539430527,
"p99_tpot": 0.006114509332885497,
"p99_intvty": 163.54542050033803,
"mean_itl": 0.005885052476545372,
"median_itl": 0.005608377978205681,
"std_itl": 0.0025759224712947624,
"p99_itl": 0.02451064478605974,
"mean_e2el": 5.4429472331074065,
"median_e2el": 5.426021173596382,
"std_e2el": 0.41190147622088125,
"p99_e2el": 6.160835990160703
},
{
"hw": "h200",
"conc": 64,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 2151.4931533935423,
"output_tput_per_gpu": 1075.504116521442,
"input_tput_per_gpu": 1075.9890368721003,
"mean_ttft": 0.0888276619501994,
"median_ttft": 0.04546400858089328,
"std_ttft": 0.13274697951150907,
"p99_ttft": 0.7169457670208067,
"mean_tpot": 0.007140376693562769,
"mean_intvty": 140.0486336948477,
"median_tpot": 0.0072130653622440085,
"median_intvty": 138.63731295634574,
"std_tpot": 0.0002621712924505747,
"std_intvty": 3814.3001495425847,
"p99_tpot": 0.0074458155308428295,
"p99_intvty": 134.30362273382897,
"mean_itl": 0.007149719053727423,
"median_itl": 0.006397398188710213,
"std_itl": 0.0036711461251351584,
"p99_itl": 0.02003995114937425,
"mean_e2el": 6.667239323667308,
"median_e2el": 6.689380411989987,
"std_e2el": 0.5274032205409093,
"p99_e2el": 7.7379434762336325
},
{
"hw": "h200",
"conc": 8,
"image": "nvcr.io#nvidia/tensorrt-llm/release:gpt-oss-dev",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 1,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1829.1741054487027,
"output_tput_per_gpu": 917.9773168779853,
"input_tput_per_gpu": 911.1967885707173,
"mean_ttft": 0.0850738839013502,
"median_ttft": 0.0604432737454772,
"std_ttft": 0.06269311336017995,
"p99_ttft": 0.2853093017451465,
"mean_tpot": 0.008466434913501339,
"mean_intvty": 118.11346927209115,
"median_tpot": 0.00849788077505078,
"median_intvty": 117.67639797159009,
"std_tpot": 0.000291628779412806,
"std_intvty": 3429.01685496712,
"p99_tpot": 0.009026013903366947,
"p99_intvty": 110.79087742452657,
"mean_itl": 0.1673751711381348,
"median_itl": 0.16569185256958008,
"std_itl": 0.022940922945342068,
"p99_itl": 0.24149851642549042,
"mean_e2el": 7.945430209138431,
"median_e2el": 8.02142412867397,
"std_e2el": 0.6094382510990963,
"p99_e2el": 9.001522548608481
},
{
"hw": "mi355x",
"conc": 32,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 1,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 6793.022222485577,
"output_tput_per_gpu": 3401.891647695733,
"input_tput_per_gpu": 3391.130574789844,
"mean_ttft": 0.13105118227549611,
"median_ttft": 0.05332874599844217,
"std_ttft": 0.23187326349359746,
"p99_ttft": 0.873203586199088,
"mean_tpot": 0.009074864464826728,
"mean_intvty": 110.19448322075779,
"median_tpot": 0.009131714051163777,
"median_intvty": 109.50846625257134,
"std_tpot": 0.0003178001225201166,
"std_intvty": 3146.631889472291,
"p99_tpot": 0.00959126797166342,
"p99_intvty": 104.2615014984895,
"mean_itl": 0.009076840977743165,
"median_itl": 0.008468348998576403,
"std_itl": 0.005971834017810221,
"p99_itl": 0.027974168001674123,
"mean_e2el": 8.506167110870319,
"median_e2el": 8.488911236010608,
"std_e2el": 0.7089560247710638,
"p99_e2el": 10.503079252130119
},
{
"hw": "h100",
"conc": 64,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 2083.644818956071,
"output_tput_per_gpu": 1041.5875953967864,
"input_tput_per_gpu": 1042.0572235592847,
"mean_ttft": 0.08230618546367623,
"median_ttft": 0.043980548158288,
"std_ttft": 0.1262151447149111,
"p99_ttft": 0.6334015774540603,
"mean_tpot": 0.007389984814516857,
"mean_intvty": 135.31827535499178,
"median_tpot": 0.007463216461181624,
"median_intvty": 133.99048589857912,
"std_tpot": 0.0002882964648263528,
"std_intvty": 3468.651620831777,
"p99_tpot": 0.007714241311445699,
"p99_intvty": 129.6303757721825,
"mean_itl": 0.007394160389058711,
"median_itl": 0.006645852699875832,
"std_itl": 0.003490280153369781,
"p99_itl": 0.019942887406796214,
"mean_e2el": 6.89094137926877,
"median_e2el": 6.915218213573098,
"std_e2el": 0.5504483737840518,
"p99_e2el": 7.957350925933571
},
{
"hw": "b200",
"conc": 4,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1118.4790890649836,
"output_tput_per_gpu": 556.5182764342069,
"input_tput_per_gpu": 561.9608126307767,
"mean_ttft": 0.03341413737216499,
"median_ttft": 0.032102871511597186,
"std_ttft": 0.005685398916507836,
"p99_ttft": 0.04957427273708163,
"mean_tpot": 0.0034525770600842055,
"mean_intvty": 289.6387198887346,
"median_tpot": 0.003462866971723366,
"median_intvty": 288.7780582291123,
"std_tpot": 4.5084861675213086e-05,
"std_intvty": 22180.39410221333,
"p99_tpot": 0.0035007101582553526,
"p99_intvty": 285.6563253720981,
"mean_itl": 0.003452897230135752,
"median_itl": 0.003417217027163133,
"std_itl": 0.000815360794126231,
"p99_itl": 0.003676487789489329,
"mean_e2el": 3.1984259208962613,
"median_e2el": 3.2247179139812943,
"std_e2el": 0.2190748779404474,
"p99_e2el": 3.5690018066373885
},
{
"hw": "b200-trt",
"conc": 64,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 1,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 12599.83318880548,
"output_tput_per_gpu": 6298.496669938069,
"input_tput_per_gpu": 6301.336518867411,
"mean_ttft": 0.08405005262761733,
"median_ttft": 0.04759459248452913,
"std_ttft": 0.11410061912123168,
"p99_ttft": 0.5518491027262644,
"mean_tpot": 0.009853574207210808,
"mean_intvty": 101.48601705036167,
"median_tpot": 0.009941471826224367,
"median_intvty": 100.58872745201815,
"std_tpot": 0.00039428887698954025,
"std_intvty": 2536.2115402168147,
"p99_tpot": 0.010265541282684778,
"p99_intvty": 97.41327539023514,
"mean_itl": 0.1949582642681052,
"median_itl": 0.18498067400651053,
"std_itl": 0.030269078527553218,
"p99_itl": 0.28186715915449895,
"mean_e2el": 9.160270920234506,
"median_e2el": 9.204064904013649,
"std_e2el": 0.7045756543725008,
"p99_e2el": 10.401681175304693
},
{
"hw": "b200-trt",
"conc": 128,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 1,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 19485.37715178575,
"output_tput_per_gpu": 9731.998757870213,
"input_tput_per_gpu": 9753.378393915536,
"mean_ttft": 0.11183877375224256,
"median_ttft": 0.0532843554392457,
"std_ttft": 0.19017217675510906,
"p99_ttft": 0.9088350282981993,
"mean_tpot": 0.012731874464674475,
"mean_intvty": 78.5430301543244,
"median_tpot": 0.0128911192246033,
"median_intvty": 77.57278344703022,
"std_tpot": 0.0005995892446236015,
"std_intvty": 1667.8084354694531,
"p99_tpot": 0.013369942261226264,
"p99_intvty": 74.79463863505735,
"mean_itl": 0.25184639775961165,
"median_itl": 0.23502787295728922,
"std_itl": 0.04940827368101555,
"p99_itl": 0.37498086184263224,
"mean_e2el": 11.816203212377513,
"median_e2el": 11.818341600010172,
"std_e2el": 0.9805070204574968,
"p99_e2el": 13.59356378538534
},
{
"hw": "h200",
"conc": 32,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1335.9670507668936,
"output_tput_per_gpu": 669.0417023157686,
"input_tput_per_gpu": 666.925348451125,
"mean_ttft": 0.06407950388966129,
"median_ttft": 0.03996612038463354,
"std_ttft": 0.07118432979400208,
"p99_ttft": 0.3699183241650462,
"mean_tpot": 0.005761450068111995,
"mean_intvty": 173.56741587238923,
"median_tpot": 0.005789639746684153,
"median_intvty": 172.72231844351296,
"std_tpot": 0.00014755325022672273,
"std_intvty": 6777.214317295292,
"p99_tpot": 0.005938363966595747,
"p99_intvty": 168.39654922216977,
"mean_itl": 0.005762615949911084,
"median_itl": 0.005395474843680859,
"std_itl": 0.0024380428201459785,
"p99_itl": 0.018466039188206204,
"mean_e2el": 5.381191056221724,
"median_e2el": 5.375404603313655,
"std_e2el": 0.3827672555316937,
"p99_e2el": 6.149981052372604
},
{
"hw": "h200",
"conc": 32,
"image": "nvcr.io#nvidia/tensorrt-llm/release:gpt-oss-dev",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 1670.5313321292383,
"output_tput_per_gpu": 836.588840703818,
"input_tput_per_gpu": 833.9424914254203,
"mean_ttft": 0.1471464057569392,
"median_ttft": 0.08443904994055629,
"std_ttft": 0.14412841550418926,
"p99_ttft": 0.6021225585229695,
"mean_tpot": 0.009198650609217133,
"mean_intvty": 108.71159722035652,
"median_tpot": 0.009325566935268162,
"median_intvty": 107.23208647166763,
"std_tpot": 0.0005924984604390484,
"std_intvty": 1687.7680986022954,
"p99_tpot": 0.010059723200499113,
"p99_intvty": 99.40631367972281,
"mean_itl": 0.18206555076159536,
"median_itl": 0.15106271766126156,
"std_itl": 0.06383845682516495,
"p99_itl": 0.4229519840329886,
"mean_e2el": 8.642211029259489,
"median_e2el": 8.688553277868778,
"std_e2el": 0.8284394983735118,
"p99_e2el": 10.212095683347433
},
{
"hw": "h200",
"conc": 32,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 2849.728790492925,
"output_tput_per_gpu": 1427.121574619069,
"input_tput_per_gpu": 1422.607215873856,
"mean_ttft": 0.11691036677511875,
"median_ttft": 0.06634425185620785,
"std_ttft": 0.14874367987387893,
"p99_ttft": 0.7904138263687491,
"mean_tpot": 0.010852529944880932,
"mean_intvty": 92.14441287689729,
"median_tpot": 0.010964475755287795,
"median_intvty": 91.20363091849,
"std_tpot": 0.0004779237666667095,
"std_intvty": 2092.383910878765,
"p99_tpot": 0.01142224293415836,
"p99_intvty": 87.54847938048032,
"mean_itl": 0.010857070085375377,
"median_itl": 0.009991343133151531,
"std_itl": 0.005975399014141829,
"p99_itl": 0.04266284964978695,
"mean_e2el": 10.134627037559403,
"median_e2el": 10.133675295393914,
"std_e2el": 0.811558028604555,
"p99_e2el": 11.538630820075053
},
{
"hw": "h200",
"conc": 4,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 480.6477869361264,
"output_tput_per_gpu": 239.1544737606336,
"input_tput_per_gpu": 241.4933131754928,
"mean_ttft": 0.0442640034481883,
"median_ttft": 0.03996308520436287,
"std_ttft": 0.017122698136543463,
"p99_ttft": 0.10424251141957938,
"mean_tpot": 0.004015945971599837,
"mean_intvty": 249.00733403084823,
"median_tpot": 0.00403544465628369,
"median_intvty": 247.80416662210334,
"std_tpot": 7.436949945277176e-05,
"std_intvty": 13446.37260379907,
"p99_tpot": 0.004092002229797025,
"p99_intvty": 244.37914347119084,
"mean_itl": 0.004016024193557151,
"median_itl": 0.0039696525782346725,
"std_itl": 0.001159950412438905,
"p99_itl": 0.004268577173352241,
"mean_e2el": 3.7258535858942197,
"median_e2el": 3.761431817896664,
"std_e2el": 0.264076283817283,
"p99_e2el": 4.181595285190269
},
{
"hw": "mi300x",
"conc": 16,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1590.686389614992,
"output_tput_per_gpu": 791.1322590272727,
"input_tput_per_gpu": 799.5541305877194,
"mean_ttft": 0.10361180019099266,
"median_ttft": 0.07208590302616358,
"std_ttft": 0.08842781581134496,
"p99_ttft": 0.45339922863524407,
"mean_tpot": 0.009801451540241653,
"mean_intvty": 102.02570465143015,
"median_tpot": 0.0098773918372127,
"median_intvty": 101.2413009912736,
"std_tpot": 0.0003097531041759417,
"std_intvty": 3228.377654714297,
"p99_tpot": 0.010306214164571349,
"p99_intvty": 97.02883949739768,
"mean_itl": 0.009805007552961539,
"median_itl": 0.009304794017225504,
"std_itl": 0.004746552884276571,
"p99_itl": 0.04270443571731448,
"mean_e2el": 9.074151789044844,
"median_e2el": 9.047985482495278,
"std_e2el": 0.7169629028697502,
"p99_e2el": 10.556291878712363
},
{
"hw": "h100",
"conc": 16,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 813.2421120666297,
"output_tput_per_gpu": 404.468205333112,
"input_tput_per_gpu": 408.7739067335177,
"mean_ttft": 0.04553340438287705,
"median_ttft": 0.03422716725617647,
"std_ttft": 0.03674377295304306,
"p99_ttft": 0.19010984424501656,
"mean_tpot": 0.004791454827529071,
"mean_intvty": 208.70487899719905,
"median_tpot": 0.00483588515611919,
"median_intvty": 206.78737557169418,
"std_tpot": 0.0001234757304440364,
"std_intvty": 8098.757516184411,
"p99_tpot": 0.004919601258304561,
"p99_intvty": 203.2685064286347,
"mean_itl": 0.004793241653376427,
"median_itl": 0.0046340785920619965,
"std_itl": 0.0016274903533592508,
"p99_itl": 0.016643524467944853,
"mean_e2el": 4.430840095481836,
"median_e2el": 4.426068661734462,
"std_e2el": 0.3328737711657596,
"p99_e2el": 5.0319521415233615
},
{
"hw": "mi300x",
"conc": 4,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 192.9646659445023,
"output_tput_per_gpu": 96.01284847793667,
"input_tput_per_gpu": 96.95181746656561,
"mean_ttft": 0.17265673263464124,
"median_ttft": 0.039941615890711546,
"std_ttft": 0.44594034503964913,
"p99_ttft": 1.7317876462917774,
"mean_tpot": 0.004866548462841776,
"mean_intvty": 205.48444295488616,
"median_tpot": 0.0048134972146922925,
"median_intvty": 207.74915937370622,
"std_tpot": 0.00035366153161054905,
"std_intvty": 2827.5622611429417,
"p99_tpot": 0.00625972119430101,
"p99_intvty": 159.7515239034004,
"mean_itl": 0.004863776741823926,
"median_itl": 0.00474671833217144,
"std_itl": 0.009213703954414731,
"p99_itl": 0.005255762599408627,
"mean_e2el": 4.631402292149142,
"median_e2el": 4.532530876342207,
"std_e2el": 0.6474830689158965,
"p99_e2el": 6.811301565617323
},
{
"hw": "b200",
"conc": 64,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 8596.421867323945,
"output_tput_per_gpu": 4297.2421692715925,
"input_tput_per_gpu": 4299.179698052352,
"mean_ttft": 0.067390430571686,
"median_ttft": 0.04238340584561229,
"std_ttft": 0.07583908656221115,
"p99_ttft": 0.4161999624059536,
"mean_tpot": 0.0071614845440692645,
"mean_intvty": 139.63585257307344,
"median_tpot": 0.007206936425890223,
"median_intvty": 138.75521315931087,
"std_tpot": 0.00025052893059148313,
"std_intvty": 3991.5549778584955,
"p99_tpot": 0.007519826691600383,
"p99_intvty": 132.98178814639385,
"mean_itl": 0.007164606156170143,
"median_itl": 0.006410015048459172,
"std_itl": 0.003433558137298941,
"p99_itl": 0.021899100719019775,
"mean_e2el": 6.664660402845039,
"median_e2el": 6.6920201575849205,
"std_e2el": 0.5020486444513507,
"p99_e2el": 7.626528558027931
},
{
"hw": "mi325x",
"conc": 16,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 960.2915404113971,
"output_tput_per_gpu": 477.60364371655265,
"input_tput_per_gpu": 482.6878966948445,
"mean_ttft": 0.24975964457189548,
"median_ttft": 0.055250032513868064,
"std_ttft": 0.5699332410366793,
"p99_ttft": 2.5699172115966213,
"mean_tpot": 0.007943532239831073,
"mean_intvty": 125.88858077338979,
"median_tpot": 0.007848729952790139,
"median_intvty": 127.40914848835011,
"std_tpot": 0.0004114202564289433,
"std_intvty": 2430.604678242698,
"p99_tpot": 0.009197266102529569,
"p99_intvty": 108.72796207613968,
"mean_itl": 0.0079430108604305,
"median_itl": 0.007477378007024527,
"std_itl": 0.011416839145424504,
"p99_itl": 0.02555898295715449,
"mean_e2el": 7.516770313523011,
"median_e2el": 7.4026557435281575,
"std_e2el": 0.9250704483265453,
"p99_e2el": 10.799798602478113
},
{
"hw": "b200",
"conc": 32,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 3409.8948928116342,
"output_tput_per_gpu": 1707.6483154992156,
"input_tput_per_gpu": 1702.2465773124186,
"mean_ttft": 0.0416878104369971,
"median_ttft": 0.030485301977023482,
"std_ttft": 0.03459780741710905,
"p99_ttft": 0.17238684496842324,
"mean_tpot": 0.004519100120066892,
"mean_intvty": 221.28299294798492,
"median_tpot": 0.004540954249519365,
"median_intvty": 220.2180301873432,
"std_tpot": 0.00011061444883897203,
"std_intvty": 9040.41027638043,
"p99_tpot": 0.0046465194411727615,
"p99_intvty": 215.21485332419147,
"mean_itl": 0.004520171430204253,
"median_itl": 0.004270574077963829,
"std_itl": 0.001659145698192518,
"p99_itl": 0.013718095347285273,
"mean_e2el": 4.211969591017987,
"median_e2el": 4.199791441904381,
"std_e2el": 0.29246812977438796,
"p99_e2el": 4.775670764553361
},
{
"hw": "b200-trt",
"conc": 128,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 1,
"dp_attention": "true",
"tput_per_gpu": 18470.698679946858,
"output_tput_per_gpu": 9225.216181856857,
"input_tput_per_gpu": 9245.482498090001,
"mean_ttft": 0.1485691556459642,
"median_ttft": 0.10492504294961691,
"std_ttft": 0.12752582800690018,
"p99_ttft": 0.6648889052448795,
"mean_tpot": 0.006538302975539189,
"mean_intvty": 152.94488550640065,
"median_tpot": 0.0065679520784519595,
"median_intvty": 152.2544604551524,
"std_tpot": 0.00014506661809914897,
"std_intvty": 6893.384660808236,
"p99_tpot": 0.006687160267251338,
"p99_intvty": 149.54030710124368,
"mean_itl": 0.12932629322634306,
"median_itl": 0.12242581136524677,
"std_itl": 0.02126095126562136,
"p99_itl": 0.17934575935825706,
"mean_e2el": 6.158907486472526,
"median_e2el": 6.153879202203825,
"std_e2el": 0.44091868266114037,
"p99_e2el": 7.053785444018431
},
{
"hw": "b200",
"conc": 16,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 2069.606114440012,
"output_tput_per_gpu": 1029.324303837088,
"input_tput_per_gpu": 1040.2818106029238,
"mean_ttft": 0.03538799689904408,
"median_ttft": 0.026959219481796026,
"std_ttft": 0.024226846353100715,
"p99_ttft": 0.13741304295457668,
"mean_tpot": 0.003761543989010653,
"mean_intvty": 265.8482801002724,
"median_tpot": 0.003771282736369447,
"median_intvty": 265.1617685293688,
"std_tpot": 5.7818042228765475e-05,
"std_intvty": 17295.63924083342,
"p99_tpot": 0.0038402638455326517,
"p99_intvty": 260.39877472567207,
"mean_itl": 0.0037628044242821097,
"median_itl": 0.003638654015958309,
"std_itl": 0.0011475618592218305,
"p99_itl": 0.012443348205997608,
"mean_e2el": 3.477577855441814,
"median_e2el": 3.4750734940025723,
"std_e2el": 0.246466037836642,
"p99_e2el": 3.9097689787996934
},
{
"hw": "mi325x",
"conc": 32,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1082.548345792989,
"output_tput_per_gpu": 542.1316249474144,
"input_tput_per_gpu": 540.4167208455747,
"mean_ttft": 0.22197412862369675,
"median_ttft": 0.050247508042957634,
"std_ttft": 0.49073397893673254,
"p99_ttft": 1.629531470873626,
"mean_tpot": 0.006962056980674977,
"mean_intvty": 143.63571036200412,
"median_tpot": 0.006858039442964926,
"median_intvty": 145.81426781174525,
"std_tpot": 0.0005243611927643491,
"std_intvty": 1907.0823962546856,
"p99_tpot": 0.008605183031084951,
"p99_intvty": 116.20903313591911,
"mean_itl": 0.006964210775216959,
"median_itl": 0.006198632065206766,
"std_itl": 0.012559605515812472,
"p99_itl": 0.020478454581461854,
"mean_e2el": 6.647785803938677,
"median_e2el": 6.450334930443205,
"std_e2el": 1.0331581487220913,
"p99_e2el": 10.210674027154456
},
{
"hw": "mi300x",
"conc": 32,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 2300.972117994214,
"output_tput_per_gpu": 1152.3085856947373,
"input_tput_per_gpu": 1148.6635322994766,
"mean_ttft": 0.35225916631752624,
"median_ttft": 0.0827213553711772,
"std_ttft": 0.7851348170687767,
"p99_ttft": 4.097007054276768,
"mean_tpot": 0.013234911151203791,
"mean_intvty": 75.55774183712931,
"median_tpot": 0.013193516886993484,
"median_intvty": 75.79480198989447,
"std_tpot": 0.0007361735421761524,
"std_intvty": 1358.3753594892426,
"p99_tpot": 0.015337228994400137,
"p99_intvty": 65.20082606611113,
"mean_itl": 0.013240587794458414,
"median_itl": 0.011910110712051392,
"std_itl": 0.021779698880498414,
"p99_itl": 0.04591701291501522,
"mean_e2el": 12.569225041219033,
"median_e2el": 12.323776661418378,
"std_e2el": 1.5602564562028431,
"p99_e2el": 17.851580222770572
},
{
"hw": "b200",
"conc": 128,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 13024.595062136772,
"output_tput_per_gpu": 6505.152144558875,
"input_tput_per_gpu": 6519.442917577897,
"mean_ttft": 0.09536730952495417,
"median_ttft": 0.05258618202060461,
"std_ttft": 0.14507192523832738,
"p99_ttft": 0.7550483654579165,
"mean_tpot": 0.009456405050475673,
"mean_intvty": 105.74843131848485,
"median_tpot": 0.009534727286017902,
"median_intvty": 104.87976949969395,
"std_tpot": 0.0003213049750598306,
"std_intvty": 3112.307862067149,
"p99_tpot": 0.00986397054058091,
"p99_intvty": 101.37905378832447,
"mean_itl": 0.009461157220907646,
"median_itl": 0.008179468102753162,
"std_itl": 0.004250043019284761,
"p99_itl": 0.026947249192744493,
"mean_e2el": 8.79017805372423,
"median_e2el": 8.766769763082266,
"std_e2el": 0.6857811016063394,
"p99_e2el": 10.087665208554828
},
{
"hw": "mi355x",
"conc": 128,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 1,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 15470.369884262402,
"output_tput_per_gpu": 7726.697785966975,
"input_tput_per_gpu": 7743.672098295427,
"mean_ttft": 0.24864428000769295,
"median_ttft": 0.08338191649818327,
"std_ttft": 0.5408402993373919,
"p99_ttft": 2.5164705246762606,
"mean_tpot": 0.015943068562178697,
"mean_intvty": 62.72318256049356,
"median_tpot": 0.016212572694485512,
"median_intvty": 61.680525283944384,
"std_tpot": 0.0008991920572367116,
"std_intvty": 1112.1094675514364,
"p99_tpot": 0.016823640832647545,
"p99_intvty": 59.44016577311996,
"mean_itl": 0.016103085443386563,
"median_itl": 0.013609938992885873,
"std_itl": 0.009659541695855877,
"p99_itl": 0.04326803875039332,
"mean_e2el": 14.909081870896125,
"median_e2el": 14.889266546510044,
"std_e2el": 1.404033413466267,
"p99_e2el": 18.120928058772698
},
{
"hw": "b200-trt",
"conc": 64,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 1,
"dp_attention": "true",
"tput_per_gpu": 12537.516455913343,
"output_tput_per_gpu": 6267.345326208275,
"input_tput_per_gpu": 6270.171129705068,
"mean_ttft": 0.12325886216131039,
"median_ttft": 0.09574346081353724,
"std_ttft": 0.07585378064376959,
"p99_ttft": 0.37227993815205995,
"mean_tpot": 0.004800954248577439,
"mean_intvty": 208.2919245265268,
"median_tpot": 0.004615029072880875,
"median_intvty": 216.6833586978386,
"std_tpot": 0.0003876639129895024,
"std_intvty": 2579.554006686919,
"p99_tpot": 0.005483307060496561,
"p99_intvty": 182.37169448421176,
"mean_itl": 0.0950130229916803,
"median_itl": 0.10091885924339294,
"std_itl": 0.014870205574147423,
"p99_itl": 0.1373905152920634,
"mean_e2el": 4.546560322691948,
"median_e2el": 4.4813417724799365,
"std_e2el": 0.5030727001581198,
"p99_e2el": 5.779500857936219
},
{
"hw": "b200-trt",
"conc": 8,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1408.4683714512335,
"output_tput_per_gpu": 706.8446971127147,
"input_tput_per_gpu": 701.6236743385188,
"mean_ttft": 0.045559974263596814,
"median_ttft": 0.0384362754994072,
"std_ttft": 0.016887328791158073,
"p99_ttft": 0.09552384694950887,
"mean_tpot": 0.0027252983164160484,
"mean_intvty": 366.9323075482862,
"median_tpot": 0.002724332720574693,
"median_intvty": 367.0623607930869,
"std_tpot": 6.349389603754662e-05,
"std_intvty": 15749.545427306235,
"p99_tpot": 0.0028320918584187472,
"p99_intvty": 353.0958916559768,
"mean_itl": 0.05387604231032231,
"median_itl": 0.05158175399992615,
"std_itl": 0.010134579097903522,
"p99_itl": 0.09527979360660538,
"mean_e2el": 2.5757134227871576,
"median_e2el": 2.5781639920023736,
"std_e2el": 0.18504631498627774,
"p99_e2el": 2.8908089425152865
},
{
"hw": "h200",
"conc": 32,
"image": "nvcr.io#nvidia/tensorrt-llm/release:gpt-oss-dev",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 2,
"dp_attention": "false",
"tput_per_gpu": 2785.955442372296,
"output_tput_per_gpu": 1395.1843877217507,
"input_tput_per_gpu": 1390.7710546505455,
"mean_ttft": 0.13324617224570828,
"median_ttft": 0.06988735590130091,
"std_ttft": 0.1621246097845053,
"p99_ttft": 0.691197894057259,
"mean_tpot": 0.011078252126514136,
"mean_intvty": 90.26694722055024,
"median_tpot": 0.01116804223990596,
"median_intvty": 89.54120861280164,
"std_tpot": 0.0005000206172366596,
"std_intvty": 1999.917534453785,
"p99_tpot": 0.011748584943575583,
"p99_intvty": 85.11663360333661,
"mean_itl": 0.21916313203685792,
"median_itl": 0.20198562927544117,
"std_itl": 0.046598903094229656,
"p99_itl": 0.3757334847934547,
"mean_e2el": 10.359260784473737,
"median_e2el": 10.358484098687768,
"std_e2el": 0.8171457530619717,
"p99_e2el": 11.954826805526391
},
{
"hw": "b200",
"conc": 4,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 643.3174022028867,
"output_tput_per_gpu": 320.09350498754037,
"input_tput_per_gpu": 323.2238972153463,
"mean_ttft": 0.02565773722017184,
"median_ttft": 0.024550606729462743,
"std_ttft": 0.005461557216335842,
"p99_ttft": 0.04841612218413502,
"mean_tpot": 0.003000171409814411,
"mean_intvty": 333.31428888653386,
"median_tpot": 0.0030065527397333397,
"median_intvty": 332.6068379857168,
"std_tpot": 2.9378576133107092e-05,
"std_intvty": 34038.40933165877,
"p99_tpot": 0.0030307327513409764,
"p99_intvty": 329.95321001415925,
"mean_itl": 0.0030000449905005938,
"median_itl": 0.002977055963128805,
"std_itl": 0.0005487613259344647,
"p99_itl": 0.0031792698986828328,
"mean_e2el": 2.7758738299366086,
"median_e2el": 2.793255462544039,
"std_e2el": 0.18914397807173883,
"p99_e2el": 3.095658089867793
},
{
"hw": "b200-trt",
"conc": 64,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "true",
"tput_per_gpu": 7612.714775612072,
"output_tput_per_gpu": 3805.499481213903,
"input_tput_per_gpu": 3807.215294398169,
"mean_ttft": 0.12797423539304872,
"median_ttft": 0.12351650791242719,
"std_ttft": 0.05598623308025791,
"p99_ttft": 0.24748011482413865,
"mean_tpot": 0.003910473678685385,
"mean_intvty": 255.7234959669075,
"median_tpot": 0.004003324604530843,
"median_intvty": 249.79238477645052,
"std_tpot": 0.00019934632395379238,
"std_intvty": 5016.395487843537,
"p99_tpot": 0.004129296670464041,
"p99_intvty": 242.1719919406087,
"mean_itl": 0.07736544630362556,
"median_itl": 0.07658705487847328,
"std_itl": 0.011334932667660782,
"p99_itl": 0.10595035917125656,
"mean_e2el": 3.7296982907209895,
"median_e2el": 3.741765610408038,
"std_e2el": 0.30360766250813104,
"p99_e2el": 4.2972311748564245
},
{
"hw": "mi300x",
"conc": 8,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 641.4741444164059,
"output_tput_per_gpu": 321.9260059410936,
"input_tput_per_gpu": 319.5481384753123,
"mean_ttft": 0.2123976057744585,
"median_ttft": 0.04356789821758866,
"std_ttft": 0.4801158397858313,
"p99_ttft": 1.7737721646390856,
"mean_tpot": 0.0058618627837026946,
"mean_intvty": 170.5942354673034,
"median_tpot": 0.005847554141175337,
"median_intvty": 171.01167015428499,
"std_tpot": 0.00024562736565946885,
"std_intvty": 4071.2076087905166,
"p99_tpot": 0.006472033039158277,
"p99_intvty": 154.5109541236297,
"mean_itl": 0.0058641250875119426,
"median_itl": 0.0056395637802779675,
"std_itl": 0.00548106223966618,
"p99_itl": 0.0064249765593558565,
"mean_e2el": 5.6547453426057475,
"median_e2el": 5.506821861490607,
"std_e2el": 0.7564983225568345,
"p99_e2el": 7.818446509558706
},
{
"hw": "b200-trt",
"conc": 32,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 5849.077763522226,
"output_tput_per_gpu": 2929.17174988549,
"input_tput_per_gpu": 2919.9060136367357,
"mean_ttft": 0.06963956666604645,
"median_ttft": 0.04219375547836535,
"std_ttft": 0.07028351342871317,
"p99_ttft": 0.2960839060720173,
"mean_tpot": 0.0052669125288323515,
"mean_intvty": 189.86455433344648,
"median_tpot": 0.0053366052016970665,
"median_intvty": 187.38504390056718,
"std_tpot": 0.00025072314900673377,
"std_intvty": 3988.4629878078895,
"p99_tpot": 0.005587742635122765,
"p99_intvty": 178.9631458174755,
"mean_itl": 0.10420796838967435,
"median_itl": 0.09472633001860231,
"std_itl": 0.022929823621085102,
"p99_itl": 0.17785496839496773,
"mean_e2el": 4.931918038630829,
"median_e2el": 4.9333686440077145,
"std_e2el": 0.40404016770375345,
"p99_e2el": 5.663353297397261
},
{
"hw": "h200",
"conc": 4,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 270.97551089646316,
"output_tput_per_gpu": 134.82846997706963,
"input_tput_per_gpu": 136.14704091939353,
"mean_ttft": 0.03377551632001996,
"median_ttft": 0.030460181646049023,
"std_ttft": 0.012183262502175864,
"p99_ttft": 0.07653950670734049,
"mean_tpot": 0.003565823417198406,
"mean_intvty": 280.44013485829856,
"median_tpot": 0.0035723108106155803,
"median_intvty": 279.93084952976983,
"std_tpot": 5.8137810680626486e-05,
"std_intvty": 17200.510103371234,
"p99_tpot": 0.0036245316096350953,
"p99_intvty": 275.8977180228472,
"mean_itl": 0.003566064566943555,
"median_itl": 0.0035360194742679596,
"std_itl": 0.0007245493071123533,
"p99_itl": 0.003764563575387001,
"mean_e2el": 3.302875915914774,
"median_e2el": 3.3218971313908696,
"std_e2el": 0.23120413818230282,
"p99_e2el": 3.6808650334738195
},
{
"hw": "b200-trt",
"conc": 32,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "true",
"tput_per_gpu": 4328.720901711833,
"output_tput_per_gpu": 2167.7890927539493,
"input_tput_per_gpu": 2160.9318089578837,
"mean_ttft": 0.1176058187149465,
"median_ttft": 0.12368138250894845,
"std_ttft": 0.04388758186887595,
"p99_ttft": 0.21955395505763592,
"mean_tpot": 0.00345503620764183,
"mean_intvty": 289.4325673890785,
"median_tpot": 0.0034928517477451986,
"median_intvty": 286.2990107282817,
"std_tpot": 0.00010172794127082154,
"std_intvty": 9830.140937756581,
"p99_tpot": 0.0035610166412153744,
"p99_intvty": 280.8186820656643,
"mean_itl": 0.06832996811049712,
"median_itl": 0.06638291524723172,
"std_itl": 0.009905570138439575,
"p99_itl": 0.09154982781037689,
"mean_e2el": 3.305839276514598,
"median_e2el": 3.296166477026418,
"std_e2el": 0.23461558200166507,
"p99_e2el": 3.7513617239799357
},
{
"hw": "mi300x",
"conc": 4,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 327.3836933987623,
"output_tput_per_gpu": 162.8953196927927,
"input_tput_per_gpu": 164.48837370596962,
"mean_ttft": 0.05454268135363236,
"median_ttft": 0.0470407351385802,
"std_ttft": 0.026273024347090817,
"p99_ttft": 0.14682704955805093,
"mean_tpot": 0.0058937068786321465,
"mean_intvty": 169.67250333156832,
"median_tpot": 0.005911423858758415,
"median_intvty": 169.163982129008,
"std_tpot": 6.99535659833834e-05,
"std_intvty": 14295.196905866638,
"p99_tpot": 0.005982048741631711,
"p99_intvty": 167.1668090967831,
"mean_itl": 0.005893373568116503,
"median_itl": 0.0058342632837593555,
"std_itl": 0.0013949341718442275,
"p99_itl": 0.006104584354907273,
"mean_e2el": 5.457145299669355,
"median_e2el": 5.496456290129572,
"std_e2el": 0.37427952948479143,
"p99_e2el": 6.121350273909047
},
{
"hw": "mi325x",
"conc": 64,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1610.7457443710637,
"output_tput_per_gpu": 805.1913509499309,
"input_tput_per_gpu": 805.5543934211328,
"mean_ttft": 0.18940948997005763,
"median_ttft": 0.07698445697315037,
"std_ttft": 0.3214821230961111,
"p99_ttft": 1.303624804179417,
"mean_tpot": 0.009474149229404141,
"mean_intvty": 105.5503745810106,
"median_tpot": 0.009299655582363349,
"median_intvty": 107.53086403506,
"std_tpot": 0.0009244011600921652,
"std_intvty": 1081.7814204174056,
"p99_tpot": 0.012504634476338186,
"p99_intvty": 79.97035034428583,
"mean_itl": 0.009488971538109766,
"median_itl": 0.007856787997297943,
"std_itl": 0.01661203188054101,
"p99_itl": 0.045405082241632054,
"mean_e2el": 8.92698760992189,
"median_e2el": 8.72654900350608,
"std_e2el": 1.3465847086287548,
"p99_e2el": 13.447103659149727
},
{
"hw": "h200",
"conc": 8,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 484.28696230872856,
"output_tput_per_gpu": 243.041077902261,
"input_tput_per_gpu": 241.24588440646755,
"mean_ttft": 0.03807279649190605,
"median_ttft": 0.031890262849628925,
"std_ttft": 0.022349892816795495,
"p99_ttft": 0.11070307087153196,
"mean_tpot": 0.0039898566326489904,
"mean_intvty": 250.63557216993752,
"median_tpot": 0.0039927702039781535,
"median_intvty": 250.45268044819127,
"std_tpot": 6.208622362401652e-05,
"std_intvty": 16106.632705764612,
"p99_tpot": 0.0040820092093729445,
"p99_intvty": 244.9773992924466,
"mean_itl": 0.003991253700427667,
"median_itl": 0.003931056708097458,
"std_itl": 0.0010720186325185591,
"p99_itl": 0.00445046573877334,
"mean_e2el": 3.7415070842020213,
"median_e2el": 3.7334742741659284,
"std_e2el": 0.2495803809423739,
"p99_e2el": 4.17223551934585
},
{
"hw": "h100",
"conc": 8,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1225.5562093155875,
"output_tput_per_gpu": 615.049599357139,
"input_tput_per_gpu": 610.5066099584485,
"mean_ttft": 0.07248451577033847,
"median_ttft": 0.05646981671452522,
"std_ttft": 0.051172286002391404,
"p99_ttft": 0.2379119606129825,
"mean_tpot": 0.006305936140040852,
"mean_intvty": 158.58073691078033,
"median_tpot": 0.006282765473695805,
"median_intvty": 159.16557830890272,
"std_tpot": 0.00019733012250243534,
"std_intvty": 5067.650023820659,
"p99_tpot": 0.0067830572310691815,
"p99_intvty": 147.42614811203276,
"mean_itl": 0.006313086282829815,
"median_itl": 0.0060897208750247955,
"std_itl": 0.0027506527690082634,
"p99_itl": 0.010151445493102073,
"mean_e2el": 5.925109858973883,
"median_e2el": 5.922360653989017,
"std_e2el": 0.4195272324406317,
"p99_e2el": 6.709199985582381
},
{
"hw": "h200",
"conc": 8,
"image": "nvcr.io#nvidia/tensorrt-llm/release:gpt-oss-dev",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 2,
"dp_attention": "false",
"tput_per_gpu": 1129.504311000658,
"output_tput_per_gpu": 566.8456237850341,
"input_tput_per_gpu": 562.6586872156238,
"mean_ttft": 0.08429896054731216,
"median_ttft": 0.0627355370670557,
"std_ttft": 0.05578428186347724,
"p99_ttft": 0.25923565020551903,
"mean_tpot": 0.00683566148816592,
"mean_intvty": 146.29162104226882,
"median_tpot": 0.0068767162091288355,
"median_intvty": 145.41824463724427,
"std_tpot": 0.00018714554824602274,
"std_intvty": 5343.434612109466,
"p99_tpot": 0.007069013219528471,
"p99_intvty": 141.46246002729978,
"mean_itl": 0.13514088789534814,
"median_itl": 0.13187522604130208,
"std_itl": 0.021617930605885217,
"p99_itl": 0.21439766180701553,
"mean_e2el": 6.4308527613291515,
"median_e2el": 6.5164619439747185,
"std_e2el": 0.4763407096403654,
"p99_e2el": 7.2440110907657065
},
{
"hw": "b200-trt",
"conc": 4,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1432.5117330308267,
"output_tput_per_gpu": 712.7705546149696,
"input_tput_per_gpu": 719.7411784158571,
"mean_ttft": 0.04591177837210125,
"median_ttft": 0.03865352150751278,
"std_ttft": 0.022574162790805322,
"p99_ttft": 0.11436678228084929,
"mean_tpot": 0.002678250377276791,
"mean_intvty": 373.37808611336277,
"median_tpot": 0.002694954971570956,
"median_intvty": 371.06371369799746,
"std_tpot": 5.949607522239858e-05,
"std_intvty": 16807.8313781533,
"p99_tpot": 0.002743099440455185,
"p99_intvty": 364.5511297374118,
"mean_itl": 0.05288149878945263,
"median_itl": 0.05249050000566058,
"std_itl": 0.007516882988786539,
"p99_itl": 0.07789951339480469,
"mean_e2el": 2.500935184398986,
"median_e2el": 2.497745867498452,
"std_e2el": 0.17627860134736137,
"p99_e2el": 2.843540700580925
},
{
"hw": "h100",
"conc": 8,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 487.0078866714236,
"output_tput_per_gpu": 244.4065831532125,
"input_tput_per_gpu": 242.6013035182111,
"mean_ttft": 0.04071565293706954,
"median_ttft": 0.032012869603931904,
"std_ttft": 0.026045159643828253,
"p99_ttft": 0.12479181338101625,
"mean_tpot": 0.003963608322711201,
"mean_intvty": 252.29536285663477,
"median_tpot": 0.003975176879528844,
"median_intvty": 251.5611330780643,
"std_tpot": 7.159638225143532e-05,
"std_intvty": 13967.186169940209,
"p99_tpot": 0.004062423637343854,
"p99_intvty": 246.1584731851927,
"mean_itl": 0.003979784043771736,
"median_itl": 0.0038815736770629883,
"std_itl": 0.0012129149653554944,
"p99_itl": 0.007680247742682699,
"mean_e2el": 3.7200257753720507,
"median_e2el": 3.768243782222271,
"std_e2el": 0.2536902766644461,
"p99_e2el": 4.152561255693436
},
{
"hw": "mi300x",
"conc": 64,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 3563.8847081956706,
"output_tput_per_gpu": 1781.5407259960498,
"input_tput_per_gpu": 1782.3439821996208,
"mean_ttft": 0.20441106481011956,
"median_ttft": 0.11913519306108356,
"std_ttft": 0.2855072038131312,
"p99_ttft": 1.5041818937566132,
"mean_tpot": 0.01736557048086549,
"mean_intvty": 57.58520868069752,
"median_tpot": 0.017604668782901782,
"median_intvty": 56.80311355651473,
"std_tpot": 0.000906927914598121,
"std_intvty": 1102.623465331444,
"p99_tpot": 0.01842713565579664,
"p99_intvty": 54.26779390346699,
"mean_itl": 0.017374540464317027,
"median_itl": 0.015168017707765102,
"std_itl": 0.010460472901262995,
"p99_itl": 0.07333389464765787,
"mean_e2el": 16.203131922744795,
"median_e2el": 16.312475997023284,
"std_e2el": 1.3833834823962003,
"p99_e2el": 18.74021595238708
},
{
"hw": "mi325x",
"conc": 32,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1666.2747075163102,
"output_tput_per_gpu": 834.457156860629,
"input_tput_per_gpu": 831.8175506556812,
"mean_ttft": 0.08844745619571767,
"median_ttft": 0.05288012698292732,
"std_ttft": 0.10615305689537698,
"p99_ttft": 0.5565663786930966,
"mean_tpot": 0.009260872810463422,
"mean_intvty": 107.98118281790323,
"median_tpot": 0.009354449174353456,
"median_intvty": 106.90100308007885,
"std_tpot": 0.0002862674970901593,
"std_intvty": 3493.2362568742906,
"p99_tpot": 0.00955624386867804,
"p99_intvty": 104.64362502066774,
"mean_itl": 0.009262604046332239,
"median_itl": 0.008733599912375212,
"std_itl": 0.0038922137852916107,
"p99_itl": 0.026930421963334086,
"mean_e2el": 8.634965246135835,
"median_e2el": 8.618942907080054,
"std_e2el": 0.62789533776747,
"p99_e2el": 9.866723466771655
},
{
"hw": "h100",
"conc": 8,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 807.1754337051088,
"output_tput_per_gpu": 405.0837679558545,
"input_tput_per_gpu": 402.09166574925433,
"mean_ttft": 0.04900623455177992,
"median_ttft": 0.03959863539785147,
"std_ttft": 0.02978421376812613,
"p99_ttft": 0.14521487772464753,
"mean_tpot": 0.004790165844969869,
"mean_intvty": 208.76103925505947,
"median_tpot": 0.004785506450160575,
"median_intvty": 208.96429885000902,
"std_tpot": 0.00011331951527073712,
"std_intvty": 8824.605343667874,
"p99_tpot": 0.005039482674529125,
"p99_intvty": 198.43306636497908,
"mean_itl": 0.004791798210578163,
"median_itl": 0.0046480754390358925,
"std_itl": 0.0016880632000589682,
"p99_itl": 0.005124961566179992,
"mean_e2el": 4.495675040828064,
"median_e2el": 4.515464562922716,
"std_e2el": 0.3171558840135282,
"p99_e2el": 5.0403249395079905
},
{
"hw": "mi325x",
"conc": 4,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 217.1478869035566,
"output_tput_per_gpu": 108.04562099763693,
"input_tput_per_gpu": 109.10226590591968,
"mean_ttft": 0.04031090112403035,
"median_ttft": 0.03278668178245425,
"std_ttft": 0.023337504991686603,
"p99_ttft": 0.12208298887591809,
"mean_tpot": 0.004440737324110528,
"mean_intvty": 225.18782963599372,
"median_tpot": 0.004448293502538789,
"median_intvty": 224.80531004288875,
"std_tpot": 3.808460536280071e-05,
"std_intvty": 26257.328662692515,
"p99_tpot": 0.004501803062485814,
"p99_intvty": 222.13321776182235,
"mean_itl": 0.004443009005582992,
"median_itl": 0.0043966323137283325,
"std_itl": 0.0012155162726506339,
"p99_itl": 0.007641348000615836,
"mean_e2el": 4.111550969409291,
"median_e2el": 4.145459256367758,
"std_e2el": 0.2818572640466106,
"p99_e2el": 4.589999017585069
},
{
"hw": "mi300x",
"conc": 4,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 628.3694371297934,
"output_tput_per_gpu": 312.65589096327557,
"input_tput_per_gpu": 315.7135461665178,
"mean_ttft": 0.07366733320523053,
"median_ttft": 0.06391416257247329,
"std_ttft": 0.03385224682391325,
"p99_ttft": 0.19282620271667839,
"mean_tpot": 0.006142378390202072,
"mean_intvty": 162.80338599704896,
"median_tpot": 0.006160081376515494,
"median_intvty": 162.3355178086396,
"std_tpot": 0.0001965902237022331,
"std_intvty": 5086.722936511114,
"p99_tpot": 0.006384717562035095,
"p99_intvty": 156.623999461811,
"mean_itl": 0.006142271271958326,
"median_itl": 0.0060949428007006645,
"std_itl": 0.0019774053993011874,
"p99_itl": 0.006425726227462292,
"mean_e2el": 5.70444061064627,
"median_e2el": 5.685530324932188,
"std_e2el": 0.4251781550493429,
"p99_e2el": 6.438216315452009
},
{
"hw": "h100",
"conc": 32,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 2770.183762095209,
"output_tput_per_gpu": 1387.2860553378007,
"input_tput_per_gpu": 1382.8977067574085,
"mean_ttft": 0.1244314091047272,
"median_ttft": 0.07169304881244898,
"std_ttft": 0.16610501396205113,
"p99_ttft": 0.8645944510400295,
"mean_tpot": 0.011158861462984584,
"mean_intvty": 89.61487722713755,
"median_tpot": 0.011189171830358535,
"median_intvty": 89.37211932761576,
"std_tpot": 0.0004264202020017081,
"std_intvty": 2345.1046533578497,
"p99_tpot": 0.01176135300287474,
"p99_intvty": 85.02423146006905,
"mean_itl": 0.011164684117460893,
"median_itl": 0.010258030146360397,
"std_itl": 0.0059684269270748914,
"p99_itl": 0.04300737980753183,
"mean_e2el": 10.423817431920906,
"median_e2el": 10.444497598335147,
"std_e2el": 0.8144450453423778,
"p99_e2el": 12.096689118146896
},
{
"hw": "mi325x",
"conc": 64,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 2649.910881770307,
"output_tput_per_gpu": 1324.6568120673078,
"input_tput_per_gpu": 1325.254069702999,
"mean_ttft": 0.29322351307309874,
"median_ttft": 0.08139939996181056,
"std_ttft": 0.6344666849756011,
"p99_ttft": 2.8963434792414775,
"mean_tpot": 0.011426097025583387,
"mean_intvty": 87.5189487504761,
"median_tpot": 0.011332482767319548,
"median_intvty": 88.24191666840966,
"std_tpot": 0.00077048641504342,
"std_intvty": 1297.8814168237425,
"p99_tpot": 0.014265796017978843,
"p99_intvty": 70.09773578282795,
"mean_itl": 0.011439270977507972,
"median_itl": 0.00985207105986774,
"std_itl": 0.01372432957188017,
"p99_itl": 0.044184574973769486,
"mean_e2el": 10.826664753807382,
"median_e2el": 10.602753890503664,
"std_e2el": 1.4726682556735442,
"p99_e2el": 15.860737412628952
},
{
"hw": "mi355x",
"conc": 64,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 1,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 10506.257890206105,
"output_tput_per_gpu": 5251.944953824208,
"input_tput_per_gpu": 5254.312936381897,
"mean_ttft": 0.15165661979185643,
"median_ttft": 0.06204623499070294,
"std_ttft": 0.2769905090983426,
"p99_ttft": 1.2424272003691295,
"mean_tpot": 0.011760554450672148,
"mean_intvty": 85.03000468169655,
"median_tpot": 0.011874210380250808,
"median_intvty": 84.2161262077014,
"std_tpot": 0.0004689982951955104,
"std_intvty": 2132.2039125603474,
"p99_tpot": 0.01225772152509892,
"p99_intvty": 81.58123008035378,
"mean_itl": 0.012027872219048674,
"median_itl": 0.010571181017439812,
"std_itl": 0.0077798866383765606,
"p99_itl": 0.03963335721811745,
"mean_e2el": 10.985743598067028,
"median_e2el": 11.015348332497524,
"std_e2el": 0.8893576564404421,
"p99_e2el": 12.85538099929225
},
{
"hw": "b200",
"conc": 64,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 1,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 11653.485132246493,
"output_tput_per_gpu": 5825.429289321085,
"input_tput_per_gpu": 5828.055842925408,
"mean_ttft": 0.09958614140668942,
"median_ttft": 0.06319850450381637,
"std_ttft": 0.11122309801412995,
"p99_ttft": 0.5699842003895901,
"mean_tpot": 0.010631311285698908,
"mean_intvty": 94.06177404900053,
"median_tpot": 0.010812397328069177,
"median_intvty": 92.4864273535326,
"std_tpot": 0.0005028575259438399,
"std_intvty": 1988.6348486542925,
"p99_tpot": 0.011189621088878849,
"p99_intvty": 89.3685310750943,
"mean_itl": 0.010636126677587551,
"median_itl": 0.009442799026146531,
"std_itl": 0.005470121624122207,
"p99_itl": 0.035563327837735435,
"mean_e2el": 9.893481029837494,
"median_e2el": 9.976324305986054,
"std_e2el": 0.8021635531722439,
"p99_e2el": 11.306488423643168
},
{
"hw": "b200-trt",
"conc": 8,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 2389.8054261165166,
"output_tput_per_gpu": 1199.3320736348096,
"input_tput_per_gpu": 1190.473352481707,
"mean_ttft": 0.0434009378139308,
"median_ttft": 0.03640865250781644,
"std_ttft": 0.016388668996985135,
"p99_ttft": 0.09326244384603342,
"mean_tpot": 0.0032254584654563727,
"mean_intvty": 310.033445077554,
"median_tpot": 0.0032343376938452255,
"median_intvty": 309.18231015362045,
"std_tpot": 7.819404349307855e-05,
"std_intvty": 12788.697902398617,
"p99_tpot": 0.0033530904160047636,
"p99_intvty": 298.2323397027596,
"mean_itl": 0.06376451069921782,
"median_itl": 0.06196074100444093,
"std_itl": 0.010055983095694394,
"p99_itl": 0.1013346934411675,
"mean_e2el": 3.0379415896633875,
"median_e2el": 3.0629716984840343,
"std_e2el": 0.2181152131818874,
"p99_e2el": 3.4504525485410698
},
{
"hw": "mi325x",
"conc": 4,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 1,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1072.3394027747013,
"output_tput_per_gpu": 533.5606914954052,
"input_tput_per_gpu": 538.7787112792961,
"mean_ttft": 0.08493512502172962,
"median_ttft": 0.0757769278716296,
"std_ttft": 0.03446693739312833,
"p99_ttft": 0.2058377569494769,
"mean_tpot": 0.007211824194592526,
"mean_intvty": 138.6611726821914,
"median_tpot": 0.007289655643977763,
"median_intvty": 137.1806912204604,
"std_tpot": 0.0002554563940438941,
"std_intvty": 3914.5624197144734,
"p99_tpot": 0.007441657801159475,
"p99_intvty": 134.37865952989551,
"mean_itl": 0.007211532874152093,
"median_itl": 0.00716329924762249,
"std_itl": 0.002794335546985931,
"p99_itl": 0.00990391324274241,
"mean_e2el": 6.694845613953658,
"median_e2el": 6.60554672498256,
"std_e2el": 0.49833873557561187,
"p99_e2el": 7.576631340244785
},
{
"hw": "mi325x",
"conc": 4,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 342.74363396318375,
"output_tput_per_gpu": 170.53791912322964,
"input_tput_per_gpu": 172.2057148399541,
"mean_ttft": 0.04665343286469579,
"median_ttft": 0.04122628248296678,
"std_ttft": 0.01650410895390058,
"p99_ttft": 0.09655181460548193,
"mean_tpot": 0.005631899931624608,
"mean_intvty": 177.55997303587293,
"median_tpot": 0.005641750038567823,
"median_intvty": 177.2499655539247,
"std_tpot": 4.117394450099675e-05,
"std_intvty": 24287.20425306329,
"p99_tpot": 0.005677929035184505,
"p99_intvty": 176.1205527232351,
"mean_itl": 0.005631921867332767,
"median_itl": 0.005578185897320509,
"std_itl": 0.001267512618217796,
"p99_itl": 0.007979394905269145,
"mean_e2el": 5.209576830267906,
"median_e2el": 5.2558167623355985,
"std_e2el": 0.355036476490007,
"p99_e2el": 5.8014708161354065
},
{
"hw": "h100",
"conc": 16,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1861.003172925519,
"output_tput_per_gpu": 925.5750560673633,
"input_tput_per_gpu": 935.4281168581557,
"mean_ttft": 0.11339723569108173,
"median_ttft": 0.06382951699197292,
"std_ttft": 0.15075494972624012,
"p99_ttft": 0.654822179172188,
"mean_tpot": 0.008355634718917844,
"mean_intvty": 119.67971717766906,
"median_tpot": 0.008340705188765632,
"median_intvty": 119.89393910564452,
"std_tpot": 0.0002829426455441548,
"std_intvty": 3534.2851837580074,
"p99_tpot": 0.008828514731589777,
"p99_intvty": 113.26933582857906,
"mean_itl": 0.008358601064479776,
"median_itl": 0.007914695888757706,
"std_itl": 0.004174217500025567,
"p99_itl": 0.03841907516121863,
"mean_e2el": 7.760628976405132,
"median_e2el": 7.721256039105356,
"std_e2el": 0.6091390181945031,
"p99_e2el": 8.866010247692465
},
{
"hw": "h200",
"conc": 16,
"image": "nvcr.io#nvidia/tensorrt-llm/release:gpt-oss-dev",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 2,
"dp_attention": "false",
"tput_per_gpu": 1791.290770453509,
"output_tput_per_gpu": 890.9034006046868,
"input_tput_per_gpu": 900.3873698488221,
"mean_ttft": 0.09911735262721777,
"median_ttft": 0.06415218708571047,
"std_ttft": 0.09311570792676002,
"p99_ttft": 0.39069364386377853,
"mean_tpot": 0.008701100612462393,
"mean_intvty": 114.9279895198226,
"median_tpot": 0.008816341152648326,
"median_intvty": 113.42573780730022,
"std_tpot": 0.00031992534725647334,
"std_intvty": 3125.7292008136315,
"p99_tpot": 0.009036620044285685,
"p99_intvty": 110.66084388845704,
"mean_itl": 0.17213631675371832,
"median_itl": 0.16463640006259084,
"std_itl": 0.03081122631399133,
"p99_itl": 0.28091273619327584,
"mean_e2el": 8.063649420109869,
"median_e2el": 8.0557233504951,
"std_e2el": 0.6412331781427206,
"p99_e2el": 9.198544655470178
},
{
"hw": "b200",
"conc": 128,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 1,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 17224.77156686382,
"output_tput_per_gpu": 8602.9361498889,
"input_tput_per_gpu": 8621.83541697492,
"mean_ttft": 0.13488439001500865,
"median_ttft": 0.07648193099885248,
"std_ttft": 0.19623957985757726,
"p99_ttft": 1.0032230378806708,
"mean_tpot": 0.014381313988478871,
"mean_intvty": 69.53467539900164,
"median_tpot": 0.014581028949422505,
"median_intvty": 68.5822655910443,
"std_tpot": 0.0006893982417888165,
"std_intvty": 1450.5403979639536,
"p99_tpot": 0.015038341194932435,
"p99_intvty": 66.49669581489322,
"mean_itl": 0.014387845650380085,
"median_itl": 0.012503028003266081,
"std_itl": 0.006257223930372796,
"p99_itl": 0.03985194519045764,
"mean_e2el": 13.357325679814494,
"median_e2el": 13.344436524494085,
"std_e2el": 1.129824058186566,
"p99_e2el": 15.332616392736382
},
{
"hw": "h200",
"conc": 64,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 3196.0590044751334,
"output_tput_per_gpu": 1597.6693258524524,
"input_tput_per_gpu": 1598.389678622681,
"mean_ttft": 0.11114808736856503,
"median_ttft": 0.058592012617737055,
"std_ttft": 0.16733693597102545,
"p99_ttft": 0.9136731705628336,
"mean_tpot": 0.009658206898325385,
"mean_intvty": 103.5388877591127,
"median_tpot": 0.009791393142916755,
"median_intvty": 102.13051252297181,
"std_tpot": 0.00043626521740452464,
"std_intvty": 2292.1836536712835,
"p99_tpot": 0.01012631273912882,
"p99_intvty": 98.75262849980192,
"mean_itl": 0.009663321167742924,
"median_itl": 0.00852869008667767,
"std_itl": 0.005253605060538153,
"p99_itl": 0.02918123439885676,
"mean_e2el": 9.009269993812268,
"median_e2el": 9.059704059036449,
"std_e2el": 0.735103066390607,
"p99_e2el": 10.346231547296048
},
{
"hw": "b200",
"conc": 64,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 5676.317225017749,
"output_tput_per_gpu": 2837.5189261276137,
"input_tput_per_gpu": 2838.798298890135,
"mean_ttft": 0.05671074498532107,
"median_ttft": 0.0337088837986812,
"std_ttft": 0.06818347805604792,
"p99_ttft": 0.36667685975087805,
"mean_tpot": 0.0054086969923076,
"mean_intvty": 184.8874139967959,
"median_tpot": 0.005448422633714218,
"median_intvty": 183.53935941241306,
"std_tpot": 0.00016600389512593056,
"std_intvty": 6023.955035762263,
"p99_tpot": 0.00560082291549364,
"p99_intvty": 178.54519149921435,
"mean_itl": 0.005412753073491996,
"median_itl": 0.0049374280497431755,
"std_itl": 0.002298693595940763,
"p99_itl": 0.014476373391225926,
"mean_e2el": 5.039225928995438,
"median_e2el": 5.051926935440861,
"std_e2el": 0.3760421433577673,
"p99_e2el": 5.755151600025129
},
{
"hw": "b200",
"conc": 4,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 327.7988231162002,
"output_tput_per_gpu": 163.1018745377637,
"input_tput_per_gpu": 164.6969485784365,
"mean_ttft": 0.027258147550310242,
"median_ttft": 0.024898714502342045,
"std_ttft": 0.008294073453571168,
"p99_ttft": 0.05238244499807479,
"mean_tpot": 0.0029406452336997084,
"mean_intvty": 340.0614220784028,
"median_tpot": 0.002941120750064281,
"median_intvty": 340.0064414146016,
"std_tpot": 2.4468635702046135e-05,
"std_intvty": 40868.64556638837,
"p99_tpot": 0.0029758547764733288,
"p99_intvty": 336.0379034305885,
"mean_itl": 0.0029409369600438087,
"median_itl": 0.0029192439978942275,
"std_itl": 0.0004421252112433081,
"p99_itl": 0.0032442752132192253,
"mean_e2el": 2.722994352800015,
"median_e2el": 2.7383139860030496,
"std_e2el": 0.18484253904000575,
"p99_e2el": 3.04068772976665
},
{
"hw": "mi325x",
"conc": 32,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 2777.584226224664,
"output_tput_per_gpu": 1390.99214907436,
"input_tput_per_gpu": 1386.5920771503038,
"mean_ttft": 0.12571723038417987,
"median_ttft": 0.0697406125254929,
"std_ttft": 0.17583788049353996,
"p99_ttft": 0.9101231224881488,
"mean_tpot": 0.011124067698355864,
"mean_intvty": 89.89517388030639,
"median_tpot": 0.011224127267625294,
"median_intvty": 89.09378663982055,
"std_tpot": 0.000483000578644464,
"std_intvty": 2070.390894368055,
"p99_tpot": 0.011735533275590675,
"p99_intvty": 85.2112960286134,
"mean_itl": 0.01112673224526257,
"median_itl": 0.010254648048430681,
"std_itl": 0.006175799019314619,
"p99_itl": 0.043536216020584106,
"mean_e2el": 10.392248600900349,
"median_e2el": 10.359675646061078,
"std_e2el": 0.8059931903985361,
"p99_e2el": 11.952914400771261
},
{
"hw": "h200",
"conc": 8,
"image": "nvcr.io#nvidia/tensorrt-llm/release:gpt-oss-dev",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 8,
"dp_attention": "false",
"tput_per_gpu": 347.57549500987017,
"output_tput_per_gpu": 174.4319577733308,
"input_tput_per_gpu": 173.14353723653937,
"mean_ttft": 0.11529489612439647,
"median_ttft": 0.07449254300445318,
"std_ttft": 0.10190156680614862,
"p99_ttft": 0.41927989532239734,
"mean_tpot": 0.005505458698979396,
"mean_intvty": 181.6379078795706,
"median_tpot": 0.0055285483944889705,
"median_intvty": 180.87930658196484,
"std_tpot": 0.00016464243088232773,
"std_intvty": 6073.768436489584,
"p99_tpot": 0.005823102534399689,
"p99_intvty": 171.72975988187565,
"mean_itl": 0.10885950128032484,
"median_itl": 0.10277893580496311,
"std_itl": 0.02493709472852975,
"p99_itl": 0.19854081451892863,
"mean_e2el": 5.227609006525017,
"median_e2el": 5.2803558823652565,
"std_e2el": 0.4065552314431933,
"p99_e2el": 6.06891579406336
},
{
"hw": "h200",
"conc": 16,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 828.3721899552592,
"output_tput_per_gpu": 411.9931912621028,
"input_tput_per_gpu": 416.37899869315635,
"mean_ttft": 0.04797198505839333,
"median_ttft": 0.03515802789479494,
"std_ttft": 0.03960074159685243,
"p99_ttft": 0.19349748292937866,
"mean_tpot": 0.004699177532640539,
"mean_intvty": 212.80319652832625,
"median_tpot": 0.004722395820005746,
"median_intvty": 211.75692129906705,
"std_tpot": 8.819205887890343e-05,
"std_intvty": 11338.889382014548,
"p99_tpot": 0.004801976746095727,
"p99_intvty": 208.24757238007354,
"mean_itl": 0.004700448401512026,
"median_itl": 0.004526201635599136,
"std_itl": 0.001635585711339195,
"p99_itl": 0.016565896421670844,
"mean_e2el": 4.348382676072651,
"median_e2el": 4.334473461844027,
"std_e2el": 0.3128613121535538,
"p99_e2el": 4.875354690309614
},
{
"hw": "h100",
"conc": 4,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 739.6644389770205,
"output_tput_per_gpu": 368.03261030412517,
"input_tput_per_gpu": 371.6318286728953,
"mean_ttft": 0.0577410124707967,
"median_ttft": 0.052982146851718426,
"std_ttft": 0.019607974648148317,
"p99_ttft": 0.14059084724634885,
"mean_tpot": 0.005228722087192241,
"mean_intvty": 191.2513197917902,
"median_tpot": 0.005242850075618807,
"median_intvty": 190.73595193010956,
"std_tpot": 0.00012199792399776952,
"std_intvty": 8196.860792633512,
"p99_tpot": 0.005346744913222304,
"p99_intvty": 187.0296818400737,
"mean_itl": 0.005228417567855326,
"median_itl": 0.0051577128469944,
"std_itl": 0.0018008711084706035,
"p99_itl": 0.005322679802775383,
"mean_e2el": 4.850761955510825,
"median_e2el": 4.878568011336029,
"std_e2el": 0.3434843673180606,
"p99_e2el": 5.45191275537014
},
{
"hw": "h200",
"conc": 4,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 1,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 1205.2855643059295,
"output_tput_per_gpu": 599.7103132426957,
"input_tput_per_gpu": 605.5752510632337,
"mean_ttft": 0.09322086737956851,
"median_ttft": 0.08326445147395134,
"std_ttft": 0.03651262068808106,
"p99_ttft": 0.21990989899262786,
"mean_tpot": 0.006407551066910208,
"mean_intvty": 156.06586503292763,
"median_tpot": 0.006456502248978125,
"median_intvty": 154.88262242273217,
"std_tpot": 0.0002853590925938926,
"std_intvty": 3504.3565316600757,
"p99_tpot": 0.006739616785952525,
"p99_intvty": 148.3763887116421,
"mean_itl": 0.006407411843906049,
"median_itl": 0.006325540132820606,
"std_itl": 0.003183980189916284,
"p99_itl": 0.0067286831885576244,
"mean_e2el": 5.967055317619815,
"median_e2el": 5.943884037900716,
"std_e2el": 0.4803241051638904,
"p99_e2el": 6.803628449877724
},
{
"hw": "mi325x",
"conc": 16,
"image": "rocm/7.0:rocm7.0_ubuntu_22.04_vllm_0.10.1_instinct_20250927_rc1",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 8,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 709.6330416443343,
"output_tput_per_gpu": 352.93794866275357,
"input_tput_per_gpu": 356.69509298158073,
"mean_ttft": 0.05460866419307422,
"median_ttft": 0.036223359405994415,
"std_ttft": 0.04458880670863597,
"p99_ttft": 0.24321747589390724,
"mean_tpot": 0.005484161376978709,
"mean_intvty": 182.34328482706178,
"median_tpot": 0.0055313428278475065,
"median_intvty": 180.7879263902261,
"std_tpot": 0.00016402024206782452,
"std_intvty": 6096.808463351048,
"p99_tpot": 0.005681595812423778,
"p99_intvty": 176.00688838395183,
"mean_itl": 0.005490763688738849,
"median_itl": 0.005242266692221165,
"std_itl": 0.00281435055632239,
"p99_itl": 0.018721084482967946,
"mean_e2el": 5.073715627577622,
"median_e2el": 5.063467476516962,
"std_e2el": 0.38584638526735265,
"p99_e2el": 5.732309597670101
},
{
"hw": "h100",
"conc": 64,
"image": "vllm/vllm-openai:v0.11.2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "vllm",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 4033.4545240303964,
"output_tput_per_gpu": 2016.27271625495,
"input_tput_per_gpu": 2017.1818077754465,
"mean_ttft": 0.16672684182703962,
"median_ttft": 0.0730973519384861,
"std_ttft": 0.2821703740361254,
"p99_ttft": 1.5173539397865534,
"mean_tpot": 0.015331196600349433,
"mean_intvty": 65.22648075475125,
"median_tpot": 0.015536693891418188,
"median_intvty": 64.36375763008098,
"std_tpot": 0.0007316627135696066,
"std_intvty": 1366.7499811781308,
"p99_tpot": 0.016199853015076763,
"p99_intvty": 61.72895513739089,
"mean_itl": 0.015340216348744011,
"median_itl": 0.01348150335252285,
"std_itl": 0.008504061665717754,
"p99_itl": 0.04666613228619099,
"mean_e2el": 14.292213659075788,
"median_e2el": 14.38238981924951,
"std_e2el": 1.1914569555255616,
"p99_e2el": 16.45965537374839
},
{
"hw": "h200",
"conc": 32,
"image": "nvcr.io#nvidia/tensorrt-llm/release:gpt-oss-dev",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 1,
"ep": 1,
"dp_attention": "false",
"tput_per_gpu": 4063.7126079634018,
"output_tput_per_gpu": 2035.0750412543484,
"input_tput_per_gpu": 2028.6375667090533,
"mean_ttft": 0.15633064151043072,
"median_ttft": 0.079410869628191,
"std_ttft": 0.20386189963409454,
"p99_ttft": 0.9011444221064449,
"mean_tpot": 0.015258425939882633,
"mean_intvty": 65.53755963688165,
"median_tpot": 0.015398884494118954,
"median_intvty": 64.93976887623995,
"std_tpot": 0.0006623476685897642,
"std_intvty": 1509.7811125826827,
"p99_tpot": 0.01597380663686257,
"p99_intvty": 62.60248560241812,
"mean_itl": 0.3018087166992809,
"median_itl": 0.2861004155129194,
"std_itl": 0.05083913556138468,
"p99_itl": 0.45191171709448097,
"mean_e2el": 14.238536544158706,
"median_e2el": 14.257613439112902,
"std_e2el": 1.1196632638173463,
"p99_e2el": 16.364655145332218
},
{
"hw": "b200-trt",
"conc": 32,
"image": "nvcr.io#nvidia/tensorrt-llm/release:1.2.0rc2",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 2,
"ep": 1,
"dp_attention": "true",
"tput_per_gpu": 7287.449389847127,
"output_tput_per_gpu": 3649.4968513815934,
"input_tput_per_gpu": 3637.952538465534,
"mean_ttft": 0.10975873151764973,
"median_ttft": 0.09494058438576758,
"std_ttft": 0.047701083567479236,
"p99_ttft": 0.22764940907713027,
"mean_tpot": 0.004144237619993291,
"mean_intvty": 241.29890505690133,
"median_tpot": 0.004065766391145584,
"median_intvty": 245.95608891297775,
"std_tpot": 0.0003161979179387228,
"std_intvty": 3162.5761691251673,
"p99_tpot": 0.004562899037119071,
"p99_intvty": 219.15891451137196,
"mean_itl": 0.08192818886812822,
"median_itl": 0.08521746890619397,
"std_itl": 0.012190063778820611,
"p99_itl": 0.11000118218362331,
"mean_e2el": 3.932476688256429,
"median_e2el": 3.9003972506616265,
"std_e2el": 0.37793423291764666,
"p99_e2el": 4.738047747150995
},
{
"hw": "h200",
"conc": 4,
"image": "nvcr.io#nvidia/tensorrt-llm/release:gpt-oss-dev",
"model": "openai/gpt-oss-120b",
"infmax_model_prefix": "gptoss",
"framework": "trt",
"precision": "fp4",
"spec_decoding": "none",
"disagg": false,
"isl": 1024,
"osl": 1024,
"is_multinode": false,
"tp": 4,
"ep": 4,
"dp_attention": "false",
"tput_per_gpu": 388.35418831746114,
"output_tput_per_gpu": 193.23222547603834,
"input_tput_per_gpu": 195.1219628414228,
"mean_ttft": 0.08971246068831533,
"median_ttft": 0.07217608159407973,
"std_ttft": 0.041563753363602934,
"p99_ttft": 0.21258076050318778,
"mean_tpot": 0.004942168627841146,
"mean_intvty": 202.34032371267415,
"median_tpot": 0.00496552599677113,
"median_intvty": 201.3885337928464,
"std_tpot": 0.0001280293668227102,
"std_intvty": 7810.708002522258,
"p99_tpot": 0.005095494030341829,
"p99_intvty": 196.2518244639991,
"mean_itl": 0.09761023334908064,
"median_itl": 0.09628932643681765,
"std_itl": 0.016505959984747586,
"p99_itl": 0.16495122905820608,
"mean_e2el": 4.621267349272967,
"median_e2el": 4.650388915091753,
"std_e2el": 0.3475962791637259,
"p99_e2el": 5.262998614963144