This guide walks you through:
- Cloning NVIDIA TensorRT-LLM
- Building TensorRT-LLM with GDS support
- Running the provided TTFT benchmark script with --kvcache-transfer-mode GDS to compare results
By the end, you will have time-to-first-token (TTFT) benchmark results for two configurations:
- AMG On (enabled)
- AMG Off (disabled)
Here is an example of results from an AMG-ON and AMG-OFF model evaluation that scales the input token length in a two-turn conversation:
-> (T1) is Turn 1 (AMG-ON cache generation TTFT / AMG-OFF no-cache TTFT)
-> (T2) is Turn 2 (AMG-ON cache restore TTFT / AMG-OFF no-cache TTFT)
| Tokens | AMG-ON T1 (ms) | AMG-ON T2 (ms) | AMG-OFF T1 (ms) | AMG-OFF T2 (ms) | T2 performance gain (%) |
|---|---|---|---|---|---|
| 50 | 39.439 | 22.308 | 43.725 | 29.327 | 31.47 |
| 1000 | 58.850 | 28.046 | 56.905 | 56.246 | 100.55 |
| 2000 | 97.152 | 24.662 | 95.072 | 95.206 | 286.05 |
| 8000 | 317.714 | 37.862 | 316.145 | 316.258 | 735.30 |
| 16000 | 641.439 | 51.424 | 633.103 | 634.030 | 1132.96 |
| 24000 | 1007.591 | 69.586 | 984.308 | 983.485 | 1313.34 |
| 32000 | 1367.059 | 97.098 | 1371.541 | 1381.308 | 1322.60 |
| 64000 | 3156.745 | 145.569 | 3143.190 | 3149.996 | 2063.92 |
| 96000 | 5290.473 | 224.575 | 5277.509 | 5279.433 | 2250.86 |
| 128000 | 7860.608 | 254.781 | 7846.349 | 7859.110 | 2984.65 |
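The last column of the table appears to be the relative turn 2 gain of AMG-ON over AMG-OFF. The formula below is inferred from the table values, not taken from the benchmark script, so treat it as an assumption: (AMG-OFF T2 / AMG-ON T2 - 1) × 100. The helper name `percent_gain` is hypothetical. Because the millisecond values in the table are rounded, recomputed results can differ from the table in the last digit.

```python
def percent_gain(amg_off_t2_ms: float, amg_on_t2_ms: float) -> float:
    """Relative turn 2 gain of AMG-ON over AMG-OFF, in percent.

    Assumed formula (inferred from the table, not from the script):
    (AMG-OFF T2 / AMG-ON T2 - 1) * 100.
    """
    if amg_on_t2_ms <= 0:
        raise ValueError("AMG-ON TTFT must be positive")
    return (amg_off_t2_ms / amg_on_t2_ms - 1.0) * 100.0


# Rows copied from the table above: (tokens, AMG-ON T2 ms, AMG-OFF T2 ms).
rows = [(1000, 28.046, 56.246), (128000, 254.781, 7859.110)]
for tokens, on_t2, off_t2 in rows:
    print(f"{tokens} tokens: {percent_gain(off_t2, on_t2):.2f}% gain in turn 2")
```

A gain of 100% under this definition means turn 2 with AMG-ON takes half as long as turn 2 with AMG-OFF.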
You can then compare TTFT with WEKA AMG on and off.
WEKA Augmented Memory Grid (AMG) works with all WEKA Cloud Partner and on-premises AI platforms, such as those deployed in WEKA converged mode or on the dedicated WEKAPOD appliance.
For installation of the WEKA Data Platform, refer to docs.weka.io.
This guide provides end-to-end instructions for users of the WEKA AMG platform on systems with and without GPUDirect Storage compatibility.
WEKA AMG customers may use NVIDIA GPUDirect Storage as an optional dependency for best performance on NVIDIA GPU platforms. For installation of NVIDIA GPUDirect Storage, refer to the NVIDIA Magnum IO GPUDirect Storage installation documentation.
DGX SuperPOD deployed with WEKAPOD is officially certified by NVIDIA as a platform for GPUDirect Storage.
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs pull
Set up the Docker dev environment using docker/Dockerfile.multi:
docker build --pull \
--target devel \
--file docker/Dockerfile.multi \
--tag tensorrt_llm/devel:latest \
.
Set the CUDAVER environment variable to your CUDA version (for example, export CUDAVER=12.6). Then, from the repository’s root ($(pwd) must be the repo root), run:
docker run --gpus all -it --network=host --ipc=host \
-v /mnt/weka:/mnt/weka \
-v "$(pwd)":/code/tensorrt_llm \
-v /usr/local/cuda-$CUDAVER/targets/x86_64-linux/lib:/usr/local/cuda-$CUDAVER/targets/x86_64-linux/lib \
-w /code/tensorrt_llm \
tensorrt_llm/devel:latest
Note: Your Docker image should be built with -DENABLE_CUFILE=ON so that it includes the cuFile libraries for GDS.
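Before building, it can help to confirm that the cuFile libraries are visible inside the container. The following is a minimal sketch, assuming the libraries live under the CUDA target directory that the docker run command mounts; `find_cufile_libs` is a hypothetical helper, and other installations may place the libraries elsewhere.

```python
import glob
import os


def find_cufile_libs(cuda_version: str, cuda_root: str = "/usr/local") -> list[str]:
    """Return libcufile shared objects under the CUDA target lib directory.

    The layout mirrors the path mounted by the docker run command in this
    guide; it is an assumption, not a guarantee for every CUDA install.
    """
    lib_dir = os.path.join(
        cuda_root, f"cuda-{cuda_version}", "targets", "x86_64-linux", "lib"
    )
    return sorted(glob.glob(os.path.join(lib_dir, "libcufile*.so*")))


if __name__ == "__main__":
    # 12.6 is only the example CUDA version used in this guide.
    version = os.environ.get("CUDAVER", "12.6")
    libs = find_cufile_libs(version)
    print("\n".join(libs) if libs else "no libcufile libraries found")
```

An empty result does not prove that GDS is unavailable; it only means nothing matched in that one directory.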
Once in the dev container, run these commands to build TensorRT-LLM:
pip3 uninstall -y tensorrt_llm || true && \
python3 ./scripts/build_wheel.py \
--fast_build --use_ccache --job_count 112 --benchmarks \
--cuda_architectures "90-real" \
--trt_root /usr/local/tensorrt \
--extra-cmake-vars "ENABLE_CUFILE=ON" && \
cd cpp/build && make modelSpec -j"$(nproc)" && cd ../.. && \
pip install ./build/tensorrt_llm*.whl
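After the wheel is installed, a quick check that Python can locate the package may save a failed benchmark run later. This is a small sketch using only the standard library; `is_importable` is a hypothetical helper, and it only confirms that the module can be found, not that it was built with cuFile support.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate the top-level module without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    name = "tensorrt_llm"  # module installed by the wheel built above
    print(f"{name} importable: {is_importable(name)}")
```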
Download an LLM model, for example: https://huggingface.co/meta-llama/Llama-3.1-70B
Download the benchmark script here: https://gist.githubusercontent.com/arthurrasmusson/819ecf189071bf6e059ea51636810183/
Save the benchmark script as a file named AMG-TTFT-Test.py. Pass --kvcache-transfer-mode=GDS and --kvcache-directory=$PATH_TO_WEKA_FS at the command line.
Set $TOKEN_INPUTS_MAXIMUM to the maximum number of tokens to test (for example, 128000 for Llama-3.1).
If your model is bf16, use DATA_TYPE=bf16. Set --flush-gpu-memory-between-tests to either true or false:
python3 AMG-TTFT-Test.py \
--kvcache-transfer-mode=GDS \
--context-step-benchmark=$TOKEN_INPUTS_MAXIMUM \
--kvcache-directory=$PATH_TO_WEKA_FS \
--safetensor-model-path=$PATH_TO_SAFETENSOR_CKPT \
--tokenizer-path=$PATH_TO_TOKENIZER \
--safetensor-dtype=$DATA_TYPE \
--flush-gpu-memory-between-tests=false
Full example:
python3 AMG-TTFT-Test.py \
--safetensor-model-path=/mnt/weka/Models/Quantization/Llama-3.1_70B_fp8/ \
--tokenizer-path=/mnt/weka/Models/Safetensors/Llama-3.1-70B/ \
--safetensor-dtype=fp16 \
--kvcache-transfer-mode=GDS \
--kvcache-directory=/mnt/weka/cuda_kernel_cache \
--context-step-benchmark=128000 \
--secondary-offload-min-priority-amg-on=80 \
--secondary-offload-min-priority-amg-off=100 \
--flush-gpu-memory-between-tests=false \
--retention-duration-secs=999 \
--range-retention-priority=90 \
--decode-retention-priority=90 \
--tensor-parallel-size=8 \
--pipeline-parallel-size=1
TIP: For more detailed log output, including the cache hit percentage, set TLLM_LOG_LEVEL=DEBUG before running the TTFT tester.
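If you prefer to launch the benchmark from Python with debug logging enabled, the sketch below assembles the command line and sets TLLM_LOG_LEVEL=DEBUG in the child environment. The helpers `build_command`, `debug_env`, and `run_benchmark` are hypothetical names, and the options shown are only a subset of the full example above; add the model, tokenizer, and remaining flags your run needs.

```python
import os
import subprocess
from typing import Dict, List, Optional


def build_command(script: str, options: Dict[str, str]) -> List[str]:
    """Turn {"flag-name": "value"} into ["python3", script, "--flag-name=value", ...]."""
    return ["python3", script] + [f"--{flag}={value}" for flag, value in options.items()]


def debug_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy the environment (os.environ by default) and enable debug logging."""
    env = dict(os.environ if base is None else base)
    env["TLLM_LOG_LEVEL"] = "DEBUG"
    return env


def run_benchmark(cmd: List[str]) -> subprocess.CompletedProcess:
    """Run the benchmark command with debug logging; raises if it exits non-zero."""
    return subprocess.run(cmd, env=debug_env(), check=True)


if __name__ == "__main__":
    # Subset of the flags from the full example in this guide.
    options = {
        "kvcache-transfer-mode": "GDS",
        "kvcache-directory": "/mnt/weka/cuda_kernel_cache",
        "context-step-benchmark": "128000",
    }
    cmd = build_command("AMG-TTFT-Test.py", options)
    print(" ".join(cmd))  # inspect the command; call run_benchmark(cmd) to launch it
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.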
You will find the evaluation results at /code/tensorrt_llm/benchmark_results.txt.
Passing --kvcache-transfer-mode=GDS enables GPUDirect Storage (GDS) offload paths for KV cache blocks. The script should:
- Load the engine twice (AMG On and AMG Off).
- Run a two-turn conversation (or similar) in streaming mode.
- Measure time to first token (TTFT) with GPUDirect Storage offload for both the WEKA AMG-On and AMG-Off engines.
Check the TTFT measurements for “AMG On” (paged context FMHA enabled) versus “AMG Off” (paged context FMHA disabled).
You can compare:
- How quickly each engine produces the first token in a streaming scenario.
- Improvements in retention length of the large language model key-value (KV) cache.
To evaluate the results:
- Review the script’s console output. It should display TTFT for each engine (both turns, if applicable).
- Look at the percentage difference in TTFT. A result line might read:
128000,7860.608,254.781,7846.349,7859.110,2984.65
The columns are:
token_count, AMG_ON_turn1_ms, AMG_ON_turn2_ms, AMG_OFF_turn1_ms, AMG_OFF_turn2_ms, percent_diff_turn2
- Draw conclusions about how GPUDirect Storage and WEKA AMG affect latency.
token_count is the input token count for the current test step (i.e., 128000 tokens in the example).
AMG_ON_turn1_ms shows the turn 1 time to first token (spent mostly in prefill).
AMG_ON_turn2_ms shows the turn 2 time to first token with cache data restored from Augmented Memory Grid (254.781 ms).
AMG_OFF_turn1_ms and AMG_OFF_turn2_ms show the turn 1 and turn 2 time to first token without the cache.
percent_diff_turn2 shows the percentage improvement of AMG-ON over AMG-OFF in turn 2 (i.e., a 2984.65% improvement in the example above).
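To work with result lines programmatically, the six comma-separated fields can be parsed into a small record. This is a sketch that assumes each result line follows the column layout shown above; the `TtftResult` class and `parse_result_line` function are hypothetical names, and the actual results file may contain other lines (headers, logs) that need filtering first.

```python
from dataclasses import dataclass


@dataclass
class TtftResult:
    token_count: int
    amg_on_turn1_ms: float
    amg_on_turn2_ms: float
    amg_off_turn1_ms: float
    amg_off_turn2_ms: float
    percent_diff_turn2: float

    @property
    def turn2_speedup(self) -> float:
        """How many times faster turn 2 is with AMG-ON than with AMG-OFF."""
        return self.amg_off_turn2_ms / self.amg_on_turn2_ms


def parse_result_line(line: str) -> TtftResult:
    """Parse one 'token_count,on_t1,on_t2,off_t1,off_t2,percent' line."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 6:
        raise ValueError(f"expected 6 comma-separated fields, got {len(fields)}")
    return TtftResult(int(fields[0]), *(float(field) for field in fields[1:]))


# Example line taken from this guide.
result = parse_result_line("128000,7860.608,254.781,7846.349,7859.110,2984.65")
print(f"{result.token_count} tokens: turn 2 speedup {result.turn2_speedup:.1f}x with AMG-ON")
```

The speedup factor is an alternative way to read the same comparison as percent_diff_turn2, expressed as a ratio rather than a percentage.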
Thanks for using WEKA!
If you'd like to run this setup with vLLM+LMCache, see this link: https://gist.github.com/4D0R/64a2ecda5dcef1e9405de5d44edf12c0