@arthurrasmusson
Last active May 29, 2025 21:16

Setup Guide, WEKA Augmented Memory Grid

This guide walks you through:

  1. Cloning Nvidia TensorRT-LLM
  2. Building TensorRT-LLM with GDS support
  3. Running the provided TTFT benchmark script with --kvcache-transfer-mode GDS to compare results

By the end, you will have TTFT benchmark results for two configurations:

  • AMG On (enabled)
  • AMG Off (disabled)

Here is an example of results from an AMG-ON and AMG-OFF model evaluation that scales the input token length in a two-turn conversation:

FP8 Inference (Llama‑3.1‑70B, 8x H100)

  • t1 (turn 1): AMG-ON cache-generation TTFT / AMG-OFF no-cache TTFT
  • t2 (turn 2): AMG-ON cache-restore TTFT / AMG-OFF no-cache TTFT

Tokens  AMG-ON t1 (ms)  AMG-ON t2 (ms)  AMG-OFF t1 (ms)  AMG-OFF t2 (ms)  t2 gain (%)
    50          39.439          22.308           43.725           29.327        31.47
  1000          58.850          28.046           56.905           56.246       100.55
  2000          97.152          24.662           95.072           95.206       286.05
  8000         317.714          37.862          316.145          316.258       735.30
 16000         641.439          51.424          633.103          634.030      1132.96
 24000        1007.591          69.586          984.308          983.485      1313.34
 32000        1367.059          97.098         1371.541         1381.308      1322.60
 64000        3156.745         145.569         3143.190         3149.996      2063.92
 96000        5290.473         224.575         5277.509         5279.433      2250.86
128000        7860.608         254.781         7846.349         7859.110      2984.65

Comparing the turn 2 columns shows the effect of WEKA AMG on TTFT. At 128000 tokens, turn 2 TTFT drops from 7859.110 ms with AMG-OFF to 254.781 ms with AMG-ON, a 2984.65% improvement (about 30.8x faster).
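The gain column appears to be the relative reduction in turn 2 TTFT, measured against the AMG-ON time. That formula is inferred from the table rows, not taken from the benchmark script. A minimal Python sketch that reproduces the published figures to within rounding (both function names are hypothetical):

```python
def t2_gain_percent(amg_on_t2_ms: float, amg_off_t2_ms: float) -> float:
    """Percentage gain of AMG-ON over AMG-OFF at turn 2.

    Assumed formula (inferred from the table): (off - on) / on * 100.
    """
    return (amg_off_t2_ms - amg_on_t2_ms) / amg_on_t2_ms * 100.0


def t2_speedup(amg_on_t2_ms: float, amg_off_t2_ms: float) -> float:
    """How many times faster turn 2 is with AMG-ON than with AMG-OFF."""
    return amg_off_t2_ms / amg_on_t2_ms


# Turn 2 values from the 128000-token row of the table.
gain = t2_gain_percent(254.781, 7859.110)
speedup = t2_speedup(254.781, 7859.110)
print(f"gain: {gain:.2f}%  speedup: {speedup:.1f}x")
```

A percentage gain of N% corresponds to a speedup of (1 + N/100)x, which is why a gain near 3000% reads as roughly a 31x faster first token.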


Supported Platforms

WEKA Augmented Memory Grid (AMG) works with all WEKA Cloud Partner and on-premises AI platforms, including those deployed in WEKA converged mode and those using the dedicated WEKAPOD appliance.

WEKA Data Platform (Required)

To install the WEKA Data Platform, refer to docs.weka.io.

Nvidia GPUDirect Storage (Enhancement/Optional)

This guide supports WEKA AMG users end to end, on systems with or without GPUDirect Storage compatibility.

WEKA AMG customers may use Nvidia GPUDirect Storage as an optional dependency to get the best performance on Nvidia GPU platforms. To install Nvidia GPUDirect Storage, refer to the Nvidia Magnum IO GPUDirect Storage installation documentation.

DGX SuperPOD deployed with WEKAPOD is officially certified by Nvidia as a platform for GPUDirect Storage.

1. Clone Source Code

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM

git submodule update --init --recursive
git lfs pull

2. (Dev build) Start the TRT-LLM Development Docker Container with GDS

Set up the Docker dev environment using docker/Dockerfile.multi.

docker build --pull  \
            --target devel \
            --file docker/Dockerfile.multi \
            --tag tensorrt_llm/devel:latest \
            .

Set CUDAVER to your installed CUDA version (for example, export CUDAVER=12.6). Then, from the repository’s root ($(pwd) must be the repo root), run:

docker run --gpus all -it --network=host --ipc=host \
    -v /mnt/weka:/mnt/weka \
    -v "$(pwd)":/code/tensorrt_llm \
    -v /usr/local/cuda-$CUDAVER/targets/x86_64-linux/lib:/usr/local/cuda-$CUDAVER/targets/x86_64-linux/lib \
    -w /code/tensorrt_llm \
    tensorrt_llm/devel:latest

Note: Your Docker image should be built with -DENABLE_CUFILE=ON, so it includes cuFile libraries for GDS.
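Before building, it can help to confirm that a cuFile shared library is visible inside the container. The sketch below is a convenience check under stated assumptions, not part of TensorRT-LLM: the helper name find_cufile is hypothetical, it assumes the library is named libcufile.so, and it looks in the CUDA lib directory mounted by the docker run command above.

```python
import ctypes.util
import glob
import os


def find_cufile(cuda_version: str = "12.6") -> list:
    """Return candidate cuFile library locations (empty list if none found).

    Hypothetical helper: assumes the library is named libcufile.so* and that
    CUDA lives under /usr/local/cuda-<version>, as in the docker run mount.
    """
    lib_dir = f"/usr/local/cuda-{cuda_version}/targets/x86_64-linux/lib"
    hits = sorted(glob.glob(os.path.join(lib_dir, "libcufile.so*")))
    # Fall back to the dynamic linker's search path.
    system_hit = ctypes.util.find_library("cufile")
    if system_hit and system_hit not in hits:
        hits.append(system_hit)
    return hits


found = find_cufile(os.environ.get("CUDAVER", "12.6"))
print("cuFile libraries:", found or "none found")
```

An empty result only means the library was not found in these two places; it does not prove that GDS is unavailable.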

Once inside the dev container, run these commands to build TensorRT-LLM:

pip3 uninstall -y tensorrt_llm || true && \
python3 ./scripts/build_wheel.py \
        --fast_build --use_ccache --job_count 112 --benchmarks \
        --cuda_architectures "90-real" \
        --trt_root /usr/local/tensorrt \
        --extra-cmake-vars "ENABLE_CUFILE=ON" && \
cd cpp/build && make modelSpec -j"$(nproc)" && cd ../.. && \
pip install ./build/tensorrt_llm*.whl

3. Run the Benchmark Script with GDS Arguments

Download an LLM model, for example: https://huggingface.co/meta-llama/Llama-3.1-70B

Download the benchmark script here: https://gist.githubusercontent.com/arthurrasmusson/819ecf189071bf6e059ea51636810183/

Save the benchmark script as a file named AMG-TTFT-Test.py. When you run it, pass --kvcache-transfer-mode=GDS and point --kvcache-directory at a directory on your WEKA filesystem.

3.1 Start Evaluation Program

Set TOKEN_INPUTS_MAXIMUM to the maximum number of tokens to test (for example, 128000 for Llama-3.1).

Load the Safetensors

Set DATA_TYPE to your model's data type (for example, bf16 for a bf16 model), then run:

python3 AMG-TTFT-Test.py \
  --kvcache-transfer-mode=GDS \
  --context-step-benchmark="$TOKEN_INPUTS_MAXIMUM" \
  --kvcache-directory="$PATH_TO_WEKA_FS" \
  --safetensor-model-path="$PATH_TO_SAFETENSOR_CKPT" \
  --tokenizer-path="$PATH_TO_TOKENIZER" \
  --safetensor-dtype="$DATA_TYPE" \
  --flush-gpu-memory-between-tests=false

--flush-gpu-memory-between-tests accepts true or false.

Full Example:

python3 AMG-TTFT-Test.py \
     --safetensor-model-path=/mnt/weka/Models/Quantization/Llama-3.1_70B_fp8/ \
     --tokenizer-path=/mnt/weka/Models/Safetensors/Llama-3.1-70B/ \
     --safetensor-dtype=fp16 \
     --kvcache-transfer-mode=GDS \
     --kvcache-directory=/mnt/weka/cuda_kernel_cache \
     --context-step-benchmark=128000 \
     --secondary-offload-min-priority-amg-on=80 \
     --secondary-offload-min-priority-amg-off=100 \
     --flush-gpu-memory-between-tests=false \
     --retention-duration-secs=999 \
     --range-retention-priority=90 \
     --decode-retention-priority=90 \
     --tensor-parallel-size=8 \
     --pipeline-parallel-size=1
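
If you sweep several configurations, assembling the command in Python avoids quoting and line-continuation mistakes. A minimal sketch, assuming the script is saved as AMG-TTFT-Test.py in the current directory; the helper name build_ttft_command and its parameter names are hypothetical, while the flags are the ones shown above.

```python
import shlex
import sys


def build_ttft_command(model_path, tokenizer_path, kvcache_dir,
                       max_tokens=128000, dtype="fp16",
                       transfer_mode="GDS", tp_size=8, pp_size=1,
                       script="AMG-TTFT-Test.py"):
    """Build the argv list for the benchmark script (hypothetical helper)."""
    return [
        sys.executable, script,
        f"--safetensor-model-path={model_path}",
        f"--tokenizer-path={tokenizer_path}",
        f"--safetensor-dtype={dtype}",
        f"--kvcache-transfer-mode={transfer_mode}",
        f"--kvcache-directory={kvcache_dir}",
        f"--context-step-benchmark={max_tokens}",
        f"--tensor-parallel-size={tp_size}",
        f"--pipeline-parallel-size={pp_size}",
    ]


cmd = build_ttft_command(
    model_path="/mnt/weka/Models/Quantization/Llama-3.1_70B_fp8/",
    tokenizer_path="/mnt/weka/Models/Safetensors/Llama-3.1-70B/",
    kvcache_dir="/mnt/weka/cuda_kernel_cache",
)
# Print a copy-pasteable shell command; pass `cmd` to subprocess.run to execute.
print(shlex.join(cmd))
```

Passing the arguments as a list means no shell parsing is involved, so paths containing spaces need no extra quoting.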

TIP: For more detailed log output, including the cache hit percentage, set TLLM_LOG_LEVEL=DEBUG before running the TTFT tester.
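
When launching the script from Python, the log level can be set for the child process only, leaving the parent environment untouched. A small sketch; the helper name debug_env is hypothetical, and TLLM_LOG_LEVEL is the variable named in the tip above.

```python
import os


def debug_env(base=None):
    """Return a copy of the environment with TLLM_LOG_LEVEL=DEBUG set."""
    env = dict(os.environ if base is None else base)
    env["TLLM_LOG_LEVEL"] = "DEBUG"
    return env


env = debug_env()
# Pass it along when starting the benchmark, for example:
#   subprocess.run(["python3", "AMG-TTFT-Test.py"], env=env, check=True)
print(env["TLLM_LOG_LEVEL"])
```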

You will find the evaluation results at /code/tensorrt_llm/benchmark_results.txt

Passing --kvcache-transfer-mode=GDS enables GPUDirect Storage (GDS) offload paths for KV cache blocks. The script should:

  • Load the engine twice (AMG On and AMG Off).
  • Run a two-turn conversation (or similar) in streaming mode.
  • Measure Time-To-First-Token (TTFT) for each engine, AMG-On and AMG-Off, with GPUDirect Storage offload.

Compare the TTFT measurements for “AMG On” (paged context FMHA enabled) with those for “AMG Off” (paged context FMHA disabled).

You can compare:

  • How quickly each engine produces the first token in a streaming scenario.
  • How much longer the Large Language Model key-value (KV) cache is retained.
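
For reference, TTFT in a streaming scenario is the wall-clock time from submitting the request until the first token arrives. The sketch below is a generic illustration, not the benchmark script's implementation: measure_ttft_ms is a hypothetical helper that times any Python iterator of tokens, and fake_stream stands in for a real streaming generation call.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_ttft_ms(start_stream: Callable[[], Iterable[str]]) -> Tuple[float, str]:
    """Time from request start to first yielded token, in milliseconds."""
    start = time.perf_counter()
    stream = iter(start_stream())
    first_token = next(stream)  # blocks until the first token is produced
    ttft_ms = (time.perf_counter() - start) * 1000.0
    return ttft_ms, first_token


def fake_stream(prefill_seconds: float = 0.02):
    """Stand-in for a streaming engine: sleep to mimic pre-fill, then yield."""
    time.sleep(prefill_seconds)
    yield "Hello"
    yield " world"


ttft_ms, token = measure_ttft_ms(fake_stream)
print(f"TTFT: {ttft_ms:.3f} ms, first token: {token!r}")
```

Only the first token is timed; the rest of the stream is left unconsumed, which mirrors the fact that TTFT is dominated by pre-fill rather than by decoding.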

Compare Results

  1. Review the script’s console output. It should display the TTFT for each engine (both turns, if applicable).
  2. Look at the percentage difference in TTFT. A result line looks like this:
    128000,7860.608,254.781,7846.349,7859.110,2984.65
    The fields are, in order:
    token_count, AMG_ON_turn1_ms, AMG_ON_turn2_ms, AMG_OFF_turn1_ms, AMG_OFF_turn2_ms, percent_diff_turn2
  3. Draw conclusions about how GPUDirect Storage and WEKA AMG affect latency.

Note:

token_count is the step size of the current test, in tokens (e.g., 128000 tokens in the example).

AMG_ON_turn1_ms is the turn 1 time to first token, spent mostly in pre-fill.

AMG_ON_turn2_ms is the turn 2 time to first token with cache data restored from Augmented Memory Grid (254.781 ms in the example).

AMG_OFF_turn1_ms and AMG_OFF_turn2_ms are the turn 1 and turn 2 times to first token with no cache.

percent_diff_turn2 is the percentage improvement of turn 2 AMG-ON over turn 2 AMG-OFF (e.g., a 2984.65% improvement in the example above).
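
To post-process the results, each line can be parsed into named fields. A minimal sketch, assuming benchmark_results.txt holds comma-separated lines in the column order listed above; the TtftResult class and parse_results function are hypothetical names, and any line that does not parse as six numeric fields (such as a header) is skipped.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TtftResult:
    token_count: int
    amg_on_turn1_ms: float
    amg_on_turn2_ms: float
    amg_off_turn1_ms: float
    amg_off_turn2_ms: float
    percent_diff_turn2: float


def parse_results(text: str) -> List[TtftResult]:
    """Parse CSV result lines; skip blanks, headers, and malformed lines."""
    results = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 6:
            continue
        try:
            results.append(TtftResult(int(fields[0]), *map(float, fields[1:])))
        except ValueError:
            continue  # non-numeric line, e.g. a header row
    return results


sample = """token_count, AMG_ON_turn1_ms, AMG_ON_turn2_ms, AMG_OFF_turn1_ms, AMG_OFF_turn2_ms, percent_diff_turn2
128000,7860.608,254.781,7846.349,7859.110,2984.65
"""
for row in parse_results(sample):
    print(row.token_count, row.amg_on_turn2_ms, row.percent_diff_turn2)
```

To process a real run, read /code/tensorrt_llm/benchmark_results.txt and pass its contents to parse_results.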


Thanks for using WEKA!

If you'd like to run this setup with vLLM+LMCache see this link: https://gist.github.com/4D0R/64a2ecda5dcef1e9405de5d44edf12c0
