Profile RAM and NVIDIA GPU VRAM on Windows
$username = "jan"
# Define paths using the username variable.
# Note: backslashes are doubled so the model paths remain valid JSON string escapes
# when they are interpolated into the load-model request bodies below.
$nitroPath1 = "C:\\Users\\$username\\jan\\engines\\nitro-tensorrt-llm\\0.1.8\\ampere\\nitro.exe"
$nitroPath2 = "C:\\Users\\$username\\jan\\extensions\\@janhq\\inference-nitro-extension\\dist\\bin\\win-cuda-12-0\\nitro.exe"
$modelPath1 = "C:\\Users\\$username\\jan\\models\\mistral-7b-instruct-int4"
$modelPath2 = "C:\\Users\\$username\\jan\\models\\mistral-ins-7b-q4\\mistral-7b-instruct-v0.2.Q4_K_M.gguf"
# Function to get current RAM and VRAM usage
function Get-MemoryUsage {
    # RAM: total working set of all nitro.exe processes, in bytes (0 if nitro is not running)
    $ram = 0
    foreach ($p in @(Get-Process -Name "nitro" -ErrorAction SilentlyContinue)) { $ram += $p.WorkingSet64 }
    # VRAM: whole-GPU used memory reported by nvidia-smi, in MiB (nounits strips the unit label)
    $vramOutput = & "nvidia-smi" --query-gpu=memory.used --format=csv,noheader,nounits
    Write-Host "VRAM Output: $vramOutput MiB" # Write-Host so this diagnostic is not mixed into the return value
    $vram = if ($vramOutput) { [int](@($vramOutput)[0].Trim()) } else { 0 } # Default to 0 if null or empty; first GPU only
    return @{ RAM = $ram; VRAM = $vram }
}
# Function to perform load model operation and check response
function Load-Model {
    param (
        [string]$uri,
        [string]$body
    )
    # Print JSON input in a formatted manner
    $jsonBody = $body | ConvertFrom-Json | ConvertTo-Json
    Write-Output "Sending JSON request body:"
    Write-Output $jsonBody
    try {
        $response = Invoke-WebRequest -Uri $uri -Method Post -ContentType "application/json" -Body $body
    } catch {
        # Invoke-WebRequest throws on non-2xx responses, so failures land here
        Write-Output "Failed to load model: $($_.Exception.Message)"
        exit
    }
    if ($response.StatusCode -eq 200) {
        Write-Output "Model loaded successfully."
        Start-Sleep -Seconds 3 # Ensure the model is ready
        # Print the response body if status code is 200
        $responseContent = $response.Content | ConvertFrom-Json | ConvertTo-Json
        Write-Output "Response Body:"
        Write-Output $responseContent
    } else {
        Write-Output "Failed to load model. Status code: $($response.StatusCode)"
        exit
    }
}
# Function to start Nitro, perform actions, and monitor memory usage
function Start-Nitro {
    param (
        [string]$nitroPath,
        [string]$modelType
    )
    # Start Nitro
    Start-Process -FilePath $nitroPath
    # Get memory usage after starting Nitro
    Start-Sleep -Seconds 5
    $memoryAfterNitro = Get-MemoryUsage
    Write-Output "RAM after starting Nitro: $($memoryAfterNitro.RAM) bytes"
    Write-Output "VRAM after starting Nitro: $($memoryAfterNitro.VRAM) MiB"
    # Determine the correct load model request
    $webRequestUri = $null
    $webRequestBody = $null
    if ($modelType -eq "tensorrt_llm") {
        $webRequestUri = "http://localhost:3928/inferences/tensorrtllm/loadmodel"
        $webRequestBody = @"
{
    "engine_path": "$modelPath1"
}
"@
    } else {
        $webRequestUri = "http://localhost:3928/inferences/llamacpp/loadmodel"
        $webRequestBody = @"
{
    "llama_model_path": "$modelPath2"
}
"@
    }
    # Load model and ensure it's ready
    Load-Model -uri $webRequestUri -body $webRequestBody
    # Monitor memory usage for 30 seconds (3-second interval) and calculate peak/average
    $ramReadings = @()
    $vramReadings = @()
    $endTime = (Get-Date).AddSeconds(30)
    while ((Get-Date) -lt $endTime) {
        Start-Sleep -Seconds 3
        $currentMemory = Get-MemoryUsage
        $ramReadings += $currentMemory.RAM
        $vramReadings += $currentMemory.VRAM
        Write-Output "Current RAM: $($currentMemory.RAM) bytes"
        Write-Output "Current VRAM: $($currentMemory.VRAM) MiB"
    }
    # Calculate peak and average for RAM and VRAM
    $peakRAM = ($ramReadings | Measure-Object -Maximum).Maximum
    $averageRAM = ($ramReadings | Measure-Object -Average).Average
    $peakVRAM = ($vramReadings | Measure-Object -Maximum).Maximum
    $averageVRAM = ($vramReadings | Measure-Object -Average).Average
    Write-Output "Peak RAM Usage: $peakRAM bytes"
    Write-Output "Average RAM Usage: $averageRAM bytes"
    Write-Output "Peak VRAM Usage: $peakVRAM MiB"
    Write-Output "Average VRAM Usage: $averageVRAM MiB"
}
# Execute for the first Nitro with type tensorrt_llm
# Start-Nitro -nitroPath $nitroPath1 -modelType "tensorrt_llm"
# Execute for the second Nitro with type llamacpp
Start-Nitro -nitroPath $nitroPath2 -modelType "llamacpp"
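Note that memory.used is a whole-GPU figure, so any other process using the GPU inflates the VRAM readings above. If other GPU consumers cannot be closed, a per-process variant along the lines of the sketch below could be swapped into Get-MemoryUsage. This is not part of the gist; it assumes nvidia-smi's --query-compute-apps output lists nitro.exe and that a single GPU is in use.
# Hypothetical per-process VRAM helper (illustrative sketch, not from the original gist)
function Get-NitroVram {
    # Each nvidia-smi row looks like: "<pid>, <process_name>, <used_memory in MiB>"
    $rows = & "nvidia-smi" --query-compute-apps=pid,process_name,used_memory --format=csv,noheader,nounits
    $nitroPids = @((Get-Process -Name "nitro" -ErrorAction SilentlyContinue).Id)
    $total = 0
    foreach ($row in @($rows)) {
        $parts = $row -split ",\s*"
        if ($parts.Count -ge 3 -and $nitroPids -contains [int]$parts[0]) { $total += [int]$parts[2] }
    }
    return $total  # MiB attributed to nitro.exe only
}
Returning this value in place of the memory.used query keeps the peak/average VRAM numbers attributable to nitro alone.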
hiro-v commented Apr 8, 2024

Steps to benchmark the Jan app on Windows 10/11:

  1. Open Jan App
  2. Install the necessary extensions (e.g. TensorRT-LLM)
  3. Go to Hub and find the models - in this case we want to benchmark llama.cpp Q4 and TensorRT-LLM INT4
  4. Run the gist above in PowerShell with Admin privileges to profile Nitro RAM and system VRAM (please make sure nitro is the only GPU consumer by checking nvidia-smi for other programs). After the model loads it samples for 30 seconds at a 3-second interval by default.
  5. Server -> Enable it and make sure you can access http://localhost:1337 (a quick reachability check is sketched after the results below)
  6. Run llmperf benchmark:
  • Clone repo
git clone https://github.com/ray-project/llmperf
cd llmperf/
  • Patch the code, since Windows set does not play nicely with the Python environment variables that llmperf reads via os
    • Open token_benchmark_ray.py
    • Add the following at line 27 (make sure the indentation is correct):
os.environ['OPENAI_API_BASE'] = 'http://localhost:1337/v1'
os.environ['OPENAI_API_KEY'] = 'abc'
  • Set up the Python environment and install llmperf
conda create -n llmperf python=3.10
conda activate llmperf
pip install -e .
  7. Run the benchmark scripts (llmperf and the ps1 script have to run at the same time, right after you click start on the Jan server - a sketch for running both side by side follows the results below)
  • For mistral-ins-7b-q4 ~ llama.cpp Q4
python token_benchmark_ray.py --model "mistral-ins-7b-q4" --mean-input-tokens 2048 --stddev-input-tokens 150 --mean-output-tokens 512 --stddev-output-tokens 10 --max-num-completed-requests 2 --timeout 600 --num-concurrent-requests 1 --results-dir "result_outputs" --llm-api openai --additional-sampling-params '{}'
  • For mistral-7b-instruct-int4 ~ TensorRT-LLM INT4
python token_benchmark_ray.py --model "mistral-7b-instruct-int4" --mean-input-tokens 2048 --stddev-input-tokens 150 --mean-output-tokens 512 --stddev-output-tokens 10 --max-num-completed-requests 2 --timeout 600 --num-concurrent-requests 1 --results-dir "result_outputs" --llm-api openai --additional-sampling-params '{}'
  8. Get the logs from the scripts
  • RAM/VRAM usage from the PowerShell terminal, in the form of:
[screenshot of the PowerShell RAM/VRAM readings]
  • llmperf benchmark results, in the form of:
inter_token_latency_s
    p25 = 0.01767195400281389
    p50 = 0.017797033005634153
    p75 = 0.017922112008454415
    p90 = 0.017997159410146568
    p95 = 0.01802217521071062
    p99 = 0.01804218785116186
    mean = 0.017797033005634153
    min = 0.017546874999993634
    max = 0.018047191011274673
    stddev = 0.00035377684431302806
ttft_s
    p25 = 2.926000000021304
    p50 = 2.930000000022119
    p75 = 2.934000000022934
    p90 = 2.936400000023423
    p95 = 2.937200000023586
    p99 = 2.937840000023716
    mean = 2.930000000022119
    min = 2.922000000020489
    max = 2.9380000000237487
    stddev = 0.011313708501289666
end_to_end_latency_s
    p25 = 8.281250000014552
    p50 = 8.515500000008615
    p75 = 8.749750000002678
    p90 = 8.890299999999115
    p95 = 8.937149999997928
    p99 = 8.974629999996978
    mean = 8.515500000008615
    min = 8.047000000020489
    max = 8.98399999999674
    stddev = 0.6625590539550021
request_output_throughput_token_per_s
    p25 = 54.75439761117469
    p50 = 55.2028426935949
    p75 = 55.65128777601511
    p90 = 55.920354825467236
    p95 = 56.01004384195127
    p99 = 56.08179505513851
    mean = 55.2028426935949
    min = 54.30595252875448
    max = 56.09973285843532
    stddev = 1.2683942350763595
number_input_tokens
    p25 = 2245.75
    p50 = 2262.5
    p75 = 2279.25
    p90 = 2289.3
    p95 = 2292.65
    p99 = 2295.33
    mean = 2262.5
    min = 2229
    max = 2296
    stddev = 47.37615433949868
number_output_tokens
    p25 = 461.75
    p50 = 478.5
    p75 = 495.25
    p90 = 505.3
    p95 = 508.65
    p99 = 511.33
    mean = 478.5
    min = 445
    max = 512
    stddev = 47.37615433949868
Number Of Errored Requests: 0
Overall Output Throughput: 51.903677188348375
Number Of Completed Requests: 2 
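For step 5, a quick way to confirm the Jan local server is actually reachable before starting llmperf is a one-off request like the sketch below (my addition, not part of the original steps; it assumes the server exposes the OpenAI-compatible /v1/models route - adjust the port if you use 3928):
# Hypothetical pre-flight check for the Jan local server
try {
    $resp = Invoke-WebRequest -Uri "http://localhost:1337/v1/models" -Method Get -TimeoutSec 5
    Write-Output "Jan server reachable (status $($resp.StatusCode))"
} catch {
    Write-Output "Jan server not reachable on port 1337 - enable it in Jan before running llmperf"
}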
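Step 7 requires the profiler and llmperf to run at the same time. One way to do that from a single PowerShell window is sketched below; profile_nitro.ps1 is a placeholder name for wherever you saved the gist, and this is only one assumed way to drive the two together:
# Hypothetical side-by-side run: profiler in a background job, llmperf in the foreground
$profiler = Start-Job -FilePath .\profile_nitro.ps1
# ...run the llmperf command from step 7 in this window while the job samples RAM/VRAM...
Receive-Job -Job $profiler -Wait   # afterwards, collect the profiler's output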

hiro-v commented Apr 8, 2024

New instructions:

  1. Open Jan App -> Hub -> make sure the TensorRT-LLM extension is installed and both Mistral models (Q4 and INT4) are downloaded
  2. Close Jan App and any other applications if possible
  3. Point llmperf at http://localhost:3928/v1 by updating the OPENAI_API_BASE value in token_benchmark_ray.py
  4. Download the gist above -> open it and update the username; it looks for the binaries under C:\Users\<username>\jan\
  5. In PowerShell ISE run as Admin, run Set-ExecutionPolicy RemoteSigned, then Get-ExecutionPolicy to verify it is no longer Restricted
  6. Run the ps1 gist in PowerShell ISE as Admin (scroll down in the script and comment/uncomment the llama.cpp / tensorrt_llm call as needed); the output looks like:
 Current VRAM: 1 %
Peak RAM Usage: 592707584 bytes
Average RAM Usage: 560823091.2 bytes
Peak VRAM Usage: 1 %
Average VRAM Usage: 0.3 % 
  7. Once the "Model loaded successfully" log from step 6 shows up, run the benchmark script (make sure the correct conda env is active and the dependencies are installed)
  • llama.cpp
python token_benchmark_ray.py --model "mistral-ins-7b-q4" --mean-input-tokens 2048 --stddev-input-tokens 150 --mean-output-tokens 512 --stddev-output-tokens 10 --max-num-completed-requests 2 --timeout 600 --num-concurrent-requests 1 --results-dir "result_outputs" --llm-api openai --additional-sampling-params '{}'
  • tensorrt_llm
python token_benchmark_ray.py --model "mistral-7b-instruct-int4" --mean-input-tokens 2048 --stddev-input-tokens 150 --mean-output-tokens 512 --stddev-output-tokens 10 --max-num-completed-requests 2 --timeout 600 --num-concurrent-requests 1 --results-dir "result_outputs" --llm-api openai --additional-sampling-params "{}"
  8. Manually close the Nitro process, then repeat steps 6 and 7 for additional runs (a cleanup sketch follows this list)
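For step 8, something like the following stops nitro.exe and lets you confirm the GPU memory has been released before the next run (my addition, not part of the original steps):
# Hypothetical cleanup between runs
Get-Process -Name "nitro" -ErrorAction SilentlyContinue | Stop-Process -Force
Start-Sleep -Seconds 2                                          # give the driver a moment to free VRAM
& "nvidia-smi" --query-gpu=memory.used --format=csv,noheader    # verify usage has dropped back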
