@lucataco · Last active April 29, 2024
L40S vs A40 Benchmarks

**Goal**: Run benchmarks of SDXL, SVD, and Llama 2 13B on an L40S test node

**TL;DR**:
- L40S has the same SDXL inference speed as A40
- L40S has ~10% faster Llama 2 inference than A40
- L40S is ~9% faster at video rendering (SVD) than A40

**Process**: Run non-Docker/Cog Python code at fp16
- SDXL:
  https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0
  https://gist.github.com/lucataco/5cace44ef82c616fa7795bb6c202fb14
- Llama2-13b-chat:
  https://gist.github.com/lucataco/8f30c8cb6beb239fba9adb2237a90959
- SVD:
  https://github.com/replicate/cog-svd/commit/09bf70cbc2fcb5a52db23cf433f14c625448ac4e
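
The back-to-back runs in the tables below follow a simple pattern: call the pipeline N times and report total wall-clock time. A minimal sketch of that harness (`bench` is a hypothetical helper, not code from the linked gists; the real workload is the SDXL/SVD/Llama pipeline call):

```python
import time

def bench(fn, runs):
    """Time `runs` back-to-back calls of `fn`, returning total wall-clock seconds."""
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return time.perf_counter() - start

# Hypothetical stand-in workload; the real runs invoke the model pipelines above.
print(f"10x total: {bench(lambda: sum(range(10_000)), runs=10):.3f} s")
```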
**Systems**:
3 VMs, each with a different GPU (and VRAM capacity):
- L4 (24 GB)
- A40 (48 GB)
- L40S (45 GB)

All running CUDA 12.2.
Conda setup:

```bash
conda create -n bench python=3.10
conda activate bench
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt
```
requirements-sdxl.txt:

```
diffusers==0.19.2
torch==2.0.1
transformers==4.31.0
invisible-watermark==0.2.0
accelerate==0.21.0
pandas==2.0.3
torchvision==0.15.2
numpy==1.25.1
fire==0.5.0
opencv-python>=4.1.0.25
mediapipe==0.10.2
```
requirements-svd.txt:

```
black==23.7.0
chardet==5.1.0
clip @ git+https://github.com/openai/CLIP.git
einops>=0.6.1
fairscale>=0.4.13
fsspec>=2023.6.0
invisible-watermark>=0.2.0
kornia==0.6.9
matplotlib>=3.7.2
natsort>=8.4.0
ninja>=1.11.1
numpy>=1.24.4
omegaconf>=2.3.0
open-clip-torch>=2.20.0
opencv-python==4.6.0.66
pandas>=2.0.3
pillow>=9.5.0
pudb>=2022.1.3
pytorch-lightning==2.0.1
pyyaml>=6.0.1
scipy>=1.10.1
streamlit>=0.73.1
tensorboardx==2.6
timm>=0.9.2
tokenizers==0.12.1
torch>=2.0.1
torchdata==0.6.1
torchmetrics>=1.0.1
torchvision>=0.15.2
tqdm>=4.65.0
transformers==4.19.1
triton==2.0.0
urllib3<1.27,>=1.25.4
wandb>=0.15.6
webdataset>=0.2.33
wheel>=0.41.0
xformers>=0.0.20
git+https://github.com/Stability-AI/generative-models.git
```
requirements-llama.txt:

```
accelerate==0.23.0
bitsandbytes==0.41.1
protobuf==3.20.3
scipy==1.11.2
sentencepiece==0.1.99
spaces==0.16.1
torch==2.0.0
transformers==4.34.0
```
**Runs**:

**SDXL**:
Single image tests run back to back

| Runs | L4         | A40       | L40S      |
|------|------------|-----------|-----------|
| 1x   | 31.935 s   | 10.193 s  | 9.676 s   |
| 10x  | 315.453 s  | 91.027 s  | 91.678 s  |
| 100x | 3124.300 s | 907.273 s | 915.423 s |

*Runs are measured in seconds (lower is better)*

L40S is the same speed as A40 for SDXL txt2img inference.
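
As a quick sanity check on the "same speed" claim, the 100x totals from the table put the two GPUs within about 1% of each other (a calculation using only the numbers above):

```python
# 100x totals from the SDXL table (seconds; lower is better)
a40_total, l40s_total = 907.273, 915.423
ratio = l40s_total / a40_total
print(f"L40S/A40 time ratio: {ratio:.3f}")  # ~1.009, i.e. parity within ~1%
```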
**Llama2-13b-chat**:
Single prompt test with varying max new tokens

| Max tokens | L4       | A40       | L40S      |
|------------|----------|-----------|-----------|
| 512        | 1.86 t/s | 52.66 t/s | 58.02 t/s |
| 1024       | 1.84 t/s | 53.72 t/s | 59.28 t/s |
| 2048       | N/A      | 53.48 t/s | 59.42 t/s |

*Runs are measured in tokens per second (higher is better)*

L40S is ~10.5% faster than A40 for Llama 2 inference.
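
The ~10.5% figure is the mean throughput gain across the three max-token settings (a quick calculation from the table above):

```python
# Tokens/sec from the Llama table (higher is better)
a40 = [52.66, 53.72, 53.48]
l40s = [58.02, 59.28, 59.42]
speedups = [l / a - 1 for l, a in zip(l40s, a40)]
mean_speedup = sum(speedups) / len(speedups)
print([f"{s:.1%}" for s in speedups])  # per-row gains, roughly 10.2%, 10.3%, 11.1%
print(f"mean: {mean_speedup:.1%}")     # ~10.5%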
**Stable Video Diffusion**:
Single video tests run back to back
Runs x GPU L4 A40 L40S
1x 183.330 s 66.176 s 59.425 s
10x 1798.206 s 630.390 s 584.991 s
**Runs are measured in seconds - (lower is better)*
L40S are (7.2% to 10.9%) ~9% faster for Video rendering than A40s
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment