@lucataco · Last active April 29, 2024
L40S vs A40 Benchmarks

**Goal**: Run benchmarks of SDXL, SVD, and Llama 2 13B on an L40S test node

**TL;DR**:
- L40S has the same SDXL inference speed as A40
- L40S has ~10% faster Llama 2 inference than A40
- L40S is ~9% faster at video rendering (SVD) than A40

**Process**: Run non-Docker/Cog Python code at fp16
- SDXL:
  https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0
  https://gist.github.com/lucataco/5cace44ef82c616fa7795bb6c202fb14
- Llama2-13b-chat:
  https://gist.github.com/lucataco/8f30c8cb6beb239fba9adb2237a90959
- SVD:
  https://github.com/replicate/cog-svd/commit/09bf70cbc2fcb5a52db23cf433f14c625448ac4e
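
The back-to-back runs in the tables below follow a simple pattern: call the pipeline N times and report total wall-clock time. A minimal sketch of that harness (`bench` is a hypothetical helper, not code from the linked gists; the real workload is the SDXL/SVD/Llama pipeline call):

```python
import time

def bench(fn, runs):
    """Time `runs` back-to-back calls of `fn`, returning total wall-clock seconds."""
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return time.perf_counter() - start

# Hypothetical stand-in workload; the real runs invoke the model pipelines above.
print(f"10x total: {bench(lambda: sum(range(10_000)), runs=10):.3f} s")
```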
**Systems**:
3 VMs, each with a different GPU (and VRAM capacity):
- L4 (24 GB)
- A40 (48 GB)
- L40S (45 GB)

All running CUDA 12.2.
Conda setup:

```bash
conda create -n bench python=3.10
conda activate bench
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt
```
requirements-sdxl.txt:

```
diffusers==0.19.2
torch==2.0.1
transformers==4.31.0
invisible-watermark==0.2.0
accelerate==0.21.0
pandas==2.0.3
torchvision==0.15.2
numpy==1.25.1
fire==0.5.0
opencv-python>=4.1.0.25
mediapipe==0.10.2
```
requirements-svd.txt:

```
black==23.7.0
chardet==5.1.0
clip @ git+https://github.com/openai/CLIP.git
einops>=0.6.1
fairscale>=0.4.13
fsspec>=2023.6.0
invisible-watermark>=0.2.0
kornia==0.6.9
matplotlib>=3.7.2
natsort>=8.4.0
ninja>=1.11.1
numpy>=1.24.4
omegaconf>=2.3.0
open-clip-torch>=2.20.0
opencv-python==4.6.0.66
pandas>=2.0.3
pillow>=9.5.0
pudb>=2022.1.3
pytorch-lightning==2.0.1
pyyaml>=6.0.1
scipy>=1.10.1
streamlit>=0.73.1
tensorboardx==2.6
timm>=0.9.2
tokenizers==0.12.1
torch>=2.0.1
torchdata==0.6.1
torchmetrics>=1.0.1
torchvision>=0.15.2
tqdm>=4.65.0
transformers==4.19.1
triton==2.0.0
urllib3<1.27,>=1.25.4
wandb>=0.15.6
webdataset>=0.2.33
wheel>=0.41.0
xformers>=0.0.20
git+https://github.com/Stability-AI/generative-models.git
```
requirements-llama.txt:

```
accelerate==0.23.0
bitsandbytes==0.41.1
protobuf==3.20.3
scipy==1.11.2
sentencepiece==0.1.99
spaces==0.16.1
torch==2.0.0
transformers==4.34.0
```
**Runs**:

**SDXL**:
Single image tests run back to back

| Runs | L4         | A40       | L40S      |
|------|------------|-----------|-----------|
| 1x   | 31.935 s   | 10.193 s  | 9.676 s   |
| 10x  | 315.453 s  | 91.027 s  | 91.678 s  |
| 100x | 3124.300 s | 907.273 s | 915.423 s |

*Runs are measured in seconds (lower is better)*

L40S is the same speed as A40 for SDXL txt2img inference.
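
As a quick sanity check on the "same speed" claim, the 100x totals from the table put the two GPUs within about 1% of each other (a calculation using only the numbers above):

```python
# 100x totals from the SDXL table (seconds; lower is better)
a40_total, l40s_total = 907.273, 915.423
ratio = l40s_total / a40_total
print(f"L40S/A40 time ratio: {ratio:.3f}")  # ~1.009, i.e. parity within ~1%
```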
**Llama2-13b-chat**:
Single prompt test with varying max new tokens

| Max tokens | L4       | A40       | L40S      |
|------------|----------|-----------|-----------|
| 512        | 1.86 t/s | 52.66 t/s | 58.02 t/s |
| 1024       | 1.84 t/s | 53.72 t/s | 59.28 t/s |
| 2048       | N/A      | 53.48 t/s | 59.42 t/s |

*Runs are measured in tokens per second (higher is better)*

L40S is ~10.5% faster than A40 for Llama 2 inference.
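
The ~10.5% figure is the mean throughput gain across the three max-token settings (a quick calculation from the table above):

```python
# Tokens/sec from the Llama table (higher is better)
a40 = [52.66, 53.72, 53.48]
l40s = [58.02, 59.28, 59.42]
speedups = [l / a - 1 for l, a in zip(l40s, a40)]
mean_speedup = sum(speedups) / len(speedups)
print([f"{s:.1%}" for s in speedups])  # per-row gains, roughly 10.2%, 10.3%, 11.1%
print(f"mean: {mean_speedup:.1%}")     # ~10.5%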
**Stable Video Diffusion**:
Single video tests run back to back
Runs x GPU L4 A40 L40S
1x 183.330 s 66.176 s 59.425 s
10x 1798.206 s 630.390 s 584.991 s
**Runs are measured in seconds - (lower is better)*
L40S are (7.2% to 10.9%) ~9% faster for Video rendering than A40s
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment