Comparison of LLM performance on varied hardware

Based on ggerganov/llama.cpp#4167

LLaMA 7B

PP means "prompt processing" (bs = 512), TG means "text-generation" (bs = 1), t/s means "tokens per second"

| Device | BW [GB/s] | GPU Cores | F16 PP [t/s] | F16 TG [t/s] | Q8_0 PP [t/s] | Q8_0 TG [t/s] | Q4_0 PP [t/s] | Q4_0 TG [t/s] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ✅ M1 Pro 16GB | 200 | 14 | 262.65 | 12.75 | 235.16 | 21.95 | 232.55 | 35.52 |
| M3 Pro 36GB | 150 | 18 | 357.45 | 9.89 | 344.66 | 17.53 | 341.67 | 30.74 |
| 3070 mobile 8GB* | 448 | 5120 | 14.35 | 0.82 | 56.12 | 39.48 | 1735.10 | 64.22 |

* With 8 GB of VRAM, the F16 model and the Q8_0 PP test did not fit in memory

Description

This is a collection of short llama.cpp benchmarks on various hardware configurations, useful for comparing the performance that llama.cpp achieves across devices.

CPU and Apple Silicon (Metal)

# pin to the benchmark commit, build llama-bench, and benchmark the F16, Q8_0 and Q4_0 models
# -p 512: prompt-processing test, -n 128: text-generation test, -ngl 99: offload all layers to the GPU
git checkout 8e672efe
make clean && make -j llama-bench && ./llama-bench \
  -m ./models/llama-7b-v2/ggml-model-f16.gguf  \
  -m ./models/llama-7b-v2/ggml-model-q8_0.gguf \
  -m ./models/llama-7b-v2/ggml-model-q4_0.gguf \
  -p 512 -n 128 -ngl 99 2> /dev/null

CUDA

# same benchmark, built with the cuBLAS backend enabled via LLAMA_CUBLAS=1
# stderr (model-load logging) is discarded so only the results table is printed
git checkout 8e672efe
make clean && LLAMA_CUBLAS=1 make -j llama-bench && ./llama-bench \
  -m ./models/llama-7b-v2/ggml-model-f16.gguf  \
  -m ./models/llama-7b-v2/ggml-model-q8_0.gguf \
  -m ./models/llama-7b-v2/ggml-model-q4_0.gguf \
  -p 512 -n 128 -ngl 99 2> /dev/null

Make sure to run the benchmark on commit 8e672efe so that results are comparable across devices.
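The commands above assume the three GGUF files already exist under ./models/llama-7b-v2/. A minimal sketch of one way to produce them with the convert.py and quantize tools that ship with llama.cpp (the location of the original LLaMA 7B v2 weights is an assumption):

# convert the original weights (assumed to be in ./models/llama-7b-v2/) to an F16 GGUF
python3 convert.py ./models/llama-7b-v2 --outtype f16 --outfile ./models/llama-7b-v2/ggml-model-f16.gguf
# build the quantize tool and derive the Q8_0 and Q4_0 variants from the F16 file
make -j quantize
./quantize ./models/llama-7b-v2/ggml-model-f16.gguf ./models/llama-7b-v2/ggml-model-q8_0.gguf q8_0
./quantize ./models/llama-7b-v2/ggml-model-f16.gguf ./models/llama-7b-v2/ggml-model-q4_0.gguf q4_0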

Tests Used

| model | size | params | test |
| --- | --- | --- | --- |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | pp 512 |
| llama 7B mostly F16 | 12.55 GiB | 6.74 B | tg 128 |
| llama 7B mostly Q8_0 | 6.67 GiB | 6.74 B | pp 512 |
| llama 7B mostly Q8_0 | 6.67 GiB | 6.74 B | tg 128 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | pp 512 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | tg 128 |
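Each row corresponds to one llama-bench test per model: pp 512 comes from -p 512 and tg 128 from -n 128. As a sketch, a pp-only run against the Q4_0 model could look like the following (passing -n 0 to skip the text-generation test is an assumption about llama-bench's behaviour at this commit):

# run only the pp 512 test for the Q4_0 model, with all layers offloaded to the GPU
./llama-bench -m ./models/llama-7b-v2/ggml-model-q4_0.gguf -p 512 -n 0 -ngl 99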
andrewginns commented Jan 12, 2024

Actual performance in use is a mix of PP and TG. Comparing the M1 Pro and M3 Pro machines in the table above, the M1 Pro performs better in TG thanks to its higher memory bandwidth (200 GB/s vs 150 GB/s), while the inverse is true in PP, where the M3 Pro benefits from a GPU core count and architecture advantage.

In actual use with a 7B parameter OpenChat-3.5 model in LM Studio, the M3 Pro performs ~30% better than the M1 Pro on a single question and response, despite this being the worst-case balance of PP to TG for it. In a multi-turn chat, PP grows relative to TG as the chat history becomes part of the prompt input, which favours the M3 Pro further.
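As a rough back-of-the-envelope illustration using the Q4_0 numbers from the table above (and assuming the bs = 512 PP rate also holds for longer prompts), the estimated wall-clock time for a prompt plus 128 generated tokens is:

512-token prompt:  M1 Pro 512/232.55 + 128/35.52 ≈ 5.8 s,   M3 Pro 512/341.67 + 128/30.74 ≈ 5.7 s
2048-token prompt: M1 Pro 2048/232.55 + 128/35.52 ≈ 12.4 s, M3 Pro 2048/341.67 + 128/30.74 ≈ 10.2 s

The longer the prompt (for example an accumulated chat history), the more the M3 Pro's PP advantage outweighs its TG deficit; real prompt and response lengths, quantisation and the LM Studio runtime will shift the exact numbers.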

M1 Pro 200 GB/s, 14 GPU cores - 23.35 W peak
[screenshot: m1-pro-2e-6p-14g-23.35w]

M3 Pro 150 GB/s, 18 GPU cores - 21.19 W peak
[screenshot: m3-pro-6e-6p-18g-21.19w]
