$ ./llama-bench
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | Metal | 99 | pp512 | 813.68 ± 0.45 |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | Metal | 99 | tg128 | 64.44 ± 0.25 |
build: 5921b8f0 (3051)
$ ./llama-bench
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | Metal | 99 | pp512 | 2145.86 ± 5.20 |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | Metal | 99 | tg128 | 136.12 ± 0.16 |
build: 5921b8f0 (3051)
$ LD_LIBRARY_PATH=$(pwd)/lib64 ./bin/llama-bench
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | CUDA | 99 | pp512 | 15309.25 ± 16.06 |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | CUDA | 99 | tg128 | 343.49 ± 2.14 |
build: 5921b8f0 (3051)
C:\Users\Test\LLAMA\OpenCL> set GGML_OPENCL_PLATFORM=AMD
C:\Users\Test\LLAMA\OpenCL> set GGML_OPENCL_DEVICE=1
C:\Users\Test\LLAMA\OpenCL> llama-bench.exe
ggml_opencl: selecting platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: selecting device: 'gfx1035'
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | OpenCL | 99 | pp512 | 147.39 ± 3.85 |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | OpenCL | 99 | tg128 | 15.09 ± 0.19 |
build: 5921b8f0 (3051)
C:\Users\Test\LLAMA\Vulkan> set GGML_VK_VISIBLE_DEVICES=0
C:\Users\Test\LLAMA\Vulkan> llama-bench.exe
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon(TM) Graphics | uma: 1 | fp16: 1 | warp size: 64
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | Vulkan | 99 | pp512 | 721.16 ± 1.19 |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | Vulkan | 99 | tg128 | 46.64 ± 0.17 |
build: 5921b8f0 (3051)
C:\Users\Test\LLAMA\OpenCL> llama-bench.exe
ggml_opencl: selecting platform: 'NVIDIA CUDA'
ggml_opencl: selecting device: 'NVIDIA GeForce RTX 3050 Ti Laptop GPU'
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | OpenCL | 99 | pp512 | 371.89 ± 2.85 |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | OpenCL | 99 | tg128 | 26.94 ± 0.24 |
build: 5921b8f0 (3051)
C:\Users\Test\LLAMA\Vulkan> llama-bench.exe
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: NVIDIA GeForce RTX 3050 Ti Laptop GPU | uma: 0 | fp16: 1 | warp size: 32
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | Vulkan | 99 | pp512 | 1063.51 ± 24.66 |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | Vulkan | 99 | tg128 | 56.70 ± 0.49 |
build: 5921b8f0 (3051)
C:\Users\Test\LLAMA\CUDA12.2> llama-bench.exe
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3050 Ti Laptop GPU, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | CUDA | 99 | pp512 | 3333.78 ± 13.32 |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | CUDA | 99 | tg128 | 169.94 ± 0.86 |
build: 5921b8f0 (3051)
C:\Users\Test\LLAMA\CUDA12.2> llama-bench.exe
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Laptop GPU, compute capability 8.9, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | CUDA | 99 | pp512 | 7770.73 ± 356.52 |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | CUDA | 99 | tg128 | 250.36 ± 5.93 |
build: 5921b8f0 (3051)
C:\Users\Test\LLAMA\SYCL> llama-bench.exe
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc Graphics| 1.3| 128| 1024| 32| 31584M| 1.3.28597|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:128
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | SYCL | 99 | pp512 | 739.92 ± 2.46 |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | SYCL | 99 | tg128 | 36.64 ± 0.13 |
build: 5921b8f0 (3051)
C:\Users\Test\LLAMA\AVX2> llama-bench.exe
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ------------: | ---------------: |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | CPU | 8 | pp512 | 99.35 ± 17.66 |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | CPU | 8 | tg128 | 33.77 ± 0.59 |
build: 5921b8f0 (3051)
C:\Users\Test\LLAMA\AVX2> llama-bench.exe
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ------------: | ---------------: |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | CPU | 11 | pp512 | 89.68 ± 3.78 |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | CPU | 11 | tg128 | 34.85 ± 0.14 |
build: 5921b8f0 (3051)
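For reference, the device-selection environment variables used in the runs above can be collected in one place. This is a sketch; the values shown (platform name, device indices) are examples from my setup, not defaults:

```shell
# Device selection knobs used in the benchmarks above (example values).
export GGML_OPENCL_PLATFORM=AMD      # OpenCL: pick a platform by (partial) name
export GGML_OPENCL_DEVICE=1          # OpenCL: pick a device index within that platform
export GGML_VK_VISIBLE_DEVICES=0     # Vulkan: restrict which devices ggml_vulkan sees
export CUDA_VISIBLE_DEVICES=0        # CUDA: standard NVIDIA device mask

# Show what is set before launching llama-bench
env | grep -E 'GGML_|CUDA_VISIBLE'
```

On Windows, use `set` instead of `export`, as in the transcripts above.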
I ran this benchmark on a few low-end ("potato") laptops, a Mac mini, and a server to see how well each handles ML workloads. This summary is preliminary and will be updated with more testing.
Here are a few things I found:
- SYCL works well and delivers on Intel Arc devices.
- NVIDIA's OpenCL and Vulkan performance is underwhelming; CUDA delivers.
- Macs perform very well for their power envelope.
- AMD does not support ROCm on APUs yet; their OpenCL performance is poor, but Vulkan is much better.
- CPUs are roughly 7-10x slower than GPUs on average.
- Kompute did not work at all out of the box.
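To compare runs programmatically, the `t/s` column can be pulled out of llama-bench's markdown tables with a small script. This is a sketch: the two rows below are copied verbatim from the tables above, and pairing the 8-thread CPU run with the RTX 3050 Ti CUDA run (both from the same laptop) is my assumption; the averaged 7-10x figure comes from looking across all devices, not just this pair.

```python
import re

# Match the trailing "mean ± stddev" t/s column of a llama-bench table row.
ROW_RE = re.compile(r"\|\s*([\d.]+)\s*±\s*([\d.]+)\s*\|\s*$")

def tps(row: str) -> float:
    """Return the mean tokens/second from one llama-bench markdown table row."""
    m = ROW_RE.search(row)
    if m is None:
        raise ValueError(f"no t/s column found in: {row!r}")
    return float(m.group(1))

# Rows copied from the benchmark output above (same laptop, assumed pairing).
cpu_tg = tps("| llama 1B Q4_0 | 606.54 MiB | 1.10 B | CPU | 8 | tg128 | 33.77 ± 0.59 |")
gpu_tg = tps("| llama 1B Q4_0 | 606.54 MiB | 1.10 B | CUDA | 99 | tg128 | 169.94 ± 0.86 |")

print(f"CPU tg128: {cpu_tg} t/s, CUDA tg128: {gpu_tg} t/s")
print(f"speedup: {gpu_tg / cpu_tg:.1f}x")  # prints "speedup: 5.0x" for this pair
```

The same parser works for any backend's rows, so extending it to compute every CPU-vs-GPU ratio in this post is straightforward.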