$ ./llama-bench
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | Metal | 99 | pp512 | 813.68 ± 0.45 |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | Metal | 99 | tg128 | 64.44 ± 0.25 |
build: 5921b8f0 (3051)
$ ./llama-bench
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | Metal | 99 | pp512 | 2145.86 ± 5.20 |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | Metal | 99 | tg128 | 136.12 ± 0.16 |
build: 5921b8f0 (3051)
$ LD_LIBRARY_PATH=$(pwd)/lib64 ./bin/llama-bench
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | CUDA | 99 | pp512 | 15309.25 ± 16.06 |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | CUDA | 99 | tg128 | 343.49 ± 2.14 |
build: 5921b8f0 (3051)
C:\Users\Test\LLAMA\OpenCL> set GGML_OPENCL_PLATFORM=AMD
C:\Users\Test\LLAMA\OpenCL> set GGML_OPENCL_DEVICE=1
C:\Users\Test\LLAMA\OpenCL> llama-bench.exe
ggml_opencl: selecting platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: selecting device: 'gfx1035'
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | OpenCL | 99 | pp512 | 147.39 ± 3.85 |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | OpenCL | 99 | tg128 | 15.09 ± 0.19 |
build: 5921b8f0 (3051)
C:\Users\Test\LLAMA\Vulkan> set GGML_VK_VISIBLE_DEVICES=0
C:\Users\Test\LLAMA\Vulkan> llama-bench.exe
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon(TM) Graphics | uma: 1 | fp16: 1 | warp size: 64
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | Vulkan | 99 | pp512 | 721.16 ± 1.19 |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | Vulkan | 99 | tg128 | 46.64 ± 0.17 |
build: 5921b8f0 (3051)
C:\Users\Test\LLAMA\OpenCL> llama-bench.exe
ggml_opencl: selecting platform: 'NVIDIA CUDA'
ggml_opencl: selecting device: 'NVIDIA GeForce RTX 3050 Ti Laptop GPU'
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | OpenCL | 99 | pp512 | 371.89 ± 2.85 |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | OpenCL | 99 | tg128 | 26.94 ± 0.24 |
build: 5921b8f0 (3051)
C:\Users\Test\LLAMA\Vulkan> llama-bench.exe
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: NVIDIA GeForce RTX 3050 Ti Laptop GPU | uma: 0 | fp16: 1 | warp size: 32
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | Vulkan | 99 | pp512 | 1063.51 ± 24.66 |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | Vulkan | 99 | tg128 | 56.70 ± 0.49 |
build: 5921b8f0 (3051)
C:\Users\Test\LLAMA\CUDA12.2> llama-bench.exe
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3050 Ti Laptop GPU, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | CUDA | 99 | pp512 | 3333.78 ± 13.32 |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | CUDA | 99 | tg128 | 169.94 ± 0.86 |
build: 5921b8f0 (3051)
C:\Users\Test\LLAMA\CUDA12.2> llama-bench.exe
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Laptop GPU, compute capability 8.9, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | CUDA | 99 | pp512 | 7770.73 ± 356.52 |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | CUDA | 99 | tg128 | 250.36 ± 5.93 |
build: 5921b8f0 (3051)
C:\Users\Test\LLAMA\SYCL> llama-bench.exe
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc Graphics| 1.3| 128| 1024| 32| 31584M| 1.3.28597|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:128
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | SYCL | 99 | pp512 | 739.92 ± 2.46 |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | SYCL | 99 | tg128 | 36.64 ± 0.13 |
build: 5921b8f0 (3051)
C:\Users\Test\LLAMA\AVX2> llama-bench.exe
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ------------: | ---------------: |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | CPU | 8 | pp512 | 99.35 ± 17.66 |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | CPU | 8 | tg128 | 33.77 ± 0.59 |
build: 5921b8f0 (3051)
C:\Users\Test\LLAMA\AVX2> llama-bench.exe
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ------------: | ---------------: |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | CPU | 11 | pp512 | 89.68 ± 3.78 |
| llama 1B Q4_0 | 606.54 MiB | 1.10 B | CPU | 11 | tg128 | 34.85 ± 0.14 |
build: 5921b8f0 (3051)
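For reference, the device-selection environment variables used in the runs above can be collected in one place. This is a sketch; the values shown (platform name, device indices) are examples from my setup, not defaults:

```shell
# Device selection knobs used in the benchmarks above (example values).
export GGML_OPENCL_PLATFORM=AMD      # OpenCL: pick a platform by (partial) name
export GGML_OPENCL_DEVICE=1          # OpenCL: pick a device index within that platform
export GGML_VK_VISIBLE_DEVICES=0     # Vulkan: restrict which devices ggml_vulkan sees
export CUDA_VISIBLE_DEVICES=0        # CUDA: standard NVIDIA device mask

# Show what is set before launching llama-bench
env | grep -E 'GGML_|CUDA_VISIBLE'
```

On Windows, use `set` instead of `export`, as in the transcripts above.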
I ran this benchmark on a few low-end ("potato") laptops, a Mac mini, and a server to see how well each handles ML workloads. This summary is preliminary and will be updated with more testing.
Here are a few things I found:
- SYCL works well and delivers on Intel Arc devices.
- NVIDIA's OpenCL and Vulkan performance is underwhelming; CUDA delivers.
- Macs perform very well for their power envelope.
- AMD does not support ROCm on APUs yet; their OpenCL performance is poor, but Vulkan is much better.
- CPUs are roughly 7-10x slower than GPUs on average.
- Kompute did not work at all out of the box.
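To compare runs programmatically, the `t/s` column can be pulled out of llama-bench's markdown tables with a small script. This is a sketch: the two rows below are copied verbatim from the tables above, and pairing the 8-thread CPU run with the RTX 3050 Ti CUDA run (both from the same laptop) is my assumption; the averaged 7-10x figure comes from looking across all devices, not just this pair.

```python
import re

# Match the trailing "mean ± stddev" t/s column of a llama-bench table row.
ROW_RE = re.compile(r"\|\s*([\d.]+)\s*±\s*([\d.]+)\s*\|\s*$")

def tps(row: str) -> float:
    """Return the mean tokens/second from one llama-bench markdown table row."""
    m = ROW_RE.search(row)
    if m is None:
        raise ValueError(f"no t/s column found in: {row!r}")
    return float(m.group(1))

# Rows copied from the benchmark output above (same laptop, assumed pairing).
cpu_tg = tps("| llama 1B Q4_0 | 606.54 MiB | 1.10 B | CPU | 8 | tg128 | 33.77 ± 0.59 |")
gpu_tg = tps("| llama 1B Q4_0 | 606.54 MiB | 1.10 B | CUDA | 99 | tg128 | 169.94 ± 0.86 |")

print(f"CPU tg128: {cpu_tg} t/s, CUDA tg128: {gpu_tg} t/s")
print(f"speedup: {gpu_tg / cpu_tg:.1f}x")  # prints "speedup: 5.0x" for this pair
```

The same parser works for any backend's rows, so extending it to compute every CPU-vs-GPU ratio in this post is straightforward.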