LLAMA Benchmark Comparisons

llama.cpp Version

b3051

Model

TinyLlama 1.1B Chat
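
All runs below use llama-bench's default tests: pp512 (prompt processing, 512 tokens) and tg128 (text generation, 128 tokens). Each invocation amounts to roughly the following; the model filename and download URL are assumptions, and any Q4_0 GGUF export of this model should give comparable numbers:

$ curl -LO https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_0.gguf
$ ./llama-bench -m tinyllama-1.1b-chat-v1.0.Q4_0.gguf -ngl 99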

Results

Mac - M1 Mac Mini - Metal

$ ./llama-bench
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | Metal      |  99 |         pp512 |    813.68 ± 0.45 |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | Metal      |  99 |         tg128 |     64.44 ± 0.25 |

build: 5921b8f0 (3051)

Mac - M3 MacBook Pro - Metal

$ ./llama-bench
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | Metal      |  99 |         pp512 |   2145.86 ± 5.20 |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | Metal      |  99 |         tg128 |    136.12 ± 0.16 |

build: 5921b8f0 (3051)

Linux - Dual RTX A6000 - CUDA

$ LD_LIBRARY_PATH=$(pwd)/lib64 ./bin/llama-bench 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
  Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CUDA       |  99 |         pp512 | 15309.25 ± 16.06 |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CUDA       |  99 |         tg128 |    343.49 ± 2.14 |

build: 5921b8f0 (3051)
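
To isolate one of the two cards, the standard CUDA device mask works with the same binary (a sketch; the numbers above were taken with both devices visible):

$ CUDA_VISIBLE_DEVICES=0 LD_LIBRARY_PATH=$(pwd)/lib64 ./bin/llama-bench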

Windows - AMD Radeon 680M - OpenCL

C:\Users\Test\LLAMA\OpenCL> set GGML_OPENCL_PLATFORM=AMD
C:\Users\Test\LLAMA\OpenCL> set GGML_OPENCL_DEVICE=1
C:\Users\Test\LLAMA\OpenCL> llama-bench.exe
ggml_opencl: selecting platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: selecting device: 'gfx1035'
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | OpenCL     |  99 |         pp512 |    147.39 ± 3.85 |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | OpenCL     |  99 |         tg128 |     15.09 ± 0.19 |

build: 5921b8f0 (3051)

Windows - AMD Radeon 680M - Vulkan

C:\Users\Test\LLAMA\Vulkan> set GGML_VK_VISIBLE_DEVICES=0
C:\Users\Test\LLAMA\Vulkan> llama-bench.exe
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon(TM) Graphics | uma: 1 | fp16: 1 | warp size: 64
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | Vulkan     |  99 |         pp512 |    721.16 ± 1.19 |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | Vulkan     |  99 |         tg128 |     46.64 ± 0.17 |

build: 5921b8f0 (3051)

Windows - RTX 3050 Ti 4GB Laptop - OpenCL

C:\Users\Test\LLAMA\OpenCL> llama-bench.exe
ggml_opencl: selecting platform: 'NVIDIA CUDA'
ggml_opencl: selecting device: 'NVIDIA GeForce RTX 3050 Ti Laptop GPU'
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | OpenCL     |  99 |         pp512 |    371.89 ± 2.85 |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | OpenCL     |  99 |         tg128 |     26.94 ± 0.24 |

build: 5921b8f0 (3051)

Windows - RTX 3050 Ti 4GB Laptop - Vulkan

C:\Users\Test\LLAMA\Vulkan> llama-bench.exe
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: NVIDIA GeForce RTX 3050 Ti Laptop GPU | uma: 0 | fp16: 1 | warp size: 32
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | Vulkan     |  99 |         pp512 |  1063.51 ± 24.66 |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | Vulkan     |  99 |         tg128 |     56.70 ± 0.49 |

build: 5921b8f0 (3051)

Windows - RTX 3050 Ti 4GB Laptop - CUDA

C:\Users\Test\LLAMA\CUDA12.2> llama-bench.exe
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3050 Ti Laptop GPU, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CUDA       |  99 |         pp512 |  3333.78 ± 13.32 |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CUDA       |  99 |         tg128 |    169.94 ± 0.86 |

build: 5921b8f0 (3051)

Windows - RTX 4070 8GB Laptop - CUDA

C:\Users\Test\LLAMA\CUDA12.2> llama-bench.exe
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070 Laptop GPU, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CUDA       |  99 |         pp512 | 7770.73 ± 356.52 |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CUDA       |  99 |         tg128 |    250.36 ± 5.93 |

build: 5921b8f0 (3051)

Windows - Intel Arc Meteor Lake Laptop - SYCL

C:\Users\Test\LLAMA\SYCL> llama-bench.exe
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
| ID | Device Type        | Name               | Version | Max compute units | Max work group | Max sub group | Global mem size | Driver version |
|----|--------------------|--------------------|---------|-------------------|----------------|---------------|-----------------|----------------|
|  0 | [level_zero:gpu:0] | Intel Arc Graphics |     1.3 |               128 |           1024 |            32 |          31584M |      1.3.28597 |
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:128
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | SYCL       |  99 |         pp512 |    739.92 ± 2.46 |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | SYCL       |  99 |         tg128 |     36.64 ± 0.13 |

build: 5921b8f0 (3051)
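
(Note: the SYCL binary expects the Intel oneAPI runtime to be on the path first; assuming a default oneAPI install location, something like the following before benchmarking:)

C:\Users\Test\LLAMA\SYCL> call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
C:\Users\Test\LLAMA\SYCL> llama-bench.exe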

Windows - AMD Ryzen 9 6900HS - CPU

C:\Users\Test\LLAMA\AVX2> llama-bench.exe
| model                          |       size |     params | backend    |    threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ------------: | ---------------: |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CPU        |          8 |         pp512 |    99.35 ± 17.66 |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CPU        |          8 |         tg128 |     33.77 ± 0.59 |

build: 5921b8f0 (3051)

Windows - Intel Core Ultra 7 155H - CPU

C:\Users\Test\LLAMA\AVX2> llama-bench.exe
| model                          |       size |     params | backend    |    threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ------------: | ---------------: |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CPU        |         11 |         pp512 |     89.68 ± 3.78 |
| llama 1B Q4_0                  | 606.54 MiB |     1.10 B | CPU        |         11 |         tg128 |     34.85 ± 0.14 |

build: 5921b8f0 (3051)
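
llama-bench picks a default thread count on CPU; it can also sweep several counts in one run, which is handy for comparing these two machines (the values here are illustrative):

C:\Users\Test\LLAMA\AVX2> llama-bench.exe -t 4,8,16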

Summary

I ran this benchmark on potato laptops, a Mac Mini, and a server to see how well they perform on ML tasks. This summary is preliminary and will be updated with more testing.

Here are a few things I found (a per-backend build sketch follows the list):

  • SYCL is good and delivers on Intel Arc devices.
  • NVIDIA's OpenCL and Vulkan performance is not great; CUDA delivers.
  • Macs are great for their power envelope.
  • AMD does not do ROCm on APUs yet; their OpenCL performance is not great, and Vulkan is much better.
  • CPUs are about 7-10x slower than GPUs on average.
  • Kompute did not work at all out of the box.
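
For reproduction, each binary was presumably built with the matching backend switch, one per build. A sketch using the LLAMA_* CMake options as they existed around b3051 (these were later renamed to GGML_*; treat the exact flag names as assumptions):

$ cmake -B build -DLLAMA_METAL=ON     # Metal (macOS)
$ cmake -B build -DLLAMA_CUDA=ON      # CUDA (NVIDIA)
$ cmake -B build -DLLAMA_VULKAN=ON    # Vulkan
$ cmake -B build -DLLAMA_CLBLAST=ON   # OpenCL via CLBlast
$ cmake -B build -DLLAMA_SYCL=ON      # SYCL (Intel oneAPI)
$ cmake -B build -DLLAMA_KOMPUTE=ON   # Kompute
$ cmake --build build --config Release --target llama-bench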