Skip to content

Instantly share code, notes, and snippets.

@adriancbo
Forked from Artefact2/README.md
Created March 20, 2024 09:25
Show Gist options
  • Save adriancbo/4681640522267a4360e3337fc46d46f2 to your computer and use it in GitHub Desktop.
Save adriancbo/4681640522267a4360e3337fc46d46f2 to your computer and use it in GitHub Desktop.
GGUF quantizations overview

Which GGUF is right for me? (Opinionated)

Good question! I am collecting human data on how quantization affects outputs. See here for more information: ggerganov/llama.cpp#5962

In the meantime, use the largest that fully fits in your GPU. If you can comfortably fit Q4_K_S, try using a model with more parameters.

llama.cpp feature matrix

See the wiki upstream: https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix

KL-divergence statistics for Mistral-7B

  • Last updated 2024-02-27 (add IQ4_XS).
  • imatrix from wiki.train, 200*512 tokens.
  • KL-divergence measured on wiki.test.

image

Bits per weight KL-divergence median KL-divergence q99 Top tokens differ ln(PPL(Q)/PPL(base))
IQ1_S 1.78 0.5495 5.5174 0.3840 0.9235
IQ2_XXS 2.20 0.1751 2.4983 0.2313 0.2988
IQ2_XS 2.43 0.1146 1.7693 0.1943 0.2046
IQ2_S 2.55 0.0949 1.6284 0.1806 0.1722
IQ2_M 2.76 0.0702 1.0935 0.1557 0.1223
Q2_K_S 2.79 0.0829 1.5111 0.1735 0.1600
Q2_K 3.00 0.0588 1.0337 0.1492 0.1103
IQ3_XXS 3.21 0.0330 0.5492 0.1137 0.0589
IQ3_XS 3.32 0.0296 0.4550 0.1071 0.0458
Q3_K_S 3.50 0.0304 0.4481 0.1068 0.0511
IQ3_S 3.52 0.0205 0.3018 0.0895 0.0306
IQ3_M 3.63 0.0186 0.2740 0.0859 0.0268
Q3_K_M 3.89 0.0171 0.2546 0.0839 0.0258
Q3_K_L 4.22 0.0152 0.2202 0.0797 0.0205
IQ4_XS 4.32 0.0088 0.1082 0.0606 0.0079
IQ4_NL 4.56 0.0085 0.1077 0.0605 0.0074
Q4_K_S 4.57 0.0083 0.1012 0.0600 0.0081
Q4_K_M 4.83 0.0075 0.0885 0.0576 0.0060
Q5_K_S 5.52 0.0045 0.0393 0.0454 0.0005
Q5_K_M 5.67 0.0043 0.0368 0.0444 0.0005
Q6_K 6.57 0.0032 0.0222 0.0394 −0.0008

ROCm benchmarks for Mistral-7B

  • Last updated 2024-03-15 (bench #6083).

image

GiB pp512 -ngl 99 tg128 -ngl 99 pp512 -ngl 0 tg128 -ngl 0 pp512 -ngl 0 #6083
IQ1_S 1.50 709.29 74.85 324.35 15.66 585.61
IQ2_XS 2.05 704.52 58.44 316.10 15.11 557.68
IQ3_XS 2.79 682.72 45.79 300.61 10.49 527.83
IQ4_XS 3.64 712.96 64.17 292.36 11.06 495.92
Q4_0 3.83 870.44 63.42 310.94 10.44 554.56
Q5_K 4.78 691.40 46.52 273.83 8.54 453.58
Q6_K 5.53 661.98 47.57 261.16 7.34 415.22
Q8_0 7.17 881.95 39.74 270.70 5.74 440.44
f16 13.49 211.12 3.06 303.60
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment