@Artefact2
Last active September 6, 2024 17:03
GGUF quantizations overview

Which GGUF is right for me? (Opinionated)

Good question! I am collecting human data on how quantization affects outputs. See here for more information: ggerganov/llama.cpp#5962

In the meantime, use the largest quantization that fully fits in your GPU's memory. If you can comfortably fit Q4_K_S, try a model with more parameters.
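
As a rough illustration of that rule of thumb, here is a minimal sketch that estimates weight size from bits per weight and picks the largest quantization that fits. The bits-per-weight values are copied from the KL-divergence table in this gist; the parameter count (7.24e9) and the `reserve_gib` allowance for KV cache and buffers are assumptions, and the helper names are hypothetical, not part of llama.cpp.

```python
# Bits per weight for a few quantization types, copied from the
# KL-divergence table in this gist (measured on Mistral-7B).
BITS_PER_WEIGHT = {
    "IQ2_XS": 2.43,
    "IQ3_XS": 3.32,
    "IQ4_XS": 4.32,
    "Q4_K_S": 4.57,
    "Q4_K_M": 4.83,
    "Q5_K_M": 5.67,
    "Q6_K": 6.57,
}

GIB = 1024 ** 3


def estimated_size_gib(n_params: float, bpw: float) -> float:
    """Approximate weight size in GiB: parameters * bits per weight / 8."""
    return n_params * bpw / 8 / GIB


def largest_fitting_quant(n_params: float, vram_gib: float, reserve_gib: float = 1.5):
    """Return the highest-bpw quant whose weights fit in VRAM minus a reserve.

    reserve_gib is a hypothetical allowance for KV cache and scratch buffers;
    the real requirement depends on context length and backend.
    """
    budget = vram_gib - reserve_gib
    fitting = [
        (bpw, name)
        for name, bpw in BITS_PER_WEIGHT.items()
        if estimated_size_gib(n_params, bpw) <= budget
    ]
    # max() on (bpw, name) tuples selects the entry with the most bits per weight.
    return max(fitting)[1] if fitting else None


if __name__ == "__main__":
    n_params = 7.24e9  # assumed parameter count for a Mistral-7B-class model
    for vram in (4, 6, 8):
        print(vram, "GiB ->", largest_fitting_quant(n_params, vram))
```

The estimate ignores per-file metadata and any tensors kept at higher precision, so treat it as a first guess and check the actual file size before downloading.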

llama.cpp feature matrix

See the wiki upstream: https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix

KL-divergence statistics for Mistral-7B

  • Last updated 2024-02-27 (add IQ4_XS).
  • imatrix from wiki.train, 200*512 tokens.
  • KL-divergence measured on wiki.test.

| Quant | Bits per weight | KL-divergence median | KL-divergence q99 | Top tokens differ | ln(PPL(Q)/PPL(base)) |
|---|---|---|---|---|---|
| IQ1_S | 1.78 | 0.5495 | 5.5174 | 0.3840 | 0.9235 |
| IQ2_XXS | 2.20 | 0.1751 | 2.4983 | 0.2313 | 0.2988 |
| IQ2_XS | 2.43 | 0.1146 | 1.7693 | 0.1943 | 0.2046 |
| IQ2_S | 2.55 | 0.0949 | 1.6284 | 0.1806 | 0.1722 |
| IQ2_M | 2.76 | 0.0702 | 1.0935 | 0.1557 | 0.1223 |
| Q2_K_S | 2.79 | 0.0829 | 1.5111 | 0.1735 | 0.1600 |
| Q2_K | 3.00 | 0.0588 | 1.0337 | 0.1492 | 0.1103 |
| IQ3_XXS | 3.21 | 0.0330 | 0.5492 | 0.1137 | 0.0589 |
| IQ3_XS | 3.32 | 0.0296 | 0.4550 | 0.1071 | 0.0458 |
| Q3_K_S | 3.50 | 0.0304 | 0.4481 | 0.1068 | 0.0511 |
| IQ3_S | 3.52 | 0.0205 | 0.3018 | 0.0895 | 0.0306 |
| IQ3_M | 3.63 | 0.0186 | 0.2740 | 0.0859 | 0.0268 |
| Q3_K_M | 3.89 | 0.0171 | 0.2546 | 0.0839 | 0.0258 |
| Q3_K_L | 4.22 | 0.0152 | 0.2202 | 0.0797 | 0.0205 |
| IQ4_XS | 4.32 | 0.0088 | 0.1082 | 0.0606 | 0.0079 |
| IQ4_NL | 4.56 | 0.0085 | 0.1077 | 0.0605 | 0.0074 |
| Q4_K_S | 4.57 | 0.0083 | 0.1012 | 0.0600 | 0.0081 |
| Q4_K_M | 4.83 | 0.0075 | 0.0885 | 0.0576 | 0.0060 |
| Q5_K_S | 5.52 | 0.0045 | 0.0393 | 0.0454 | 0.0005 |
| Q5_K_M | 5.67 | 0.0043 | 0.0368 | 0.0444 | 0.0005 |
| Q6_K | 6.57 | 0.0032 | 0.0222 | 0.0394 | −0.0008 |
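
For readers unfamiliar with these columns, the following is a minimal sketch of how such statistics could be computed from per-position logits of the base and quantized models. It is not llama.cpp's implementation. It assumes that "Top tokens differ" is the fraction of positions where the two models' most likely token disagrees, and it uses the identity that ln(PPL(Q)/PPL(base)) equals the difference in mean negative log-likelihood. All function names and the toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(base_logits, quant_logits):
    """KL(P_base || P_quant) in nats for one token position."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def argmax(xs):
    """Index of the largest element."""
    return max(range(len(xs)), key=xs.__getitem__)


def mean_nll(logits, targets):
    """Mean negative log-likelihood of the true next tokens."""
    return -sum(log_softmax(l)[t] for l, t in zip(logits, targets)) / len(targets)


def summarize(base, quant, targets):
    """base/quant: per-position logit lists; targets: true next-token ids."""
    kls = [kl_divergence(b, q) for b, q in zip(base, quant)]
    differ = sum(argmax(b) != argmax(q) for b, q in zip(base, quant)) / len(base)
    return {
        "kl_median": quantile(kls, 0.5),
        "kl_q99": quantile(kls, 0.99),
        "top_differ": differ,
        # PPL = exp(mean NLL), so the log ratio is a difference of mean NLLs.
        "ln_ppl_ratio": mean_nll(quant, targets) - mean_nll(base, targets),
    }


if __name__ == "__main__":
    # Toy logits over a 3-token vocabulary at two positions (made-up values).
    base = [[2.0, 1.0, 0.0], [0.0, 3.0, 1.0]]
    quant = [[2.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
    print(summarize(base, quant, targets=[0, 1]))
```

Read this way, a ln(PPL(Q)/PPL(base)) of 0.9235 for IQ1_S corresponds to a perplexity roughly e^0.9235 ≈ 2.5 times that of the base model, while values near zero mean perplexity is essentially unchanged.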

ROCm benchmarks for Mistral-7B

  • Last updated 2024-03-15 (bench #6083).

| Quant | GiB | pp512 -ngl 99 | tg128 -ngl 99 | pp512 -ngl 0 | tg128 -ngl 0 | pp512 -ngl 0 #6083 |
|---|---|---|---|---|---|---|
| IQ1_S | 1.50 | 709.29 | 74.85 | 324.35 | 15.66 | 585.61 |
| IQ2_XS | 2.05 | 704.52 | 58.44 | 316.10 | 15.11 | 557.68 |
| IQ3_XS | 2.79 | 682.72 | 45.79 | 300.61 | 10.49 | 527.83 |
| IQ4_XS | 3.64 | 712.96 | 64.17 | 292.36 | 11.06 | 495.92 |
| Q4_0 | 3.83 | 870.44 | 63.42 | 310.94 | 10.44 | 554.56 |
| Q5_K | 4.78 | 691.40 | 46.52 | 273.83 | 8.54 | 453.58 |
| Q6_K | 5.53 | 661.98 | 47.57 | 261.16 | 7.34 | 415.22 |
| Q8_0 | 7.17 | 881.95 | 39.74 | 270.70 | 5.74 | 440.44 |
| f16 | 13.49 | | | 211.12 | 3.06 | 303.60 |
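
To make the effect of #6083 easier to read off, here is a small sketch that computes the ratio between the two `pp512 -ngl 0` columns for the quantized rows. It assumes the last column is the same prompt-processing benchmark rerun with PR #6083 applied, and that all figures are tokens per second; neither is stated explicitly in the table.

```python
# (quant, pp512 -ngl 0, pp512 -ngl 0 with #6083), copied from the table above.
# Units are assumed to be tokens per second.
ROWS = [
    ("IQ1_S", 324.35, 585.61),
    ("IQ2_XS", 316.10, 557.68),
    ("IQ3_XS", 300.61, 527.83),
    ("IQ4_XS", 292.36, 495.92),
    ("Q4_0", 310.94, 554.56),
    ("Q5_K", 273.83, 453.58),
    ("Q6_K", 261.16, 415.22),
    ("Q8_0", 270.70, 440.44),
]


def speedups(rows):
    """Map each quant name to (throughput with #6083) / (throughput without)."""
    return {name: after / before for name, before, after in rows}


if __name__ == "__main__":
    for name, ratio in speedups(ROWS).items():
        print(f"{name:7s} {ratio:.2f}x")
```

Under those assumptions, every quantized row shows a speedup between roughly 1.6x and 1.8x for CPU-only prompt processing.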

cha0sbuster commented May 7, 2024

@cha0sbuster Hi man, I have read your writing, you did a great job~! I think you could even turn your Rentry note into an article on Medium. I'm sure people like me, who know a bit about AI but get confused and frustrated by the naming of LLMs, will appreciate your great explanation :)

thanks! ^^ I used Rentry because it's standard in the parts of the community I frequent, and my presence on Medium is ill-fit for this (I write about music there already.)

One more question, I noticed that when you use "I", you use it together with "Q", but when you use "K", you mentioned it alone. Why?

Since they're split by underscore it's more common to say "IQ" or "K". It's just convention.

franva commented May 8, 2024

okiee, thank you !!! All done.

Weroxig commented May 21, 2024

Will you be able to check the numbers for a larger model too? Some people say that larger models (say, any 70B+ model) are impacted less by quantization than small models.