@Artefact2
Last active April 30, 2024 17:18
GGUF quantizations overview

Which GGUF is right for me? (Opinionated)

Good question! I am collecting human data on how quantization affects outputs. See here for more information: ggerganov/llama.cpp#5962

In the meantime, use the largest quantization that fully fits in your GPU's memory. If you can comfortably fit Q4_K_S, try using a model with more parameters.
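As a rough illustration of "largest that fits", here is a minimal Python sketch that estimates weight size from bits per weight and picks the biggest quantization under a memory budget. The bits-per-weight values come from the KL-divergence table in this gist; the parameter count (about 7.24e9 for Mistral-7B), the helper names, and the 1.5 GiB reserve for KV cache and compute buffers are assumptions made for the example, not measured values.

```python
# Bits per weight, copied from the KL-divergence table in this gist.
BITS_PER_WEIGHT = {
    "IQ2_XS": 2.43,
    "IQ3_XS": 3.32,
    "IQ4_XS": 4.32,
    "Q4_K_S": 4.57,
    "Q4_K_M": 4.83,
    "Q5_K_M": 5.67,
    "Q6_K": 6.57,
}

GIB = 1024 ** 3


def model_size_gib(n_params, bpw):
    """Approximate size of the quantized weights in GiB."""
    return n_params * bpw / 8 / GIB


def largest_fitting_quant(n_params, vram_gib, reserve_gib):
    """Return the highest-bpw quant whose weights fit in the budget, or None.

    reserve_gib is a hypothetical allowance for KV cache and compute buffers.
    """
    budget = vram_gib - reserve_gib
    fitting = [
        (bpw, name)
        for name, bpw in BITS_PER_WEIGHT.items()
        if model_size_gib(n_params, bpw) <= budget
    ]
    return max(fitting)[1] if fitting else None


# Assumption: ~7.24e9 parameters for Mistral-7B.
N_PARAMS = 7.24e9

print(round(model_size_gib(N_PARAMS, 4.32), 2))  # IQ4_XS weight size in GiB
print(largest_fitting_quant(N_PARAMS, vram_gib=6.0, reserve_gib=1.5))
```

The estimate ignores metadata and any tensors kept at higher precision, so treat it as a first guess and check the actual file size before downloading.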

llama.cpp feature matrix

See the wiki upstream: https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix

KL-divergence statistics for Mistral-7B

  • Last updated 2024-02-27 (add IQ4_XS).
  • imatrix from wiki.train, 200*512 tokens.
  • KL-divergence measured on wiki.test.


| Quantization | Bits per weight | KL-divergence median | KL-divergence q99 | Top tokens differ | ln(PPL(Q)/PPL(base)) |
|---|---|---|---|---|---|
| IQ1_S | 1.78 | 0.5495 | 5.5174 | 0.3840 | 0.9235 |
| IQ2_XXS | 2.20 | 0.1751 | 2.4983 | 0.2313 | 0.2988 |
| IQ2_XS | 2.43 | 0.1146 | 1.7693 | 0.1943 | 0.2046 |
| IQ2_S | 2.55 | 0.0949 | 1.6284 | 0.1806 | 0.1722 |
| IQ2_M | 2.76 | 0.0702 | 1.0935 | 0.1557 | 0.1223 |
| Q2_K_S | 2.79 | 0.0829 | 1.5111 | 0.1735 | 0.1600 |
| Q2_K | 3.00 | 0.0588 | 1.0337 | 0.1492 | 0.1103 |
| IQ3_XXS | 3.21 | 0.0330 | 0.5492 | 0.1137 | 0.0589 |
| IQ3_XS | 3.32 | 0.0296 | 0.4550 | 0.1071 | 0.0458 |
| Q3_K_S | 3.50 | 0.0304 | 0.4481 | 0.1068 | 0.0511 |
| IQ3_S | 3.52 | 0.0205 | 0.3018 | 0.0895 | 0.0306 |
| IQ3_M | 3.63 | 0.0186 | 0.2740 | 0.0859 | 0.0268 |
| Q3_K_M | 3.89 | 0.0171 | 0.2546 | 0.0839 | 0.0258 |
| Q3_K_L | 4.22 | 0.0152 | 0.2202 | 0.0797 | 0.0205 |
| IQ4_XS | 4.32 | 0.0088 | 0.1082 | 0.0606 | 0.0079 |
| IQ4_NL | 4.56 | 0.0085 | 0.1077 | 0.0605 | 0.0074 |
| Q4_K_S | 4.57 | 0.0083 | 0.1012 | 0.0600 | 0.0081 |
| Q4_K_M | 4.83 | 0.0075 | 0.0885 | 0.0576 | 0.0060 |
| Q5_K_S | 5.52 | 0.0045 | 0.0393 | 0.0454 | 0.0005 |
| Q5_K_M | 5.67 | 0.0043 | 0.0368 | 0.0444 | 0.0005 |
| Q6_K | 6.57 | 0.0032 | 0.0222 | 0.0394 | −0.0008 |
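To make the column names concrete, here is a minimal Python sketch of how such statistics can be computed from per-token logits of a base model and a quantized model. It is not the tool that produced the table: the function names and the toy logits are made up for illustration, and reading "Top tokens differ" as the fraction of positions whose most likely token changes is an assumption. The last column follows from the definition of perplexity: ln(PPL(Q)/PPL(base)) equals the difference in mean negative log-likelihood.

```python
import math
import statistics


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(base_logits, quant_logits):
    """KL(P_base || P_quant) in nats for one token position."""
    lp = log_softmax(base_logits)
    lq = log_softmax(quant_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def argmax(xs):
    """Index of the largest element."""
    return max(range(len(xs)), key=xs.__getitem__)


def mean_nll(logit_rows, targets):
    """Mean negative log-likelihood of the true next tokens (ln of perplexity)."""
    return -sum(log_softmax(r)[t] for r, t in zip(logit_rows, targets)) / len(targets)


def compare(base, quant, targets):
    """Summarize how far the quantized model drifts from the base model."""
    kls = [kl_divergence(b, q) for b, q in zip(base, quant)]
    return {
        "kl_median": statistics.median(kls),
        "kl_q99": statistics.quantiles(kls, n=100, method="inclusive")[98],
        "top_differ": sum(argmax(b) != argmax(q) for b, q in zip(base, quant)) / len(base),
        "ln_ppl_ratio": mean_nll(quant, targets) - mean_nll(base, targets),
    }


# Toy data: 3 token positions, vocabulary of 3 tokens (illustrative only).
base = [[2.0, 1.0, 0.0], [0.0, 3.0, 1.0], [1.0, 1.0, 4.0]]
quant = [[2.1, 0.9, 0.0], [0.0, 1.0, 3.0], [1.0, 1.2, 3.9]]
targets = [0, 1, 2]

print(compare(base, quant, targets))
```

A real measurement would run over every token of the test set, so the median and q99 describe the typical and the near-worst-case token respectively.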

ROCm benchmarks for Mistral-7B

  • Last updated 2024-03-15 (bench #6083).


| Quantization | GiB | pp512 -ngl 99 | tg128 -ngl 99 | pp512 -ngl 0 | tg128 -ngl 0 | pp512 -ngl 0 #6083 |
|---|---|---|---|---|---|---|
| IQ1_S | 1.50 | 709.29 | 74.85 | 324.35 | 15.66 | 585.61 |
| IQ2_XS | 2.05 | 704.52 | 58.44 | 316.10 | 15.11 | 557.68 |
| IQ3_XS | 2.79 | 682.72 | 45.79 | 300.61 | 10.49 | 527.83 |
| IQ4_XS | 3.64 | 712.96 | 64.17 | 292.36 | 11.06 | 495.92 |
| Q4_0 | 3.83 | 870.44 | 63.42 | 310.94 | 10.44 | 554.56 |
| Q5_K | 4.78 | 691.40 | 46.52 | 273.83 | 8.54 | 453.58 |
| Q6_K | 5.53 | 661.98 | 47.57 | 261.16 | 7.34 | 415.22 |
| Q8_0 | 7.17 | 881.95 | 39.74 | 270.70 | 5.74 | 440.44 |
| f16 | 13.49 | | | 211.12 | 3.06 | 303.60 |
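One hedged way to read the tg128 -ngl 99 column: if token generation streams every weight once per token (a common rule of thumb for dense models, not something this gist states), then file size times tokens per second gives an effective memory throughput. The sketch below only does that arithmetic on a few rows copied from the table; the function name is made up for the example.

```python
# (file size in GiB, tg128 tokens/s with -ngl 99), copied from the table above.
ROWS = {
    "IQ1_S": (1.50, 74.85),
    "IQ4_XS": (3.64, 64.17),
    "Q4_0": (3.83, 63.42),
    "Q6_K": (5.53, 47.57),
    "Q8_0": (7.17, 39.74),
}


def effective_bandwidth_gib_s(size_gib, tokens_per_s):
    """GiB of weights read per second, assuming one full pass per token."""
    return size_gib * tokens_per_s


for name, (size_gib, tps) in ROWS.items():
    print(f"{name:7s} {effective_bandwidth_gib_s(size_gib, tps):6.1f} GiB/s")
```

Under that assumption, a format whose figure sits well below the others is probably limited by dequantization work rather than by memory bandwidth, which would explain why the smallest files are not proportionally faster.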
@tau0-deltav

tau0-deltav commented Mar 3, 2024

Thank you! I have a question for you and advice for everyone else:
Question:
"I am partially offloading (running on CPU+GPU): use Q4_K_S"

  • What about the 2- and 3-bit regular K quants? I know they're slower, but if I truly have no more VRAM, do I want the FFN from these on the CPU, or fewer IQ layers? IQ is more expensive to calculate, but idk if the hidden state getting squeezed through the PCIe tubes is any smaller. Could depend on where the bottleneck is?

  • Is IQ4_NL possibly faster and better? I thought it was supposed to be like Q4_0, which definitely makes CPUs happy?

Nuance:
Fitting as much as possible on the GPU involves:

  • leaving room on the GPU for the compute buffer (it grows with the square of the batch size, while the suffering during long-context pre-generation grows with 1/batch size; get those tavern settings right while the story's short :) )
  • leaving room on the GPU for the linear layers.

but it matters much less (i.e. you don't need to):

  • leave room on the GPU for the KV cache. This is what context length sets the size of. It is a record of attention.
  • leave room on the GPU for the Final Layer. Formerly these layers were separate (3 of them, IIRC) and had names. Now all I know is that the attention weights are in there.
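For a sense of scale on the KV cache point, here is a minimal Python sketch of the usual size formula: two tensors (K and V) per layer, each holding one vector per KV head per token. The Mistral-7B shape used here (32 layers, 8 KV heads, head dimension 128) and the f16 cache element size are assumptions for the example, and the function name is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Bytes needed to hold K and V for n_ctx tokens (f16 elements by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Assumed Mistral-7B shape; adjust for other models.
MISTRAL_7B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n_ctx in (4096, 8192, 32768):
    mib = kv_cache_bytes(n_ctx=n_ctx, **MISTRAL_7B) / 2**20
    print(f"n_ctx={n_ctx:6d}  KV cache ~ {mib:.0f} MiB")
```

The size grows linearly with context length, which is why a long context can claim as much memory as several offloaded layers.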

Without getting technical (because this got changed: there used to be 3 of these layers, IIRC? Is one of them lm_head? Does it get bigger if you --leave-output-tensor? (??)), keeping these two, the KV cache and the Last Layer (seriously, subtract 1 from the total number of layers you see when you do a full offload, or however many non-repeating non-linear layers there are), in RAM together doesn't add much slowdown compared to keeping just one or the other there**. But both are very large and, relatively, fairly light on CPU calculations.

This matters more with bigger models (where there are more layers) and with deeper quantization (where the layers are smaller in terms of memory usage), because these other two* become bigger and bigger contributors.

none of this would matter at all if llama.cpp would grow some DeepSpeed style architectural Grit and start shuffling the actual parameters out at inference time. Especially when the models have so many layers and they're each so tiny - is moving a few 50MB IQ2 layers from a 120B model up and down the PCIe bus once per token really too slow to countenance? show us your war face and 842 them as well. yes YES. ROOFLINE IT BROTHER. HIT THAT ARITHMETIC INTENSITY LEVER AND JUST BROTLI-G RAW BF16 OUT OF NVME UNTIL 90% OF THE GPU IS DOING DECOMPRESSION YOU FILT-

aight peace

*for mixtral none of this is really more true than it is of mistral. you won't save gigabytes here if you wouldn't with mistral. KV and Last Layer both.
**i don't have a memory of explicitly testing this but I remember quickly learning it just by fiddling trying to fit a miqu on a 3090

@complexinteractive

I see this chart is for Mistral 7b. Would there be a meaningful difference in the same chart done with a larger model, perhaps a 70b? It's my understanding that performance at low BPW scales up with parameter count, so I'd be curious to see how the graph changes.

@Artefact2
Author

@complexinteractive Good question. Probably similar, but I don't have the hardware to generate KL-divergence on a 70b model.

@Mayorc1978

> @complexinteractive Good question. Probably similar, but I don't have the hardware to generate KL-divergence on a 70b model.

Any chance of seeing you generate KL-divergence stats for models in the 30B-to-Mixtral range?

@Artefact2
Author

Artefact2 commented Mar 18, 2024

Running unquantized mixtral would take over 180 GB of memory, which I don't have (GGUF can't store tensors in BF16).
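One plausible way to arrive at a figure above 180 GB, sketched in Python: if BF16 tensors cannot be stored, the unquantized weights would presumably be held as 32-bit floats. Both that reading and the total parameter count used here (about 46.7e9 for Mixtral 8x7B) are assumptions for the example rather than something stated in the comment.

```python
def weights_gb(n_params, bytes_per_param):
    """Memory for the weights alone, in GB (1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Assumption: total parameter count of Mixtral 8x7B.
MIXTRAL_PARAMS = 46.7e9

print(f"f32:  {weights_gb(MIXTRAL_PARAMS, 4):.1f} GB")
print(f"16-bit: {weights_gb(MIXTRAL_PARAMS, 2):.1f} GB")
```

This counts weights only; context and compute buffers would come on top.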

Something like Command-R might work, after ggerganov/llama.cpp#6104 is fixed.

@diimdeep

diimdeep commented Mar 28, 2024

Awesome!
Could you please also graph inference speed on CPU and GPU across different quantizations? I observe that IQ3_S is much slower than Q5_K_M on an x86 CPU.

@franva

franva commented Apr 20, 2024

@Artefact2 Thanks for your document.

I am a beginner in quantization, and I have a hard time figuring out what the "I", "K", and "M" mean in LLM model names,
e.g. IQ3_M, IQ3_XXS, Q3_K_M.

I can guess that Q means quantization, but what about M, XS, and XXS: are they an indicator of size? (What size?)
How about "I" and "K"?

I would appreciate it if you could explain them in very plain language.
