Data collected on 2024-01-13 using vast.ai;
local electricity price estimated at $0.34 per kWh, with draw at 500 W (0.5 kW × $0.34/kWh ≈ $0.17/h).

How fast does a given piece of hardware generate your tokens?

Method: run `ollama run <model> "<query>"`,
where the query is ~50 tokens and the result is ~650 tokens.

Query: "What would Alan Turing think about Large Language Models? Explain with a lot of details and examples."
Generation speed, in tokens/s:

| hardware | 1.6B dolphin-phi | 10B llama-pro | 46B dolphin-mixtral | 120B megadolphin |
|---|---|---|---|---|
| Local 2060 @500W → $0.17/h, 12GB | 103.7 | 45.9 | 6.2 | N/A |
| 1x RTX 3070, $0.11/h, 8GB | 120.6 | 57.5 | 2.1 (slow) | N/A |
| 1x A40, $0.77/h, 45GB | 133.4 | 72.2 | 43.8 | 1.2 (slow) |
| 1x L40, $1.12/h, 46GB | 174.2 | 92.3 | 56.3 | N/A |
| A100_SXM4, $0.90/h, 80GB | 148.5 | 99.1 | 58.0 | 14.1 |
| H100_PCIe, $2.85/h, 80GB | 124.8 | 88.0 | 36.6 | 14.0 |
| 2x A40, $0.80/h, 90GB | 98.4 | 52.9 | 38.5 | 8.9 |
Note: the point of evaluating 2x A40 is to see whether a very big model (~68 GB) works on a dual-GPU setup. Result: it works, but slower, and it is not always cost-effective; that depends on the market prices at startup time.
How many tokens can you generate for $100?

| rented hardware | 1.6B dolphin-phi | 10B llama-pro | 46B dolphin-mixtral | 120B megadolphin |
|---|---|---|---|---|
| local 2060-12GB @500W | 226.6M | 97.2M | 13.1M | N/A |
| 1x 3070 | 391.1M | 186.4M | 6.7M | N/A |
| 1x A40 | 46.4M | 33.7M | 20.4M | N/A |
| 1x L40 | 55.9M | 29.6M | 18.1M | N/A |
| 2x A40 | 43.8M | 23.7M | 17.3M | 4.0M |
| A100_SXM4 | 59.4M | 39.6M | 23.2M | 5.6M |
| H100_PCIe | 15.7M | 11.1M | 4.6M | 1.7M |
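The dollar-efficiency numbers follow from the speed table: tokens per $100 = (tokens/s) × 3600 / ($/h) × 100. A minimal sketch of that conversion (measured table values can drift from the formula when spot prices move between runs):

```python
def tokens_per_100_dollars(tokens_per_s: float, price_per_hour: float) -> float:
    """Tokens generated for $100 of rental time (ignores startup/storage fees)."""
    return tokens_per_s * 3600 / price_per_hour * 100

# A100_SXM4: 148.5 tok/s on dolphin-phi at $0.90/h -> 59.4M tokens per $100
print(tokens_per_100_dollars(148.5, 0.90) / 1e6)  # 59.4
```

For the local 2060 the "price" is the electricity cost ($0.17/h at the assumed 500 W draw).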
| API and provider | tokens per $100 |
|---|---|
| mixtral-medium on original mixtral site | 12.18M |
| mixtral 8x7B on fireworks | 62.50M |
| gpt-3.5-turbo-1106 aka ChatGPT 3.5 | 51.80M |
| gpt-4-1106-preview aka GPT4-Turbo | 3.50M |
| gpt-4-32k API aka Best GPT4 ever | 0.86M |
https://app.fireworks.ai/pricing
- is $774 per month
- is $8928 per year
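The API figures appear to blend input and output pricing at this benchmark's ~50-token prompt / ~650-token reply mix. A sketch of that blend, using OpenAI's early-2024 list prices (per-1K-token rates below are assumptions, not from the tables above):

```python
def blended_tokens_per_100(price_in_per_1k: float, price_out_per_1k: float,
                           n_in: int = 50, n_out: int = 650) -> float:
    """Tokens per $100 at a fixed prompt/reply mix (default: 50 in, 650 out)."""
    cost_per_call = (n_in * price_in_per_1k + n_out * price_out_per_1k) / 1000
    return (n_in + n_out) * 100 / cost_per_call

# gpt-4-1106-preview at $0.01/1K input, $0.03/1K output:
print(blended_tokens_per_100(0.01, 0.03) / 1e6)  # 3.5, matching the table's 3.50M
```

The same mix with gpt-3.5-turbo-1106 prices ($0.001/$0.002 per 1K) gives ~51.9M, close to the table's 51.80M.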
TinyLlama is trained on approx. 3T tokens. What would it take to prepare (refine) these tokens using an LLM?

Note: for large-scale generation, the cost could plausibly drop by approx. 4x thanks to batching, caching, and other tricks. Still, this is a first-order approximation of the magnitude of the effort required.
| condition | result |
|---|---|
| at 0.8M/$100 | $125.0M |
| at 5.6M/$100 | $17.8M |
| at 39.4M/$100 | $2.5M |
| at 186.4M/$100 | $0.5M |
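The arithmetic behind the table is cost = tokens ÷ (tokens per $100) × $100. The results line up with a token count of ~1T rather than 3T (TinyLlama's ~3T training tokens come from roughly three passes over a ~1T-token dataset), so this sketch assumes 1T:

```python
def refine_cost_usd(total_tokens: float, tokens_per_100_dollars: float) -> float:
    """Cost in dollars to process total_tokens at a given tokens-per-$100 rate."""
    return total_tokens / tokens_per_100_dollars * 100

# 1T tokens at the A100 megadolphin rate of 5.6M tokens per $100:
print(refine_cost_usd(1e12, 5.6e6) / 1e6)  # ~17.9 (million dollars)
```

At the cheapest rate in the tables (1x 3070 running llama-pro, 186.4M/$100), the same 1T tokens come to roughly $0.5M.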