@jerzydziewierz
Last active August 23, 2024 06:42
ollama cost of inference vs openai

Data collected on 2024-01-13 using vast.ai; local electricity price estimated at $0.34 per kWh, with usage at 500W.

Performance in tokens/second

How fast does a given piece of hardware generate your tokens?

Method: run `ollama run X {query}`, where the query is ~50 tokens and the result is ~650 tokens.

query: What would Alan Turing think about Large Language Models? Explain with a lot of details and examples.
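The tokens/s figures below are just generated tokens divided by wall-clock time; a minimal sketch (the token count and timing here are illustrative assumptions, not measured values from this gist):

```python
# Hypothetical example: tokens/s = generated tokens / wall-clock seconds.
generated_tokens = 650      # approximate length of one response
elapsed_seconds = 6.3       # assumed wall-clock time for the generation
tokens_per_second = generated_tokens / elapsed_seconds
print(f"{tokens_per_second:.1f} tokens/s")  # -> 103.2 tokens/s
```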

| hardware | price | VRAM | 1.6B dolphin-phi | 10B llama-pro | 46B dolphin-mixtral | 120B megadolphin |
|---|---|---|---|---|---|---|
| local 2060 (@500W → $0.17/h) | $0.17/h | 12GB | 103.7 | 45.9 | 6.2 | N/A |
| 1x RTX 3070 | $0.11/h | 8GB | 120.6 | 57.5 | 2.1 (slow) | N/A |
| 1x A40 | $0.77/h | 45GB | 133.4 | 72.2 | 43.8 | 1.2 (slow) |
| 1x L40 | $1.12/h | 46GB | 174.2 | 92.3 | 56.3 | N/A |
| A100_SXM4 | $0.90/h | 80GB | 148.5 | 99.1 | 58.0 | 14.1 |
| H100_PCIe | $2.85/h | 80GB | 124.8 | 88.0 | 36.6 | 14.0 |
| 2x A40 | $0.80/h | 90GB | 98.4 | 52.9 | 38.5 | 8.9 |

Note: the point of evaluating 2x A40 is to see whether a very big model (~68GB) will run on dual GPUs. Result: it works, but slower, and it is not always cost-effective; that depends on the market prices at startup time.

Cost effectiveness by model and hardware

How many tokens can you generate for $100?

| rented hardware | 1.6B dolphin-phi | 10B llama-pro | 46B dolphin-mixtral | 120B megadolphin |
|---|---|---|---|---|
| local 2060-12GB @500W | 226.6M | 97.2M | 13.1M | N/A |
| 1x 3070 | 391.1M | 186.4M | 6.7M | N/A |
| 1x A40 | 46.4M | 33.7M | 20.4M | N/A |
| 1x L40 | 55.9M | 29.6M | 18.1M | N/A |
| 2x A40 | 43.8M | 23.7M | 17.3M | 4.0M |
| A100_SXM4 | 59.4M | 39.6M | 23.2M | 5.6M |
| H100_PCIe | 15.7M | 11.1M | 4.6M | 1.7M |
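The per-hardware numbers follow directly from the tokens/s table and the hourly rental price; a minimal sketch, using the A100_SXM4 row (whose figure reproduces exactly; rows with estimated electricity cost, like the local 2060, land slightly off due to rounding):

```python
def tokens_per_100_dollars(tokens_per_second: float, dollars_per_hour: float) -> float:
    """Tokens generated in the number of hours that $100 buys."""
    tokens_per_hour = tokens_per_second * 3600
    hours = 100 / dollars_per_hour
    return tokens_per_hour * hours

# A100_SXM4: $0.90/h, 148.5 tokens/s on 1.6B dolphin-phi
print(f"{tokens_per_100_dollars(148.5, 0.90) / 1e6:.1f}M")  # -> 59.4M, matching the table
```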
| API and provider | tokens per $100 |
|---|---|
| mixtral-medium (on original mixtral site) | 12.18M |
| mixtral 8x7B (on fireworks) | 62.50M |
| gpt-3.5-turbo-1106 (aka ChatGPT 3.5) | 51.80M |
| gpt-4-1106-preview (aka GPT4-Turbo) | 3.50M |
| gpt-4-32k API (aka best GPT4 ever) | 0.86M |

https://app.fireworks.ai/pricing

Cost of $1/hour for a month

  • is $744 per month (24 h × 31 days)

  • is $8928 per year

Cost to generate 1T tokens

TinyLlama is trained on approx. 3T tokens. What would it take to prepare (refine) that many tokens using an LLM?

Note: for large-scale generation, the cost could possibly drop by approx. 4x thanks to batching, caching and other tricks. Still, this is a first-order approximation of the magnitude of the effort required.

| tokens per $100 | cost of 1T tokens |
|---|---|
| 0.8M | $125.0M |
| 5.6M | $17.8M |
| 39.4M | $2.5M |
| 186.4M | $0.5M |
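The estimates above are just 1e12 tokens divided by the tokens-per-$100 rate, times $100; a sketch:

```python
def cost_of_one_trillion_tokens(tokens_per_100_dollars: float) -> float:
    """Dollars needed to generate 1e12 tokens at the given rate."""
    return 1e12 / tokens_per_100_dollars * 100

for rate_m in (0.8, 5.6, 39.4, 186.4):
    cost = cost_of_one_trillion_tokens(rate_m * 1e6)
    print(f"at {rate_m}M/$100 -> ${cost / 1e6:.1f}M")
```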