@jerzydziewierz
Last active August 23, 2024 06:42
ollama cost of inference vs openai

Data collected on 2024-01-13 using vast.ai; local electricity price estimated at $0.34 per kWh, with usage at 500W.

Performance in tokens/second

How fast does a given piece of hardware generate your tokens?

Method: run `ollama run X {query}`, where the query is ~50 tokens and the result is ~650 tokens.

query: What would Alan Turing think about Large Language Models? Explain with a lot of details and examples.
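The tokens/s figures below are just generated tokens divided by wall-clock time; a minimal sketch (the token count and timing here are illustrative assumptions, not measured values from this gist):

```python
# Hypothetical example: tokens/s = generated tokens / wall-clock seconds.
generated_tokens = 650      # approximate length of one response
elapsed_seconds = 6.3       # assumed wall-clock time for the generation
tokens_per_second = generated_tokens / elapsed_seconds
print(f"{tokens_per_second:.1f} tokens/s")  # -> 103.2 tokens/s
```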

| hardware | price | VRAM | 1.6B dolphin-phi | 10B llama-pro | 46B dolphin-mixtral | 120B megadolphin |
|---|---|---|---|---|---|---|
| local 2060 (@500W → $0.17/h) | $0.17/h | 12GB | 103.7 | 45.9 | 6.2 | N/A |
| 1x RTX 3070 | $0.11/h | 8GB | 120.6 | 57.5 | 2.1 (slow) | N/A |
| 1x A40 | $0.77/h | 45GB | 133.4 | 72.2 | 43.8 | 1.2 (slow) |
| 1x L40 | $1.12/h | 46GB | 174.2 | 92.3 | 56.3 | N/A |
| A100_SXM4 | $0.90/h | 80GB | 148.5 | 99.1 | 58.0 | 14.1 |
| H100_PCIe | $2.85/h | 80GB | 124.8 | 88.0 | 36.6 | 14.0 |
| 2x A40 | $0.80/h | 90GB | 98.4 | 52.9 | 38.5 | 8.9 |

Note: the point of evaluating 2x A40 is to see whether a very big model (~68GB) will run on dual GPUs. Result: it works, but slower, and it is not always cost-effective; that depends on the market prices at startup time.

Cost effectiveness by model and hardware

How many tokens can you generate for $100?

| rented hardware | 1.6B dolphin-phi | 10B llama-pro | 46B dolphin-mixtral | 120B megadolphin |
|---|---|---|---|---|
| local 2060-12GB @500W | 226.6M | 97.2M | 13.1M | N/A |
| 1x 3070 | 391.1M | 186.4M | 6.7M | N/A |
| 1x A40 | 46.4M | 33.7M | 20.4M | N/A |
| 1x L40 | 55.9M | 29.6M | 18.1M | N/A |
| 2x A40 | 43.8M | 23.7M | 17.3M | 4.0M |
| A100_SXM4 | 59.4M | 39.6M | 23.2M | 5.6M |
| H100_PCIe | 15.7M | 11.1M | 4.6M | 1.7M |
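The per-hardware numbers follow directly from the tokens/s table and the hourly rental price; a minimal sketch, using the A100_SXM4 row (whose figure reproduces exactly; rows with estimated electricity cost, like the local 2060, land slightly off due to rounding):

```python
def tokens_per_100_dollars(tokens_per_second: float, dollars_per_hour: float) -> float:
    """Tokens generated in the number of hours that $100 buys."""
    tokens_per_hour = tokens_per_second * 3600
    hours = 100 / dollars_per_hour
    return tokens_per_hour * hours

# A100_SXM4: $0.90/h, 148.5 tokens/s on 1.6B dolphin-phi
print(f"{tokens_per_100_dollars(148.5, 0.90) / 1e6:.1f}M")  # -> 59.4M, matching the table
```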
| API and provider | tokens per $100 |
|---|---|
| mixtral-medium (on original mixtral site) | 12.18M |
| mixtral 8x7B (on fireworks) | 62.50M |
| gpt-3.5-turbo-1106 (aka ChatGPT 3.5) | 51.80M |
| gpt-4-1106-preview (aka GPT4-Turbo) | 3.50M |
| gpt-4-32k API (aka best GPT4 ever) | 0.86M |

https://app.fireworks.ai/pricing

Cost of $1/hour for a month

  • is $744 per month (24 h × 31 days)

  • is $8928 per year

Cost to generate 1T tokens

TinyLlama is trained on approx. 3T tokens. What would it take to prepare (refine) that many tokens using an LLM?

Note: for large-scale generation, the cost could possibly drop by approx. 4x thanks to batching, caching and other tricks. Still, this is a first-order approximation of the magnitude of the effort required.

| tokens per $100 | cost of 1T tokens |
|---|---|
| 0.8M | $125.0M |
| 5.6M | $17.8M |
| 39.4M | $2.5M |
| 186.4M | $0.5M |
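The estimates above are just 1e12 tokens divided by the tokens-per-$100 rate, times $100; a sketch:

```python
def cost_of_one_trillion_tokens(tokens_per_100_dollars: float) -> float:
    """Dollars needed to generate 1e12 tokens at the given rate."""
    return 1e12 / tokens_per_100_dollars * 100

for rate_m in (0.8, 5.6, 39.4, 186.4):
    cost = cost_of_one_trillion_tokens(rate_m * 1e6)
    print(f"at {rate_m}M/$100 -> ${cost / 1e6:.1f}M")
```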