Ioannis Papapanagiotou ipapapa

## rag_math.md

      
              1 file
            
          
              0 forks
            
          
                0 comments
              
            
              0 stars
            
          
                ipapapa
                / rag_math.md
            
            
              Last active
              January 2, 2026 19:59
            
          
    The Cost Formula

Total Cost = (Input Tokens × Rate_in) + (Output Tokens × Rate_out) + Rerank Fee
For our standard 6,100-token turn (5,300 input / 800 output):

Gemini Input: 5,300 tokens × $0.50/1M = $0.00265
Gemini Output: 800 tokens × $3.00/1M = $0.00240
Cohere Rerank: 1 search × $0.002/req = $0.00200
Total Pipeline Cost: $0.00705 / query


## rag_table1.md

      
              1 file
            
          
              0 forks
            
          
                0 comments
              
            
              0 stars
            
          
                ipapapa
                / rag_table1.md
            
            
              Created
              January 2, 2026 19:03
            
          
    Table 1: CPU RAG Benchmarks (The "Death Spiral")


Hardware
Model
RAG Response Time
Verdict


2 vCPU
Qwen 4B
~15–18 mins
💀 System Failure


4 vCPU
Qwen 4B
~10–12 mins
💀 System Failure


8 vCPU
Qwen 8B
~7–10 mins
💀 System Failure


## rag_table_2.md

      
              1 file
            
          
              0 forks
            
          
                0 comments
              
            
              0 stars
            
          
                ipapapa
                / rag_table_2.md
            
            
              Created
              January 2, 2026 19:02
            
          
    Table 2: Accuracy Benchmarks (Ground Truth Test)


Model
Context Window
Accuracy
Issues


Gemini 3.0 Flash
1M+
100%
Perfect citations. Identified all items.


Qwen 3 (H100)
131k
71%
Missed 2 valid items.


Qwen 3 (H100)
4k
29%
Severe hallucinations.


## rag_table3.md

      
              1 file
            
          
              0 forks
            
          
                0 comments
              
            
              0 stars
            
          
                ipapapa
                / rag_table3.md
            
            
              Created
              January 2, 2026 19:01
            
          
    Table 3: The Final Verdict


Metric
Local H100
Gemini 3.0 Flash
Winner


Response Time
45 seconds
2 seconds
☁️ Cloud


Accuracy
71%
100%
☁️ Cloud


Break-Even Volume
> 16,000/day
N/A
☁️ Cloud


## rag_cost.md

      
              1 file
            
          
              0 forks
            
          
                0 comments
              
            
              0 stars
            
          
                ipapapa
                / rag_cost.md
            
            
              Created
              January 2, 2026 19:00
            
          
    Cost Analysis: Managed API vs. Rented H100


Metric
Managed Architecture
(Gemini 3.0 Flash + Cohere)
Rented Infrastructure
(DigitalOcean H100 Droplet)


Billing Model
Pay-as-you-go (Utility)
Fixed Hourly (Subscription)


Model Cost
$0.50 / 1M input | $3.00 / 1M output
~$2,440.00 / month ($3.39/hr)


Retrieval Cost
$2.00 / 1k queries (Cohere Rerank)
$2.00 / 1k queries (Cohere Rerank)


Cost per Query
$0.00705
$0.00200 (Retrieval only)


Daily Floor Cost
$0.00 
(Pay only for what you use)
$81.36 
(Assuming 24/7 uptime)


Break-Even Point
—
~16,110 Queries / Day


## rag_formula.py
def calculate_break_even():
    """
    Calculates the volume of queries needed to justify renting an H100
    vs. using Gemini 3.0 Flash API.
    """

    # 1. Fixed Infrastructure Costs (Monthly)
    h100_server_cost = 2440.00

    # 2. Variable Costs per Query
Hardware	Model	RAG Response Time	Verdict
2 vCPU	Qwen 4B	~15–18 mins	💀 System Failure
4 vCPU	Qwen 4B	~10–12 mins	💀 System Failure
8 vCPU	Qwen 8B	~7–10 mins	💀 System Failure
Model	Context Window	Accuracy	Issues
Gemini 3.0 Flash	1M+	100%	Perfect citations. Identified all items.
Qwen 3 (H100)	131k	71%	Missed 2 valid items.
Qwen 3 (H100)	4k	29%	Severe hallucinations.
Metric	Local H100	Gemini 3.0 Flash	Winner
Response Time	45 seconds	2 seconds	☁️ Cloud
Accuracy	71%	100%	☁️ Cloud
Break-Even Volume	> 16,000/day	N/A	☁️ Cloud
Metric	Managed Architecture (Gemini 3.0 Flash + Cohere)	Rented Infrastructure (DigitalOcean H100 Droplet)
Billing Model	Pay-as-you-go (Utility)	Fixed Hourly (Subscription)
Model Cost	$0.50 / 1M input \| $3.00 / 1M output	~$2,440.00 / month ($3.39/hr)
Retrieval Cost	$2.00 / 1k queries (Cohere Rerank)	$2.00 / 1k queries (Cohere Rerank)
Cost per Query	$0.00705	$0.00200 (Retrieval only)
Daily Floor Cost	$0.00 (Pay only for what you use)	$81.36 (Assuming 24/7 uptime)
Break-Even Point	—	~16,110 Queries / Day
	def calculate_break_even():
	"""
	Calculates the volume of queries needed to justify renting an H100
	vs. using Gemini 3.0 Flash API.
	"""

	# 1. Fixed Infrastructure Costs (Monthly)
	h100_server_cost = 2440.00

	# 2. Variable Costs per Query