Ioannis Papapanagiotou ipapapa

The Cost Formula

Total Cost = (Input Tokens × Rate_in) + (Output Tokens × Rate_out) + Rerank Fee

For our standard 6,100-token turn (5,300 input / 800 output):

  • Gemini Input: 5,300 tokens × $0.50/1M = $0.00265
  • Gemini Output: 800 tokens × $3.00/1M = $0.00240
  • Cohere Rerank: 1 search × $0.002/req = $0.00200
  • Total Pipeline Cost: $0.00705 / query
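
The arithmetic above can be reproduced with a small helper. This is only a sketch: the function name `cost_per_query` and its parameter names are illustrative and not part of the original analysis; the rates are the ones quoted in the bullets.

```python
def cost_per_query(input_tokens, output_tokens,
                   rate_in_per_million, rate_out_per_million, rerank_fee):
    """Return the dollar cost of one query: token charges plus a flat rerank fee."""
    input_cost = input_tokens * rate_in_per_million / 1_000_000
    output_cost = output_tokens * rate_out_per_million / 1_000_000
    return input_cost + output_cost + rerank_fee


# Standard 6,100-token turn: 5,300 input / 800 output, one Cohere rerank call.
standard_turn = cost_per_query(5_300, 800, 0.50, 3.00, 0.002)
print(f"${standard_turn:.5f} / query")  # prints "$0.00705 / query"
```

Keeping the rates as arguments makes it easy to re-run the estimate if the token mix or the pricing changes.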

Table 1: CPU RAG Benchmarks (The "Death Spiral")

| Hardware | Model | RAG Response Time | Verdict |
|---|---|---|---|
| 2 vCPU | Qwen 4B | ~15–18 mins | 💀 System Failure |
| 4 vCPU | Qwen 4B | ~10–12 mins | 💀 System Failure |
| 8 vCPU | Qwen 8B | ~7–10 mins | 💀 System Failure |

Table 2: Accuracy Benchmarks (Ground Truth Test)

| Model | Context Window | Accuracy | Issues |
|---|---|---|---|
| Gemini 3.0 Flash | 1M+ | 100% | Perfect citations. Identified all items. |
| Qwen 3 (H100) | 131k | 71% | Missed 2 valid items. |
| Qwen 3 (H100) | 4k | 29% | Severe hallucinations. |

Table 3: The Final Verdict

| Metric | Local H100 | Gemini 3.0 Flash | Winner |
|---|---|---|---|
| Response Time | 45 seconds | 2 seconds | ☁️ Cloud |
| Accuracy | 71% | 100% | ☁️ Cloud |
| Break-Even Volume | > 16,000/day | N/A | ☁️ Cloud |

Cost Analysis: Managed API vs. Rented H100

| Metric | Managed Architecture (Gemini 3.0 Flash + Cohere) | Rented Infrastructure (DigitalOcean H100 Droplet) |
|---|---|---|
| Billing Model | Pay-as-you-go (Utility) | Fixed Hourly (Subscription) |
| Model Cost | $0.50 / 1M input, $3.00 / 1M output | ~$2,440.00 / month ($3.39/hr) |
| Retrieval Cost | $2.00 / 1k queries (Cohere Rerank) | $2.00 / 1k queries (Cohere Rerank) |
| Cost per Query | $0.00705 | $0.00200 (Retrieval only) |
| Daily Floor Cost | $0.00 (Pay only for what you use) | $81.36 (Assuming 24/7 uptime) |

Break-Even Point: ~16,110 Queries / Day

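
To see how the two billing models diverge as traffic grows, the figures in the table can be turned into a daily cost comparison. The sketch below is illustrative only: the constant names and the `daily_costs` helper are hypothetical, and it assumes the H100 runs 24/7 at the quoted hourly rate and that the rented setup still pays the Cohere rerank fee on every query.

```python
HOURLY_H100_RATE = 3.39        # $/hr for the rented H100, from the table
API_COST_PER_QUERY = 0.00705   # managed path: Gemini tokens + Cohere rerank
LOCAL_COST_PER_QUERY = 0.00200 # rented path: Cohere rerank only


def daily_costs(queries_per_day):
    """Return (managed_cost, rented_cost) in dollars for one day of traffic."""
    h100_daily_floor = HOURLY_H100_RATE * 24  # paid even at zero traffic
    managed = queries_per_day * API_COST_PER_QUERY
    rented = h100_daily_floor + queries_per_day * LOCAL_COST_PER_QUERY
    return managed, rented


for volume in (1_000, 10_000, 50_000):
    managed, rented = daily_costs(volume)
    cheaper = "managed API" if managed < rented else "rented H100"
    print(f"{volume:>6} queries/day: managed ${managed:,.2f}, "
          f"rented ${rented:,.2f} -> {cheaper}")
```

Below the break-even volume the fixed daily floor dominates, so the managed API is cheaper; above it, the lower per-query cost of the rented machine takes over.
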
def calculate_break_even():
    """
    Calculates the volume of queries needed to justify renting an H100
    vs. using Gemini 3.0 Flash API.
    """
    # 1. Fixed Infrastructure Costs (Monthly)
    h100_server_cost = 2440.00
    # 2. Variable Costs per Query