Model: gemini/gemini-3-pro-preview (resolved: gemini-3-pro-preview)
Alt text for this image, include all figures and make them comprehensible to a screen reader user
A comprehensive table comparing four AI models across various performance benchmarks. The columns listed are: Benchmark Name, Description, Gemini 3 Pro (highlighted in blue), Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-5.1.
Humanity's Last Exam (Academic reasoning)
- No tools: Gemini 3 Pro 37.5%, Gemini 2.5 Pro 21.6%, Claude Sonnet 4.5 13.7%, GPT-5.1 26.5%.
- With search and code execution: Gemini 3 Pro 45.8% (others have no data).
ARC-AGI-2 (Visual reasoning puzzles; ARC Prize Verified)
- Gemini 3 Pro 31.1%, Gemini 2.5 Pro 4.9%, Claude Sonnet 4.5 13.6%, GPT-5.1 17.6%.
GPQA Diamond (Scientific knowledge; No tools)
- Gemini 3 Pro 91.9%, Gemini 2.5 Pro 86.4%, Claude Sonnet 4.5 83.4%, GPT-5.1 88.1%.
AIME 2025 (Mathematics)
- No tools: Gemini 3 Pro 95.0%, Gemini 2.5 Pro 88.0%, Claude Sonnet 4.5 87.0%, GPT-5.1 94.0%.
- With code execution: Gemini 3 Pro 100%, Claude Sonnet 4.5 100%.
MathArena Apex (Challenging Math Contest problems)
- Gemini 3 Pro 23.4%, Gemini 2.5 Pro 0.5%, Claude Sonnet 4.5 1.6%, GPT-5.1 1.0%.
MMMU-Pro (Multimodal understanding and reasoning)
- Gemini 3 Pro 81.0%, Gemini 2.5 Pro 68.0%, Claude Sonnet 4.5 68.0%, GPT-5.1 76.0%.
ScreenSpot-Pro (Screen understanding)
- Gemini 3 Pro 72.7%, Gemini 2.5 Pro 11.4%, Claude Sonnet 4.5 36.2%, GPT-5.1 3.5%.
CharXiv Reasoning (Information synthesis from complex charts)
- Gemini 3 Pro 81.4%, Gemini 2.5 Pro 69.6%, Claude Sonnet 4.5 68.5%, GPT-5.1 69.5%.
OmniDocBench 1.5 (OCR; Overall Edit Distance, lower is better)
- Gemini 3 Pro 0.115, Gemini 2.5 Pro 0.145, Claude Sonnet 4.5 0.145, GPT-5.1 0.147.
Video-MMMU (Knowledge acquisition from videos)
- Gemini 3 Pro 87.6%, Gemini 2.5 Pro 83.6%, Claude Sonnet 4.5 77.8%, GPT-5.1 80.4%.
LiveCodeBench Pro (Competitive coding problems; Elo Rating, higher is better)
- Gemini 3 Pro 2,439; Gemini 2.5 Pro 1,775; Claude Sonnet 4.5 1,418; GPT-5.1 2,243.
Terminal-Bench 2.0 (Agentic terminal coding; Terminus-2 agent)
- Gemini 3 Pro 54.2%, Gemini 2.5 Pro 32.6%, Claude Sonnet 4.5 42.8%, GPT-5.1 47.6%.
SWE-Bench Verified (Agentic coding; Single attempt)
- Gemini 3 Pro 76.2%, Gemini 2.5 Pro 59.6%, Claude Sonnet 4.5 77.2%, GPT-5.1 76.3%.
τ²-bench (Agentic tool use)
- Gemini 3 Pro 85.4%, Gemini 2.5 Pro 54.9%, Claude Sonnet 4.5 84.7%, GPT-5.1 80.2%.
Vending-Bench 2 (Long-horizon agentic tasks; Net worth (mean), higher is better)
- Gemini 3 Pro $5,478.16; Gemini 2.5 Pro $573.64; Claude Sonnet 4.5 $3,838.74; GPT-5.1 $1,473.43.
FACTS Benchmark Suite (Held out internal grounding, parametric, MM, and search retrieval benchmarks)
- Gemini 3 Pro 70.5%, Gemini 2.5 Pro 63.4%, Claude Sonnet 4.5 50.4%, GPT-5.1 50.8%.
SimpleQA Verified (Parametric knowledge)
- Gemini 3 Pro 72.1%, Gemini 2.5 Pro 54.5%, Claude Sonnet 4.5 29.3%, GPT-5.1 34.9%.
MMMLU (Multilingual Q&A)
- Gemini 3 Pro 91.8%, Gemini 2.5 Pro 89.5%, Claude Sonnet 4.5 89.1%, GPT-5.1 91.0%.
Global PIQA (Commonsense reasoning across 100 languages and cultures)
- Gemini 3 Pro 93.4%, Gemini 2.5 Pro 91.5%, Claude Sonnet 4.5 90.1%, GPT-5.1 90.9%.
MRCR v2 (8-needle) (Long context performance)
- 128k (average): Gemini 3 Pro 77.0%, Gemini 2.5 Pro 58.0%, Claude Sonnet 4.5 47.1%, GPT-5.1 61.6%.
- 1M (pointwise): Gemini 3 Pro 26.3%, Gemini 2.5 Pro 16.4%, Claude Sonnet 4.5 (not supported), GPT-5.1 (not supported).
Token usage: 1,105 input (1,085 image + 20 text), 3,901 output (1,486 response + 2,415 thinking).
Convert to JSON
{
"metadata": {
"columns": [
"Benchmark",
"Description",
"Gemini 3 Pro",
"Gemini 2.5 Pro",
"Claude Sonnet 4.5",
"GPT-5.1"
]
},
"benchmarks": [
{
"name": "Humanity's Last Exam",
"description": "Academic reasoning",
"sub_results": [
{
"condition": "No tools",
"gemini_3_pro": "37.5%",
"gemini_2_5_pro": "21.6%",
"claude_sonnet_4_5": "13.7%",
"gpt_5_1": "26.5%"
},
{
"condition": "With search and code execution",
"gemini_3_pro": "45.8%",
"gemini_2_5_pro": null,
"claude_sonnet_4_5": null,
"gpt_5_1": null
}
]
},
{
"name": "ARC-AGI-2",
"description": "Visual reasoning puzzles",
"notes": "ARC Prize Verified",
"sub_results": [
{
"condition": "Default",
"gemini_3_pro": "31.1%",
"gemini_2_5_pro": "4.9%",
"claude_sonnet_4_5": "13.6%",
"gpt_5_1": "17.6%"
}
]
},
{
"name": "GPQA Diamond",
"description": "Scientific knowledge",
"sub_results": [
{
"condition": "No tools",
"gemini_3_pro": "91.9%",
"gemini_2_5_pro": "86.4%",
"claude_sonnet_4_5": "83.4%",
"gpt_5_1": "88.1%"
}
]
},
{
"name": "AIME 2025",
"description": "Mathematics",
"sub_results": [
{
"condition": "No tools",
"gemini_3_pro": "95.0%",
"gemini_2_5_pro": "88.0%",
"claude_sonnet_4_5": "87.0%",
"gpt_5_1": "94.0%"
},
{
"condition": "With code execution",
"gemini_3_pro": "100%",
"gemini_2_5_pro": null,
"claude_sonnet_4_5": "100%",
"gpt_5_1": null
}
]
},
{
"name": "MathArena Apex",
"description": "Challenging Math Contest problems",
"sub_results": [
{
"condition": "Default",
"gemini_3_pro": "23.4%",
"gemini_2_5_pro": "0.5%",
"claude_sonnet_4_5": "1.6%",
"gpt_5_1": "1.0%"
}
]
},
{
"name": "MMMU-Pro",
"description": "Multimodal understanding and reasoning",
"sub_results": [
{
"condition": "Default",
"gemini_3_pro": "81.0%",
"gemini_2_5_pro": "68.0%",
"claude_sonnet_4_5": "68.0%",
"gpt_5_1": "76.0%"
}
]
},
{
"name": "ScreenSpot-Pro",
"description": "Screen understanding",
"sub_results": [
{
"condition": "Default",
"gemini_3_pro": "72.7%",
"gemini_2_5_pro": "11.4%",
"claude_sonnet_4_5": "36.2%",
"gpt_5_1": "3.5%"
}
]
},
{
"name": "CharXiv Reasoning",
"description": "Information synthesis from complex charts",
"sub_results": [
{
"condition": "Default",
"gemini_3_pro": "81.4%",
"gemini_2_5_pro": "69.6%",
"claude_sonnet_4_5": "68.5%",
"gpt_5_1": "69.5%"
}
]
},
{
"name": "OmniDocBench 1.5",
"description": "OCR",
"notes": "Overall Edit Distance, lower is better",
"sub_results": [
{
"condition": "Default",
"gemini_3_pro": "0.115",
"gemini_2_5_pro": "0.145",
"claude_sonnet_4_5": "0.145",
"gpt_5_1": "0.147"
}
]
},
{
"name": "Video-MMMU",
"description": "Knowledge acquisition from videos",
"sub_results": [
{
"condition": "Default",
"gemini_3_pro": "87.6%",
"gemini_2_5_pro": "83.6%",
"claude_sonnet_4_5": "77.8%",
"gpt_5_1": "80.4%"
}
]
},
{
"name": "LiveCodeBench Pro",
"description": "Competitive coding problems from Codeforces, ICPC, and IOI",
"notes": "Elo Rating, higher is better",
"sub_results": [
{
"condition": "Default",
"gemini_3_pro": "2,439",
"gemini_2_5_pro": "1,775",
"claude_sonnet_4_5": "1,418",
"gpt_5_1": "2,243"
}
]
},
{
"name": "Terminal-Bench 2.0",
"description": "Agentic terminal coding",
"notes": "Terminus-2 agent",
"sub_results": [
{
"condition": "Default",
"gemini_3_pro": "54.2%",
"gemini_2_5_pro": "32.6%",
"claude_sonnet_4_5": "42.8%",
"gpt_5_1": "47.6%"
}
]
},
{
"name": "SWE-Bench Verified",
"description": "Agentic coding",
"notes": "Single attempt",
"sub_results": [
{
"condition": "Default",
"gemini_3_pro": "76.2%",
"gemini_2_5_pro": "59.6%",
"claude_sonnet_4_5": "77.2%",
"gpt_5_1": "76.3%"
}
]
},
{
"name": "τ²-bench",
"description": "Agentic tool use",
"sub_results": [
{
"condition": "Default",
"gemini_3_pro": "85.4%",
"gemini_2_5_pro": "54.9%",
"claude_sonnet_4_5": "84.7%",
"gpt_5_1": "80.2%"
}
]
},
{
"name": "Vending-Bench 2",
"description": "Long-horizon agentic tasks",
"notes": "Net worth (mean), higher is better",
"sub_results": [
{
"condition": "Default",
"gemini_3_pro": "$5,478.16",
"gemini_2_5_pro": "$573.64",
"claude_sonnet_4_5": "$3,838.74",
"gpt_5_1": "$1,473.43"
}
]
},
{
"name": "FACTS Benchmark Suite",
"description": "Held out internal grounding, parametric, MM, and search retrieval benchmarks",
"sub_results": [
{
"condition": "Default",
"gemini_3_pro": "70.5%",
"gemini_2_5_pro": "63.4%",
"claude_sonnet_4_5": "50.4%",
"gpt_5_1": "50.8%"
}
]
},
{
"name": "SimpleQA Verified",
"description": "Parametric knowledge",
"sub_results": [
{
"condition": "Default",
"gemini_3_pro": "72.1%",
"gemini_2_5_pro": "54.5%",
"claude_sonnet_4_5": "29.3%",
"gpt_5_1": "34.9%"
}
]
},
{
"name": "MMMLU",
"description": "Multilingual Q&A",
"sub_results": [
{
"condition": "Default",
"gemini_3_pro": "91.8%",
"gemini_2_5_pro": "89.5%",
"claude_sonnet_4_5": "89.1%",
"gpt_5_1": "91.0%"
}
]
},
{
"name": "Global PIQA",
"description": "Commonsense reasoning across 100 languages and cultures",
"sub_results": [
{
"condition": "Default",
"gemini_3_pro": "93.4%",
"gemini_2_5_pro": "91.5%",
"claude_sonnet_4_5": "90.1%",
"gpt_5_1": "90.9%"
}
]
},
{
"name": "MRCR v2 (8-needle)",
"description": "Long context performance",
"sub_results": [
{
"condition": "128k (average)",
"gemini_3_pro": "77.0%",
"gemini_2_5_pro": "58.0%",
"claude_sonnet_4_5": "47.1%",
"gpt_5_1": "61.6%"
},
{
"condition": "1M (pointwise)",
"gemini_3_pro": "26.3%",
"gemini_2_5_pro": "16.4%",
"claude_sonnet_4_5": "not supported",
"gpt_5_1": "not supported"
}
]
}
]
}
Token usage: 2,596 input (1,511 text + 1,085 image), 5,400 output (2,862 response + 2,538 thinking).
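Once exported, the JSON above is easy to query programmatically, since every benchmark follows the same `benchmarks` → `sub_results` layout. A minimal sketch (the `score` helper is illustrative, not part of the transcript; the snippet embeds a two-benchmark subset so it runs standalone):

```python
import json

# A two-benchmark subset of the exported table, matching the schema above.
doc = json.loads("""
{
  "benchmarks": [
    {"name": "GPQA Diamond",
     "sub_results": [{"condition": "No tools", "gemini_3_pro": "91.9%", "gpt_5_1": "88.1%"}]},
    {"name": "AIME 2025",
     "sub_results": [{"condition": "No tools", "gemini_3_pro": "95.0%", "gpt_5_1": "94.0%"}]}
  ]
}
""")

def score(doc, benchmark, model_key, condition=None):
    """Return the first matching score string, or None if absent."""
    for b in doc["benchmarks"]:
        if b["name"] != benchmark:
            continue
        for row in b["sub_results"]:
            if condition is None or row["condition"] == condition:
                return row.get(model_key)
    return None

print(score(doc, "GPQA Diamond", "gemini_3_pro"))  # prints "91.9%"
```

Note that scores are kept as strings ("91.9%", "$5,478.16", "not supported") rather than numbers, since the table mixes percentages, Elo ratings, dollar amounts, and edit distances; a consumer would strip the unit per benchmark before comparing values.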