@simonw
Created November 18, 2025 17:50
2025-11-18T17:42:17 conversation: 01kac0zqnbt6zcc8kskb953h0j id: 01kac0ysfcyvmtbg2d9xh2yhg8

Model: gemini/gemini-3-pro-preview (resolved: gemini-3-pro-preview)

Prompt

Alt text for this image, include all figures and make them comprehensible to a screen reader user

Attachments

  1. image/jpeg: https://static.simonwillison.net/static/2025/gemini-3-benchmarks.jpg

Response

A comprehensive table comparing four AI models across various performance benchmarks. The columns listed are: Benchmark Name, Description, Gemini 3 Pro (highlighted in blue), Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-5.1.

Humanity's Last Exam (Academic reasoning)

  • No tools: Gemini 3 Pro 37.5%, Gemini 2.5 Pro 21.6%, Claude Sonnet 4.5 13.7%, GPT-5.1 26.5%.
  • With search and code execution: Gemini 3 Pro 45.8% (others have no data).

ARC-AGI-2 (Visual reasoning puzzles; ARC Prize Verified)

  • Gemini 3 Pro 31.1%, Gemini 2.5 Pro 4.9%, Claude Sonnet 4.5 13.6%, GPT-5.1 17.6%.

GPQA Diamond (Scientific knowledge; No tools)

  • Gemini 3 Pro 91.9%, Gemini 2.5 Pro 86.4%, Claude Sonnet 4.5 83.4%, GPT-5.1 88.1%.

AIME 2025 (Mathematics)

  • No tools: Gemini 3 Pro 95.0%, Gemini 2.5 Pro 88.0%, Claude Sonnet 4.5 87.0%, GPT-5.1 94.0%.
  • With code execution: Gemini 3 Pro 100%, Claude Sonnet 4.5 100%.

MathArena Apex (Challenging Math Contest problems)

  • Gemini 3 Pro 23.4%, Gemini 2.5 Pro 0.5%, Claude Sonnet 4.5 1.6%, GPT-5.1 1.0%.

MMMU-Pro (Multimodal understanding and reasoning)

  • Gemini 3 Pro 81.0%, Gemini 2.5 Pro 68.0%, Claude Sonnet 4.5 68.0%, GPT-5.1 76.0%.

ScreenSpot-Pro (Screen understanding)

  • Gemini 3 Pro 72.7%, Gemini 2.5 Pro 11.4%, Claude Sonnet 4.5 36.2%, GPT-5.1 3.5%.

CharXiv Reasoning (Information synthesis from complex charts)

  • Gemini 3 Pro 81.4%, Gemini 2.5 Pro 69.6%, Claude Sonnet 4.5 68.5%, GPT-5.1 69.5%.

OmniDocBench 1.5 (OCR; Overall Edit Distance, lower is better)

  • Gemini 3 Pro 0.115, Gemini 2.5 Pro 0.145, Claude Sonnet 4.5 0.145, GPT-5.1 0.147.

Video-MMMU (Knowledge acquisition from videos)

  • Gemini 3 Pro 87.6%, Gemini 2.5 Pro 83.6%, Claude Sonnet 4.5 77.8%, GPT-5.1 80.4%.

LiveCodeBench Pro (Competitive coding problems; Elo Rating, higher is better)

  • Gemini 3 Pro 2,439; Gemini 2.5 Pro 1,775; Claude Sonnet 4.5 1,418; GPT-5.1 2,243.

Terminal-Bench 2.0 (Agentic terminal coding; Terminus-2 agent)

  • Gemini 3 Pro 54.2%, Gemini 2.5 Pro 32.6%, Claude Sonnet 4.5 42.8%, GPT-5.1 47.6%.

SWE-Bench Verified (Agentic coding; Single attempt)

  • Gemini 3 Pro 76.2%, Gemini 2.5 Pro 59.6%, Claude Sonnet 4.5 77.2%, GPT-5.1 76.3%.

t2-bench (Agentic tool use)

  • Gemini 3 Pro 85.4%, Gemini 2.5 Pro 54.9%, Claude Sonnet 4.5 84.7%, GPT-5.1 80.2%.

Vending-Bench 2 (Long-horizon agentic tasks; Net worth (mean), higher is better)

  • Gemini 3 Pro $5,478.16; Gemini 2.5 Pro $573.64; Claude Sonnet 4.5 $3,838.74; GPT-5.1 $1,473.43.

FACTS Benchmark Suite (Held out internal grounding, parametric, MM, and search retrieval benchmarks)

  • Gemini 3 Pro 70.5%, Gemini 2.5 Pro 63.4%, Claude Sonnet 4.5 50.4%, GPT-5.1 50.8%.

SimpleQA Verified (Parametric knowledge)

  • Gemini 3 Pro 72.1%, Gemini 2.5 Pro 54.5%, Claude Sonnet 4.5 29.3%, GPT-5.1 34.9%.

MMMLU (Multilingual Q&A)

  • Gemini 3 Pro 91.8%, Gemini 2.5 Pro 89.5%, Claude Sonnet 4.5 89.1%, GPT-5.1 91.0%.

Global PIQA (Commonsense reasoning across 100 Languages and Cultures)

  • Gemini 3 Pro 93.4%, Gemini 2.5 Pro 91.5%, Claude Sonnet 4.5 90.1%, GPT-5.1 90.9%.

MRCR v2 (8-needle) (Long context performance)

  • 128k (average): Gemini 3 Pro 77.0%, Gemini 2.5 Pro 58.0%, Claude Sonnet 4.5 47.1%, GPT-5.1 61.6%.
  • 1M (pointwise): Gemini 3 Pro 26.3%, Gemini 2.5 Pro 16.4%, Claude Sonnet 4.5 (not supported), GPT-5.1 (not supported).

Token usage

1,105 input, 3,901 output, {"candidatesTokenCount": 1486, "promptTokensDetails": [{"modality": "IMAGE", "tokenCount": 1085}, {"modality": "TEXT", "tokenCount": 20}], "thoughtsTokenCount": 2415}
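The reported totals can be cross-checked against the detail blob: the input count is the sum of the prompt modalities (image plus text), and the output count is candidate tokens plus thinking tokens. A minimal sketch, using the field names from the usage JSON above:

```python
import json

# The usage detail blob reported above, inlined verbatim.
usage = json.loads("""{"candidatesTokenCount": 1486,
  "promptTokensDetails": [{"modality": "IMAGE", "tokenCount": 1085},
                          {"modality": "TEXT", "tokenCount": 20}],
  "thoughtsTokenCount": 2415}""")

# Input = sum of prompt modalities; output = candidates + thinking tokens.
input_tokens = sum(d["tokenCount"] for d in usage["promptTokensDetails"])
output_tokens = usage["candidatesTokenCount"] + usage["thoughtsTokenCount"]
print(input_tokens, output_tokens)  # 1105 3901 -- matches the reported totals
```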

2025-11-18T17:48:43

Prompt

Convert to JSON

Response

{
  "metadata": {
    "columns": [
      "Benchmark",
      "Description",
      "Gemini 3 Pro",
      "Gemini 2.5 Pro",
      "Claude Sonnet 4.5",
      "GPT-5.1"
    ]
  },
  "benchmarks": [
    {
      "name": "Humanity's Last Exam",
      "description": "Academic reasoning",
      "sub_results": [
        {
          "condition": "No tools",
          "gemini_3_pro": "37.5%",
          "gemini_2_5_pro": "21.6%",
          "claude_sonnet_4_5": "13.7%",
          "gpt_5_1": "26.5%"
        },
        {
          "condition": "With search and code execution",
          "gemini_3_pro": "45.8%",
          "gemini_2_5_pro": null,
          "claude_sonnet_4_5": null,
          "gpt_5_1": null
        }
      ]
    },
    {
      "name": "ARC-AGI-2",
      "description": "Visual reasoning puzzles",
      "notes": "ARC Prize Verified",
      "sub_results": [
        {
          "condition": "Default",
          "gemini_3_pro": "31.1%",
          "gemini_2_5_pro": "4.9%",
          "claude_sonnet_4_5": "13.6%",
          "gpt_5_1": "17.6%"
        }
      ]
    },
    {
      "name": "GPQA Diamond",
      "description": "Scientific knowledge",
      "sub_results": [
        {
          "condition": "No tools",
          "gemini_3_pro": "91.9%",
          "gemini_2_5_pro": "86.4%",
          "claude_sonnet_4_5": "83.4%",
          "gpt_5_1": "88.1%"
        }
      ]
    },
    {
      "name": "AIME 2025",
      "description": "Mathematics",
      "sub_results": [
        {
          "condition": "No tools",
          "gemini_3_pro": "95.0%",
          "gemini_2_5_pro": "88.0%",
          "claude_sonnet_4_5": "87.0%",
          "gpt_5_1": "94.0%"
        },
        {
          "condition": "With code execution",
          "gemini_3_pro": "100%",
          "gemini_2_5_pro": null,
          "claude_sonnet_4_5": "100%",
          "gpt_5_1": null
        }
      ]
    },
    {
      "name": "MathArena Apex",
      "description": "Challenging Math Contest problems",
      "sub_results": [
        {
          "condition": "Default",
          "gemini_3_pro": "23.4%",
          "gemini_2_5_pro": "0.5%",
          "claude_sonnet_4_5": "1.6%",
          "gpt_5_1": "1.0%"
        }
      ]
    },
    {
      "name": "MMMU-Pro",
      "description": "Multimodal understanding and reasoning",
      "sub_results": [
        {
          "condition": "Default",
          "gemini_3_pro": "81.0%",
          "gemini_2_5_pro": "68.0%",
          "claude_sonnet_4_5": "68.0%",
          "gpt_5_1": "76.0%"
        }
      ]
    },
    {
      "name": "ScreenSpot-Pro",
      "description": "Screen understanding",
      "sub_results": [
        {
          "condition": "Default",
          "gemini_3_pro": "72.7%",
          "gemini_2_5_pro": "11.4%",
          "claude_sonnet_4_5": "36.2%",
          "gpt_5_1": "3.5%"
        }
      ]
    },
    {
      "name": "CharXiv Reasoning",
      "description": "Information synthesis from complex charts",
      "sub_results": [
        {
          "condition": "Default",
          "gemini_3_pro": "81.4%",
          "gemini_2_5_pro": "69.6%",
          "claude_sonnet_4_5": "68.5%",
          "gpt_5_1": "69.5%"
        }
      ]
    },
    {
      "name": "OmniDocBench 1.5",
      "description": "OCR",
      "notes": "Overall Edit Distance, lower is better",
      "sub_results": [
        {
          "condition": "Default",
          "gemini_3_pro": "0.115",
          "gemini_2_5_pro": "0.145",
          "claude_sonnet_4_5": "0.145",
          "gpt_5_1": "0.147"
        }
      ]
    },
    {
      "name": "Video-MMMU",
      "description": "Knowledge acquisition from videos",
      "sub_results": [
        {
          "condition": "Default",
          "gemini_3_pro": "87.6%",
          "gemini_2_5_pro": "83.6%",
          "claude_sonnet_4_5": "77.8%",
          "gpt_5_1": "80.4%"
        }
      ]
    },
    {
      "name": "LiveCodeBench Pro",
      "description": "Competitive coding problems from Codeforces, ICPC, and IOI",
      "notes": "Elo Rating, higher is better",
      "sub_results": [
        {
          "condition": "Default",
          "gemini_3_pro": "2,439",
          "gemini_2_5_pro": "1,775",
          "claude_sonnet_4_5": "1,418",
          "gpt_5_1": "2,243"
        }
      ]
    },
    {
      "name": "Terminal-Bench 2.0",
      "description": "Agentic terminal coding",
      "notes": "Terminus-2 agent",
      "sub_results": [
        {
          "condition": "Default",
          "gemini_3_pro": "54.2%",
          "gemini_2_5_pro": "32.6%",
          "claude_sonnet_4_5": "42.8%",
          "gpt_5_1": "47.6%"
        }
      ]
    },
    {
      "name": "SWE-Bench Verified",
      "description": "Agentic coding",
      "notes": "Single attempt",
      "sub_results": [
        {
          "condition": "Default",
          "gemini_3_pro": "76.2%",
          "gemini_2_5_pro": "59.6%",
          "claude_sonnet_4_5": "77.2%",
          "gpt_5_1": "76.3%"
        }
      ]
    },
    {
      "name": "t2-bench",
      "description": "Agentic tool use",
      "sub_results": [
        {
          "condition": "Default",
          "gemini_3_pro": "85.4%",
          "gemini_2_5_pro": "54.9%",
          "claude_sonnet_4_5": "84.7%",
          "gpt_5_1": "80.2%"
        }
      ]
    },
    {
      "name": "Vending-Bench 2",
      "description": "Long-horizon agentic tasks",
      "notes": "Net worth (mean), higher is better",
      "sub_results": [
        {
          "condition": "Default",
          "gemini_3_pro": "$5,478.16",
          "gemini_2_5_pro": "$573.64",
          "claude_sonnet_4_5": "$3,838.74",
          "gpt_5_1": "$1,473.43"
        }
      ]
    },
    {
      "name": "FACTS Benchmark Suite",
      "description": "Held out internal grounding, parametric, MM, and search retrieval benchmarks",
      "sub_results": [
        {
          "condition": "Default",
          "gemini_3_pro": "70.5%",
          "gemini_2_5_pro": "63.4%",
          "claude_sonnet_4_5": "50.4%",
          "gpt_5_1": "50.8%"
        }
      ]
    },
    {
      "name": "SimpleQA Verified",
      "description": "Parametric knowledge",
      "sub_results": [
        {
          "condition": "Default",
          "gemini_3_pro": "72.1%",
          "gemini_2_5_pro": "54.5%",
          "claude_sonnet_4_5": "29.3%",
          "gpt_5_1": "34.9%"
        }
      ]
    },
    {
      "name": "MMMLU",
      "description": "Multilingual Q&A",
      "sub_results": [
        {
          "condition": "Default",
          "gemini_3_pro": "91.8%",
          "gemini_2_5_pro": "89.5%",
          "claude_sonnet_4_5": "89.1%",
          "gpt_5_1": "91.0%"
        }
      ]
    },
    {
      "name": "Global PIQA",
      "description": "Commonsense reasoning across 100 Languages and Cultures",
      "sub_results": [
        {
          "condition": "Default",
          "gemini_3_pro": "93.4%",
          "gemini_2_5_pro": "91.5%",
          "claude_sonnet_4_5": "90.1%",
          "gpt_5_1": "90.9%"
        }
      ]
    },
    {
      "name": "MRCR v2 (8-needle)",
      "description": "Long context performance",
      "sub_results": [
        {
          "condition": "128k (average)",
          "gemini_3_pro": "77.0%",
          "gemini_2_5_pro": "58.0%",
          "claude_sonnet_4_5": "47.1%",
          "gpt_5_1": "61.6%"
        },
        {
          "condition": "1M (pointwise)",
          "gemini_3_pro": "26.3%",
          "gemini_2_5_pro": "16.4%",
          "claude_sonnet_4_5": "not supported",
          "gpt_5_1": "not supported"
        }
      ]
    }
  ]
}
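A minimal sketch of how a downstream script might consume this JSON, e.g. to find the top-scoring model per benchmark. A two-benchmark excerpt is inlined so the sketch is self-contained; the field names match the structure above, and the percentage parsing assumes every score in a row uses the `"NN.N%"` string format:

```python
import json

# A small excerpt of the JSON above, inlined for brevity.
doc = json.loads("""
{
  "benchmarks": [
    {"name": "GPQA Diamond",
     "sub_results": [{"condition": "No tools",
                      "gemini_3_pro": "91.9%", "gemini_2_5_pro": "86.4%",
                      "claude_sonnet_4_5": "83.4%", "gpt_5_1": "88.1%"}]},
    {"name": "MathArena Apex",
     "sub_results": [{"condition": "Default",
                      "gemini_3_pro": "23.4%", "gemini_2_5_pro": "0.5%",
                      "claude_sonnet_4_5": "1.6%", "gpt_5_1": "1.0%"}]}
  ]
}
""")

def pct(value):
    # Percentage strings like "91.9%" become floats; nulls stay None.
    return float(value.rstrip("%")) if value else None

# For each benchmark, find the best-scoring model in the first sub-result.
# (Note: OmniDocBench is lower-is-better, so a real script would special-case it.)
for bench in doc["benchmarks"]:
    row = bench["sub_results"][0]
    models = {k: pct(v) for k, v in row.items() if k != "condition"}
    best = max(models, key=lambda m: models[m] or 0.0)
    print(f"{bench['name']}: {best} at {models[best]}%")
```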

Token usage

2,596 input, 5,400 output, {"candidatesTokenCount": 2862, "promptTokensDetails": [{"modality": "TEXT", "tokenCount": 1511}, {"modality": "IMAGE", "tokenCount": 1085}], "thoughtsTokenCount": 2538}
