unhaya / README.md
Created May 13, 2026 12:01
gemini-3.1-flash-lite-preview: thinking_budget is a soft hint, max_output_tokens is the real ceiling (14x reduction observed)

thinking_budget is a lie — max_output_tokens is the real ceiling

Model: gemini-3.1-flash-lite-preview
Date: 2026-05-13
Context: Final tuning before beta distribution of a Japanese DTP proofreading tool.

TL;DR

Setting thinking_budget=2048 does not cap thinking tokens. With max_output_tokens=8192, the model consumed 7,862 thinking tokens on a 279-character input, nearly four times the stated budget. Lowering max_output_tokens to 2,048 cut thinking to 560 tokens, a 14× reduction, with identical detection results. The model appears to treat max_output_tokens as the actual ceiling and to scale how long it thinks to the available headroom.
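The arithmetic behind those figures can be checked with a minimal sketch. It sends no request. The `Run` class and the helper names are hypothetical, the payload uses the Gemini REST field names `maxOutputTokens` and `thinkingConfig.thinkingBudget`, and it assumes thinking_budget stayed at 2048 in the second run, which the summary implies but does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One observed request; token figures are copied from the TL;DR."""
    max_output_tokens: int
    thinking_budget: int   # assumed unchanged (2048) across both runs
    thinking_tokens: int   # thinking tokens the model actually consumed


def generation_config(run: Run) -> dict:
    """Build a REST-style generationConfig payload (nothing is sent)."""
    return {
        "maxOutputTokens": run.max_output_tokens,
        "thinkingConfig": {"thinkingBudget": run.thinking_budget},
    }


def over_budget(run: Run) -> int:
    """Thinking tokens consumed beyond the requested thinking_budget."""
    return max(0, run.thinking_tokens - run.thinking_budget)


def reduction_factor(before: Run, after: Run) -> float:
    """How many times fewer thinking tokens the second run used."""
    return before.thinking_tokens / after.thinking_tokens


wide = Run(max_output_tokens=8192, thinking_budget=2048, thinking_tokens=7862)
tight = Run(max_output_tokens=2048, thinking_budget=2048, thinking_tokens=560)

print(generation_config(tight))
print(f"over budget (wide):  {over_budget(wide)} tokens")
print(f"over budget (tight): {over_budget(tight)} tokens")
print(f"reduction: {reduction_factor(wide, tight):.2f}x")
```

With the reported numbers, the wide run overshoots the 2048-token budget by 5,814 tokens, while the tight run stays well under it. This is consistent with the output ceiling, not the budget, being the binding limit.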