Matin Mahmood (mnm-matin)

@mnm-matin
mnm-matin / chunks.py
Last active April 19, 2026 10:50
Opus 4.7 tokenizer: per-chunk in-context cost + side-by-side visualization. SELECT: 2->5. PRIMARY: 1->6. CJK untouched. (via /v1/messages/count_tokens, claude-opus-4-6 vs 4-7)
"""Per-chunk token cost, side by side. Accurate counts via count_tokens.
BPE with pre-tokenization handles whitespace-delimited chunks independently,
so probing each chunk's token cost in isolation gives the true per-chunk
budget. We display each chunk as a tile labelled with its token count,
side by side for 4.6 and 4.7, so readers can see *where* the extra cost
falls inside a real line of code.
"""
from __future__ import annotations
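The preview cuts off before the probing logic. As a rough sketch of the approach the docstring describes, assuming a generic `count` callable that returns a token count for a string (the tile-rendering helper here is a hypothetical stand-in, not the gist's actual code):

```python
from typing import Callable, List, Tuple

def chunk_costs(line: str, count: Callable[[str], int]) -> List[Tuple[str, int]]:
    # Pre-tokenization splits on whitespace, so probing each chunk in
    # isolation still reflects its in-context per-chunk budget.
    return [(chunk, count(chunk)) for chunk in line.split()]

def side_by_side(line: str, count_old: Callable[[str], int],
                 count_new: Callable[[str], int]) -> str:
    # One tile per chunk, labelled "old->new", padded so columns line up.
    rows = [(c, f"{count_old(c)}->{count_new(c)}") for c in line.split()]
    widths = [max(len(c), len(lbl)) for c, lbl in rows]
    top = "  ".join(c.ljust(w) for (c, _), w in zip(rows, widths))
    bot = "  ".join(lbl.ljust(w) for (_, lbl), w in zip(rows, widths))
    return top + "\n" + bot
```

With real counters backed by `/v1/messages/count_tokens` for the two models, this reproduces the `SELECT: 2->5` style labels in the gist description.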
@mnm-matin
mnm-matin / probe.py
Created April 18, 2026 11:56
Probing the Opus 4.7 tokenizer atom by atom: which BPE merges did Anthropic drop? (177 atoms via /v1/messages/count_tokens, claude-opus-4-6 vs claude-opus-4-7)
"""Atom-level probe: where did Opus 4.7's tokenizer break merges that 4.6 had?
For each "atom" string we ask both models: how many tokens?
- If both report 1 token, both vocabs contain that exact merge.
- If 4.6=1 and 4.7=N>1, the merge was dropped in the new vocab.
- If 4.6=K and 4.7=K, atom is fragmented identically (likely byte-level
fallback or character-level, so vocab changes don't affect it).
Prepending a sentinel char and subtracting 1 isolates the atom from BOS/system
overhead. We do this for every atom and tag each with its category.
"""
@mnm-matin
mnm-matin / tokens_per_byte.py
Created April 17, 2026 15:46
Opus 4.7 tokenizer: tokens per UTF-8 byte across 10 languages/content types. English & code +30-58%, non-Latin scripts (CJK/Arabic/Hindi) +2-4%.
"""Tokens-per-byte follow-up: who actually got more expensive?
The first post measured Δ% in token count. That's a ratio. This measures
the absolute efficiency of each tokenizer — tokens per UTF-8 byte — across
a broader language set.
Usage:
export ANTHROPIC_API_KEY=sk-ant-...
uv run --with anthropic python tokens_per_byte.py
"""
@mnm-matin
mnm-matin / opus_4_7_tokenizer_tax.py
Created April 17, 2026 14:48
Measuring the Opus 4.7 tokenizer tax: +25% tokens on average for the same source text (claude-opus-4-6 vs claude-opus-4-7, via /v1/messages/count_tokens)
"""Compare claude-opus-4-6 vs claude-opus-4-7 tokenizer across content types.
Usage:
export ANTHROPIC_API_KEY=sk-ant-...
uv run --with anthropic python opus_4_7_tokenizer_tax.py
Hits api.anthropic.com/v1/messages/count_tokens for each sample and prints
the per-sample delta plus a dollar estimate at 1B input tokens/month.
"""
from __future__ import annotations
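A minimal sketch of the two pieces the docstring names: a counter backed by the `count_tokens` endpoint (the call matches the public `anthropic` Python SDK; the import is deferred so the arithmetic below runs offline), and the delta-plus-dollar math. The per-million-token price is a placeholder argument, not a claim about actual Opus pricing:

```python
def count_via_api(model: str, text: str) -> int:
    # Counter backed by /v1/messages/count_tokens (needs ANTHROPIC_API_KEY).
    import anthropic
    client = anthropic.Anthropic()
    resp = client.messages.count_tokens(
        model=model, messages=[{"role": "user", "content": text}]
    )
    return resp.input_tokens

def delta_pct(old_tokens: int, new_tokens: int) -> float:
    # Per-sample token-count delta between the 4.6 and 4.7 tokenizers.
    return (new_tokens - old_tokens) / old_tokens * 100.0

def monthly_cost_delta(avg_delta_pct: float, tokens_per_month: float,
                       usd_per_mtok: float) -> float:
    # Extra dollars per month if every input token carries the average tax.
    extra_tokens = tokens_per_month * avg_delta_pct / 100.0
    return extra_tokens / 1_000_000 * usd_per_mtok
```

At the gist's headline figures, a +25% average tax on 1B input tokens/month is 250M extra tokens, so the dollar estimate is just that volume times whatever the input price per MTok is.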