This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| """Per-chunk token cost, side by side. Accurate counts via count_tokens. | |
| BPE with pre-tokenization handles whitespace-delimited chunks independently, | |
| so probing each chunk's token cost in isolation gives the true per-chunk | |
| budget. We display each chunk as a tile labelled with its token count, | |
| side by side for 4.6 and 4.7, so readers can see *where* the extra cost | |
| falls inside a real line of code. | |
| """ | |
| from __future__ import annotations |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| """Atom-level probe: where did Opus 4.7's tokenizer break merges that 4.6 had? | |
| For each "atom" string we ask both models: how many tokens? | |
| - If both report 1 token, both vocabs contain that exact merge. | |
| - If 4.6=1 and 4.7=N>1, the merge was dropped in the new vocab. | |
| - If 4.6=K and 4.7=K, atom is fragmented identically (likely byte-level | |
| fallback or character-level, so vocab changes don't affect it). | |
| Prepending a sentinel char and subtracting 1 isolates the atom from BOS/system | |
| overhead. We do this for every atom and tag each with its category. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| """Tokens-per-byte follow-up: who actually got more expensive? | |
| The first post measured Δ% in token count. That's a ratio. This measures | |
| the absolute efficiency of each tokenizer — tokens per UTF-8 byte — across | |
| a broader language set. | |
| Usage: | |
| export ANTHROPIC_API_KEY=sk-ant-... | |
| uv run --with anthropic python tokens_per_byte.py | |
| """ |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| """Compare claude-opus-4-6 vs claude-opus-4-7 tokenizer across content types. | |
| Usage: | |
| export ANTHROPIC_API_KEY=sk-ant-... | |
| uv run --with anthropic python opus_4_7_tokenizer_tax.py | |
| Hits api.anthropic.com/v1/messages/count_tokens for each sample and prints | |
| the per-sample delta plus a dollar estimate at 1B input tokens/month. | |
| """ | |
| from __future__ import annotations |