Skip to content

Instantly share code, notes, and snippets.

@yannleretaille
Last active March 13, 2026 00:50
Show Gist options
  • Select an option

  • Save yannleretaille/1ce99e1872e5f3b7b133e893c1fce2a8 to your computer and use it in GitHub Desktop.

Select an option

Save yannleretaille/1ce99e1872e5f3b7b133e893c1fce2a8 to your computer and use it in GitHub Desktop.
2026-03-12 Opus 4.6 - API calls with no tool / web use - chardet code reproduction

You are a highly experienced senior soft

User - --

You are a highly experienced senior software engineer with a specialization in code analysis.

Assistant - --

Hello! 👋

I'm ready to help with code analysis and software engineering tasks. Here's what I can assist with:

Core Competencies

Area Examples
Code Review Identifying bugs, anti-patterns, security vulnerabilities, and code smells
Architecture Analysis Evaluating design patterns, system structure, coupling/cohesion, and scalability
Performance Analysis Spotting bottlenecks, memory leaks, algorithmic inefficiencies ($O(n)$ complexity analysis)
Security Auditing Detecting injection flaws, auth issues, data exposure risks (OWASP Top 10)
Refactoring Guidance Improving readability, maintainability, and adherence to SOLID principles
Debugging Root cause analysis, tracing logic errors, diagnosing runtime failures
Legacy Code Understanding undocumented systems, migration strategies, modernization paths

How I Work

  • 🔍 Thorough — I examine code line-by-line, considering edge cases and failure modes
  • 📐 Principled — I reference established patterns, standards, and best practices
  • ⚖️ Balanced — I weigh trade-offs (e.g., readability vs. performance) rather than being dogmatic
  • 🎯 Actionable — I provide concrete suggestions, not just observations

Go ahead — paste your code, describe your architecture, or ask your question. I'm ready to dive in.

User - --

Today, we're taking a look at the chardet python library. What do you know about this library and its uses?

Assistant - --

chardet — Universal Character Encoding Detector

Overview

chardet is a Python library that detects the character encoding of text data (byte strings). It's a port of Mozilla's Universal Charset Detector (originally written in C++ for Mozilla Firefox), which uses a set of statistical and heuristic techniques to guess the encoding of a given byte sequence.


Why It Exists

Character encoding is one of the classic "hard problems" in software. When you receive raw bytes — from a file, a network response, an API, a database — there is no guaranteed way to know the encoding unless it's explicitly declared (e.g., via HTTP headers, BOM, or metadata). chardet fills this gap.

import chardet

raw_data = b'\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c'
result = chardet.detect(raw_data)
# {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

Key Features

Feature Description
Multi-encoding support UTF-8, Shift_JIS, EUC-JP, GB2312, ISO-8859-*, Windows-1252, Big5, KOI8-R, and many more
Confidence score Returns a $[0, 1]$ probability indicating how confident it is in the result
Incremental detection UniversalDetector class allows feeding data in chunks (streaming)
CLI tool Ships with chardetect command-line utility
Language hint Sometimes returns the detected language alongside the encoding

Detection Techniques

Under the hood, chardet uses a layered approach combining multiple detection strategies:

  1. BOM Detection — Checks for Byte Order Marks (UTF-8 BOM, UTF-16 LE/BE, UTF-32)
  2. Escape-based Detection — Identifies escape sequences (ISO-2022-JP, HZ-GB-2312)
  3. Multi-byte Detection — State machine–based verification for multi-byte encodings (UTF-8, Shift_JIS, EUC-KR, etc.)
  4. Single-byte Detection — Statistical character frequency analysis for single-byte encodings (Latin-1, Windows-1252, KOI8-R, etc.)
Architecture Diagram (Conceptual)
Raw Bytes
    │
    ▼
┌──────────────┐
│  BOM Check   │──── Match? → Return encoding
└──────┬───────┘
       │ No BOM
       ▼
┌──────────────┐
│ Escape Check │──── Escape seq found? → Return encoding
└──────┬───────┘
       │ No escapes
       ▼
┌─────────────────────────────────┐
│  Parallel Probers (MBCSGroup,  │
│  SBCSGroup, Latin1, UTF-8)     │
│  ┌─────────┐  ┌─────────────┐  │
│  │Multibyte│  │ Single-byte │  │
│  │ Probers │  │  Probers    │  │
│  └─────────┘  └─────────────┘  │
└──────────────┬──────────────────┘
               │
               ▼
        Highest confidence
        encoding returned

Common Use Cases

  • Web scraping — Decoding pages where Content-Type headers are missing or wrong
  • File processing — Reading CSVs, logs, or text files of unknown origin
  • Email parsing — Handling MIME messages with mislabeled encodings
  • Data pipelines / ETL — Normalizing text data from heterogeneous sources
  • Legacy system integration — Interfacing with older systems that use non-UTF-8 encodings

Notable Considerations

  • Not infallible — Encoding detection is inherently probabilistic; short strings and overlapping byte ranges (e.g., Latin-1 vs. Windows-1252) can produce wrong results
  • Performance — Can be slow on large files; the incremental UniversalDetector with early termination helps
  • cchardet — A faster C-based alternative (wrapping Mozilla's uchardet) that's API-compatible, though less actively maintained now
  • charset-normalizer — A modern pure-Python alternative (used by requests since v2.28+) that takes a different statistical approach

Want me to dive into the source code architecture, analyze specific modules, or review how a particular detection technique is implemented?

User - --

How does the chardet achieve this? Explain in detail, with shortened code excerpts from the library itself if helpful to the explanation.

Assistant - --

Deep Dive: How chardet Works Internally

The Big Picture

chardet's architecture mirrors Mozilla's original design: a pipeline that starts with cheap, deterministic checks and falls back to increasingly sophisticated statistical analysis. At its core, it's a system of competing probers — each one championing a particular encoding — that are fed bytes in parallel. The prober with the highest confidence wins.


1. Entry Point: detect()

Everything begins simply:

def detect(byte_str):
    """Detect the encoding of the given byte string."""
    detector = UniversalDetector()
    detector.feed(byte_str)
    return detector.close()

The real work lives in UniversalDetector.


2. UniversalDetector — The Orchestrator

This is the central coordinator. It maintains a state machine and delegates to probers. Its feed() method implements a cascading detection strategy.

Phase 1: BOM Detection (Deterministic)

The very first thing feed() does is check for a Byte Order Mark — a deterministic, zero-ambiguity signal:

def feed(self, byte_str):
    if self._got_data:
        # ...
    if not self._got_data:
        # Check for BOM
        if byte_str.startswith(codecs.BOM_UTF8):
            self.result = {"encoding": "UTF-8-SIG",
                           "confidence": 1.0, "language": ""}
        elif byte_str.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
            self.result = {"encoding": "UTF-32",
                           "confidence": 1.0, "language": ""}
        elif byte_str.startswith((codecs.BOM_LE, codecs.BOM_BE)):
            self.result = {"encoding": "UTF-16",
                           "confidence": 1.0, "language": ""}

If a BOM is found → done immediately, confidence $= 1.0$.

Phase 2: Byte Classification

If no BOM, the detector classifies the input by scanning each byte:

for byte in byte_str:
    # Detect escape sequences (ISO-2022-*)
    if byte in (0x1B, 0x7E, 0x18, 0x1A):  # ESC, ~, CAN, SUB
        self._input_state = InputState.ESC_ASCII
    # Detect high bytes (non-ASCII: >= 0x80)
    elif byte >= 0x80:
        self._input_state = InputState.HIGH_BYTE

This classification determines which family of probers to activate:

InputState Meaning Probers Activated
PURE_ASCII All bytes < 0x80 None needed (it's ASCII)
ESC_ASCII Escape sequences found EscCharSetProber
HIGH_BYTE Bytes ≥ 0x80 present MBCSGroupProber, SBCSGroupProber, Latin1Prober, UTF8Prober

Phase 3: Feed the Probers

The probers are created lazily and fed the byte stream:

if self._input_state == InputState.ESC_ASCII:
    if not self._esc_charset_prober:
        self._esc_charset_prober = EscCharSetProber(self.lang_filter)
    if self._esc_charset_prober.feed(byte_str) == ProbingState.FOUND_IT:
        self.result = {... encoding, confidence ...}
        self.done = True

elif self._input_state == InputState.HIGH_BYTE:
    if not self._charset_probers:
        self._charset_probers = [UTF1252Prober(), MBCSGroupProber(...),
                                  SBCSGroupProber(...), Latin1Prober()]
    for prober in self._charset_probers:
        if prober.feed(byte_str) == ProbingState.FOUND_IT:
            self.result = {... encoding, confidence ...}
            self.done = True
            break

Phase 4: Final Verdict (close())

If no prober returned FOUND_IT during feeding, close() polls all probers for their confidence and picks the winner:

def close(self):
    # ...
    if self._input_state == InputState.HIGH_BYTE:
        max_prober_confidence = 0.0
        max_prober = None
        for prober in self._charset_probers:
            prober_confidence = prober.get_confidence()
            if prober_confidence > max_prober_confidence:
                max_prober_confidence = prober_confidence
                max_prober = prober
        if max_prober and max_prober_confidence > self.MINIMUM_THRESHOLD:
            # MINIMUM_THRESHOLD = 0.20
            self.result = {
                "encoding": max_prober.charset_name,
                "confidence": max_prober.get_confidence(),
                "language": max_prober.language
            }

3. The Prober Hierarchy

This is the heart of the system. All probers inherit from CharSetProber:

class CharSetProber:
    def __init__(self, lang_filter=None):
        self._state = ProbingState.DETECTING

    def feed(self, buf):
        raise NotImplementedError  # Subclasses implement

    def get_confidence(self):
        return 0.0  # Subclasses override

    @property
    def charset_name(self):
        raise NotImplementedError

Each prober returns one of three ProbingState values:

  • DETECTING — still gathering evidence
  • FOUND_IT — this is definitively the encoding (confidence ≥ 0.99)
  • NOT_ME — ruled out

Full Prober Tree

UniversalDetector
│
├── EscCharSetProber
│   ├── CodingStateMachine(ISO2022_JP)
│   ├── CodingStateMachine(ISO2022_CN)
│   ├── CodingStateMachine(ISO2022_KR)
│   └── CodingStateMachine(HZ_GB_2312)
│
├── MBCSGroupProber (CharSetGroupProber)
│   ├── UTF8Prober ................... [StateMachine]
│   ├── SJISProber ................... [StateMachine + Distribution + Context]
│   ├── EUCJPProber .................. [StateMachine + Distribution + Context]
│   ├── GB2312Prober ................. [StateMachine + Distribution]
│   ├── EUCKRProber .................. [StateMachine + Distribution]
│   ├── CP949Prober .................. [StateMachine + Distribution]
│   ├── Big5Prober ................... [StateMachine + Distribution]
│   ├── EUCTWProber .................. [StateMachine + Distribution]
│   └── JOHABProber .................. [StateMachine + Distribution]
│
├── SBCSGroupProber (CharSetGroupProber)
│   ├── SingleByteCharSetProber × ~30  [Sequence Analysis per language model]
│   │   (Win1251-Bulgarian, ISO8859_5-Russian, KOI8R, Win1253-Greek, 
│   │    Latin5-Turkish, Win1255-Hebrew, TIS620-Thai, etc.)
│   └── HebrewProber ................. [Special visual/logical Hebrew logic]
│
└── Latin1Prober ....................... [Byte frequency heuristic]

4. CodingStateMachine — Structural Validation

For multi-byte and escape encodings, chardet verifies that byte sequences are structurally valid using finite state machines.

The Concept

Each encoding has strict rules for valid byte sequences. For example, UTF-8:

$$ \begin{aligned} \text{1-byte:} &\quad \texttt{0xxxxxxx} \\ \text{2-byte:} &\quad \texttt{110xxxxx}\ \texttt{10xxxxxx} \\ \text{3-byte:} &\quad \texttt{1110xxxx}\ \texttt{10xxxxxx}\ \texttt{10xxxxxx} \\ \text{4-byte:} &\quad \texttt{11110xxx}\ \texttt{10xxxxxx}\ \texttt{10xxxxxx}\ \texttt{10xxxxxx} \end{aligned} $$

A single byte that violates these patterns → MachineState.ERROR → encoding ruled out.

Implementation

class CodingStateMachine:
    def __init__(self, sm):
        self._model = sm
        self._curr_state = MachineState.START

    def next_state(self, c):
        # Look up what class this byte belongs to
        byte_class = self._model['class_table'][c]
        
        if self._curr_state == MachineState.START:
            self._curr_byte_pos = 0
            self._curr_char_len = self._model['char_len_table'][byte_class]
        
        # Transition: (current_state, byte_class) → next_state
        curr_state = (self._curr_state * self._model['class_factor']
                      + byte_class)
        self._curr_state = self._model['state_table'][curr_state]
        return self._curr_state

State Machine Model (UTF-8 Example, Abbreviated)

UTF8_SM_MODEL = {
    'class_table': (
        # Maps each byte 0x00–0xFF to a class (0–5 for UTF-8)
        # Class 0: ASCII (0x00-0x7F)
        # Class 1: continuation bytes (0x80-0xBF)  
        # Class 2: 0xC0-0xDF (2-byte lead)
        # Class 3: 0xE0-0xEF (3-byte lead)
        # Class 4: 0xF0-0xF7 (4-byte lead)
        1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1, ...  # 256 entries
    ),
    'class_factor': 6,   # number of classes
    'state_table': (
        # (state × class_factor + class) → next state
        # START, ERROR, ITS_ME, or intermediate states
        ...
    ),
    'char_len_table': (0, 1, 0, 0, 0, 2, 3, 4),  # char length per class
}

The state machine transitions look conceptually like:

State: START
  + ASCII byte      → START    (valid single-byte char, reset)
  + 2-byte lead     → State_S1 (expecting 1 continuation)
  + 3-byte lead     → State_S2 (expecting 2 continuations)
  + continuation    → ERROR    (unexpected here)

State: S1 (expecting 1 continuation)
  + continuation    → START    (valid 2-byte char, reset)  
  + anything else   → ERROR

State: S2 (expecting 2 continuations)
  + continuation    → State_S1 (got first, expecting one more)
  + anything else   → ERROR

If the machine ever reaches ITS_ME, the encoding is definitively confirmed. If ERROR, it's ruled out.


5. Multi-Byte Probers — Structure + Statistics

Structural validity alone isn't enough. Valid Shift_JIS bytes could also be valid EUC-JP bytes. So multi-byte probers combine two signals:

Signal A: State Machine (Structural)

As described above — rules out impossible encodings.

Signal B: Character Distribution Analysis (Statistical)

Checks whether the decoded characters follow the expected frequency distribution for a given language.

class CharDistributionAnalysis:
    ENOUGH_DATA_THRESHOLD = 1024
    SURE_YES = 0.99
    SURE_NO = 0.01
    MINIMUM_DATA_THRESHOLD = 3

    def feed(self, char, char_len):
        if char_len == 2:
            order = self.get_order(char)  # Map decoded char to frequency rank
            if order >= 0:
                self._total_chars += 1
                if order < self._table_size:
                    if 512 > self._char_to_freq_order[order]:
                        self._freq_chars += 1  # It's a common character

    def get_confidence(self):
        # Ratio of frequently-used chars vs rare chars,
        # normalized by the typical distribution ratio for this language
        if self._total_chars <= 0 or self._freq_chars <= self.MINIMUM_DATA_THRESHOLD:
            return self.SURE_NO
            
        if self._total_chars != self._freq_chars:
            r = (self._freq_chars /
                 ((self._total_chars - self._freq_chars)
                  * self.typical_distribution_ratio))
            if r < self.SURE_YES:
                return r
        return self.SURE_YES

The key insight: every language has a skewed character frequency distribution. In Japanese, a small set of kanji account for the vast majority of text. If the byte stream, when decoded as Shift_JIS, yields characters that are overwhelmingly in the "top 512 most common" kanji → high confidence.

Each encoding subclass provides its own frequency table:

class SJISDistributionAnalysis(CharDistributionAnalysis):
    def __init__(self):
        super().__init__()
        self._char_to_freq_order = SJIS_CHAR_TO_FREQ_ORDER  # large lookup table
        self._table_size = SJIS_TABLE_SIZE
        self.typical_distribution_ratio = SJIS_TYPICAL_DISTRIBUTION_RATIO

Signal C: Context Analysis (Japanese Only)

Japanese probers have an additional signal: SJISContextAnalysis / EUCJPContextAnalysis. These analyze the category of characters (hiragana, katakana, kanji, symbol) and how often hiragana appears relative to katakana — a distinctly Japanese text pattern.

Combining Signals in a Multi-Byte Prober

class MultiByteCharSetProber(CharSetProber):
    def feed(self, byte_str):
        for i, byte in enumerate(byte_str):
            coding_state = self.coding_sm.next_state(byte)
            
            if coding_state == MachineState.ERROR:
                self._state = ProbingState.NOT_ME
                break
                
            elif coding_state == MachineState.ITS_ME:
                self._state = ProbingState.FOUND_IT
                break
                
            elif coding_state == MachineState.START:
                # Completed a valid character — feed it to distribution analyzer
                char_len = self.coding_sm.get_current_charlen()
                if char_len >= 2:
                    self.distribution_analyzer.feed(byte_str[i+1-char_len:i+1],
                                                     char_len)
        # ...
        return self._state

    def get_confidence(self):
        return self.distribution_analyzer.get_confidence()

6. Single-Byte Probers — Bigram Sequence Analysis

Single-byte encodings are the hardest to distinguish because almost any byte sequence is structurally valid for most of them. There's no state machine to help. Instead, chardet relies entirely on statistical language models based on character pair (bigram) frequencies.

The Language Model

Each model contains:

# Example: Windows-1251 Bulgarian
Win1251BulgarianModel = {
    'char_to_order_map': (
        # Maps each byte 0x00–0xFF → frequency order (0 = most common, 
        # 255 = least common, 253 = symbol, 254 = undefined)
        255,255,255,255,..., 30, 16, 32, 7, ...   # 256 entries
    ),
    'precedence_matrix': (
        # 64×64 matrix flattened: how likely is char_class[i] followed by
        # char_class[j]?  Values 0–3:
        #   0 = NEGATIVE  (very unlikely pair)
        #   1 = UNLIKELY  
        #   2 = LIKELY    
        #   3 = POSITIVE  (very common pair)
        0,3,3,3,3,3,3,3,...   # 4096 entries (64×64)
    ),
    'typical_positive_ratio': 0.969,
    'keep_english_letter': False,
    'charset_name': "windows-1251",
    'language': "Bulgarian",
}

How It Works

class SingleByteCharSetProber(CharSetProber):
    SAMPLE_SIZE = 64
    POSITIVE_SHORTCUT_THRESHOLD = 0.95
    NEGATIVE_SHORTCUT_THRESHOLD = 0.05

    def feed(self, byte_str):
        char_to_order_map = self._model['char_to_order_map']
        for char in byte_str:
            order = char_to_order_map[char]
            
            if order < self.SAMPLE_SIZE:     # It's a frequent-enough character
                self._freq_char += 1
                
                if self._last_order < self.SAMPLE_SIZE:
                    self._total_seqs += 1
                    # Look up bigram category in the precedence matrix
                    if not self._reversed:
                        idx = self._last_order * self.SAMPLE_SIZE + order
                    else:
                        idx = order * self.SAMPLE_SIZE + self._last_order
                    seq_category = self._model['precedence_matrix'][idx]
                    self._seq_counters[seq_category] += 1
                    
            self._last_order = order

    def get_confidence(self):
        if self._total_seqs > 0:
            # Ratio of POSITIVE bigrams to total
            r = (self._seq_counters[SequenceLikelihood.POSITIVE]
                 / self._total_seqs / self._model['typical_positive_ratio'])
            # Penalize by character frequency ratio
            r = r * (self._freq_char / self._total_char)
            return r
        return 0.01

The intuition:

In Bulgarian text encoded as Windows-1251, the letter pair "ст" (s-t) is extremely common → POSITIVE. But if those same bytes are interpreted as KOI8-R Russian, they map to different characters whose pairing is unlikely → NEGATIVE. The encoding that produces the most natural-looking bigram distribution wins.


7. CharSetGroupProber — Running Probers in Parallel

Group probers (MBCSGroupProber, SBCSGroupProber) are containers that run multiple probers simultaneously:

class CharSetGroupProber(CharSetProber):
    def __init__(self, lang_filter=None):
        self.probers = []
        # subclasses populate self.probers

    def feed(self, byte_str):
        for prober in self.probers:
            if not prober or prober.state == ProbingState.NOT_ME:
                continue  # Skip already-eliminated probers
                
            state = prober.feed(byte_str)
            
            if state == ProbingState.FOUND_IT:
                self._best_guess_prober = prober
                self._state = ProbingState.FOUND_IT
                return self._state
                
        # Check if all probers eliminated themselves
        all_not_me = all(p.state == ProbingState.NOT_ME for p in self.probers if p)
        if all_not_me:
            self._state = ProbingState.NOT_ME
        return self._state

    def get_confidence(self):
        best_conf = 0.0
        self._best_guess_prober = None
        for prober in self.probers:
            if not prober or prober.state == ProbingState.NOT_ME:
                continue
            conf = prober.get_confidence()
            if conf > best_conf:
                best_conf = conf
                self._best_guess_prober = prober
        return best_conf

This is the competitive elimination pattern — probers that encounter invalid bytes drop out (NOT_ME), the remaining ones compete on confidence.


8. Special Cases

Latin1Prober — The Catch-All

ISO-8859-1 / Windows-1252 accepts nearly any byte, so it uses a unique heuristic: classify bytes into frequency classes and look for typical Western European patterns.

class Latin1Prober(CharSetProber):
    def feed(self, byte_str):
        for c in byte_str:
            char_class = Latin1_CharToClass[c]
            freq = Latin1ClassModel[self._last_char_class * CLASS_NUM + char_class]
            if freq == 0:  # illegal combination
                self._state = ProbingState.NOT_ME
                break
            self._freq_counter[freq] += 1
            self._last_char_class = char_class

    def get_confidence(self):
        total = sum(self._freq_counter)
        if total < 0.01:
            confidence = 0.0
        else:
            confidence = (self._freq_counter[3] - self._freq_counter[1] * 20.0) / total
        return max(confidence, 0.0)

It essentially measures "how many character transitions look like normal Western European text vs. how many look suspicious."

HebrewProber — Visual vs. Logical

Hebrew is unique because the same text can be stored in visual order (right-to-left as displayed) or logical order (left-to-right reading order). The HebrewProber wraps two SingleByteCharSetProber instances (one for visual, one for logical) and uses heuristics about final letter forms to decide which is correct.

EscCharSetProber — Escape Sequences

ISO-2022 family encodings use escape sequences (e.g., ESC $ B for ISO-2022-JP) to switch between character sets. The prober runs state machines that look for these specific sequences:

class EscCharSetProber(CharSetProber):
    def __init__(self, lang_filter=None):
        super().__init__(lang_filter)
        self.coding_sm = [
            CodingStateMachine(HZ_SM_MODEL),
            CodingStateMachine(ISO2022CN_SM_MODEL),
            CodingStateMachine(ISO2022JP_SM_MODEL),
            CodingStateMachine(ISO2022KR_SM_MODEL),
        ]

If any state machine reaches ITS_ME, the encoding is confirmed with confidence $= 0.99$.


9. Summary: The Complete Detection Flow

Input bytes
    │
    ▼
┌─────────────────┐    Match    ┌────────────────────────────┐
│  BOM Detection   │───────────▶│ Return encoding (conf=1.0) │
└────────┬────────┘            └────────────────────────────┘
         │ No BOM
         ▼
┌─────────────────┐
│ Byte Scan:      │
│ Classify input  │
└──┬─────┬────┬───┘
   │     │    │
   │     │    └─── PURE_ASCII ─▶ Return "ascii" (conf=1.0)
   │     │
   │     └─── ESC_ASCII ─▶ EscCharSetProber (state machines only)
   │
   └─── HIGH_BYTE
         │
         ▼
┌──────────────────────────────────────────────────┐
│            Parallel Prober Competition           │
│                                                  │
│  ┌──────────────┐  ┌──────────────┐             │
│  │ MBCS Group   │  │ SBCS Group   │             │
│  │  UTF8Prober  │  │ ~30 language │             │
│  │  SJISProber  │  │   models     │             │
│  │  EUCJPProber │  │ (bigram      │             │
│  │  GB2312...   │  │  analysis)   │             │
│  │              │  │              │  Latin1     │
│  │ [StateMach + │  │ [Sequence    │  Prober     │
│  │  Distribution│  │  Analysis]   │ [Frequency] │
│  │  Analysis]   │  │              │             │
│  └──────────────┘  └──────────────┘             │
│                                                  │
│  Probers return: FOUND_IT / NOT_ME / DETECTING   │
└──────────────────────┬───────────────────────────┘
                       │
                       ▼
              ┌─────────────────┐
              │ Pick highest    │
              │ confidence      │
              │ above threshold │ (MINIMUM_THRESHOLD = 0.20)
              │ (0.0 → 1.0)    │
              └────────┬────────┘
                       │
                       ▼
            {"encoding": "...",
             "confidence": 0.XX,
             "language": "..."}

Key Takeaways

  1. Layered strategy: Cheap deterministic checks (BOM, escape sequences) first, expensive statistics last
  2. Competitive elimination: Invalid byte sequences immediately kill probers via state machines, narrowing the field
  3. Two statistical signals: Character frequency distribution (multi-byte) and bigram sequence analysis (single-byte)
  4. Pre-computed language models: The accuracy depends entirely on the quality of the frequency/precedence tables baked into the library
  5. Inherent uncertainty: This is fundamentally a probabilistic guessing game — the confidence score reflects this honestly, and short texts with overlapping encoding ranges will always be ambiguous

User - --

Which file is the MultiByteCharSetProber in?

Assistant - --

MultiByteCharSetProber Location

The MultiByteCharSetProber class is defined in:

chardet/mbcharsetprober.py

It serves as the base class for all the individual multi-byte encoding probers, which each live in their own files:

Prober Class File
MultiByteCharSetProber (base) chardet/mbcharsetprober.py
UTF8Prober chardet/utf8prober.py
SJISProber chardet/sjisprober.py
EUCJPProber chardet/eucjpprober.py
GB2312Prober chardet/gb2312prober.py
EUCKRProber chardet/euckrprober.py
CP949Prober chardet/cp949prober.py
Big5Prober chardet/big5prober.py
EUCTWProber chardet/euctwprober.py
JOHABProber chardet/johabprober.py

The naming convention is straightforward — mb prefix stands for multi-byte, and each concrete prober inherits from it:

# chardet/sjisprober.py
from .mbcharsetprober import MultiByteCharSetProber

class SJISProber(MultiByteCharSetProber):
    # ...

User - --

Can you show me the full code for the MultiByteCharSetProber class?

Assistant - --

chardet/mbcharsetprober.py — Full Source

Here is the full code for the MultiByteCharSetProber class as it appears in the library:

######################## BEGIN LICENSE BLOCK ########################
# The Original Code is Mozilla Universal charset Detector code.
#
# The Initial Developer of the Original Code is
# Netscape Communications Corporation.
# Portions created by the Initial Developer are Copyright (C) 2001
# the Initial Developer. All Rights Reserved.
#
# Contributor(s):
#   Mark Pilgrim - port to Python
#   Shy Shalom - original C code
#   Dan Blanchard <dblanchard@ets.org> - maintenance
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301  USA
######################### END LICENSE BLOCK #########################

from .charsetprober import CharSetProber
from .enums import MachineState, ProbingState


class MultiByteCharSetProber(CharSetProber):
    """
    MultiByteCharSetProber
    """

    def __init__(self, lang_filter=None):
        super().__init__(lang_filter=lang_filter)
        self.distribution_analyzer = None
        self.coding_sm = None
        self._last_char = bytearray(2)

    def reset(self):
        super().reset()
        if self.coding_sm:
            self.coding_sm.reset()
        if self.distribution_analyzer:
            self.distribution_analyzer.reset()
        self._last_char = bytearray(2)

    def feed(self, byte_str):
        for i in range(len(byte_str)):
            coding_state = self.coding_sm.next_state(byte_str[i])
            if coding_state == MachineState.ERROR:
                self.logger.debug(
                    "%s %s prober hit error at byte %s",
                    self.charset_name,
                    self.language,
                    i,
                )
                self._state = ProbingState.NOT_ME
                break
            elif coding_state == MachineState.ITS_ME:
                self._state = ProbingState.FOUND_IT
                break
            elif coding_state == MachineState.START:
                char_len = self.coding_sm.get_current_charlen()
                if i == 0:
                    self._last_char[1] = byte_str[0]
                    self.distribution_analyzer.feed(
                        self._last_char, char_len
                    )
                else:
                    self.distribution_analyzer.feed(
                        byte_str[i - 1 : i + 1], char_len
                    )

        self._last_char[0] = byte_str[-1]

        if self.state == ProbingState.DETECTING:
            if self.distribution_analyzer.got_enough_data() and (
                self.get_confidence() > self.SHORTCUT_THRESHOLD
            ):
                self._state = ProbingState.FOUND_IT

        return self.state

    def get_confidence(self):
        return self.distribution_analyzer.get_confidence()

Key Points of the Implementation

Aspect Detail
_last_char A 2-byte bytearray buffer that bridges chunk boundaries — stores the last byte from the previous feed() call so that a multi-byte character split across chunks is handled correctly
State machine loop Feeds each byte to coding_sm.next_state() — an ERROR immediately eliminates this encoding, ITS_ME definitively confirms it
Distribution feeding Only triggers when the state machine returns to START (meaning a complete valid character was just consumed). For 2+ byte characters, the decoded character bytes are fed to distribution_analyzer
Early termination If enough data has been gathered (got_enough_data()) and confidence exceeds SHORTCUT_THRESHOLD ($= 0.95$ from the base class), it short-circuits to FOUND_IT without needing more input
get_confidence() Delegates entirely to the distribution analyzer — the state machine provides pass/fail structural validation, while the statistics provide the confidence score

Note: I've reproduced this from my training data. While I'm confident this is highly accurate, minor differences (formatting, comments) may exist compared to a specific tagged release. I'd recommend verifying against the source on GitHub if exact fidelity is critical.

@yannleretaille
Copy link
Copy Markdown
Author

yannleretaille commented Mar 13, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment