User - --
You are a highly experienced senior software engineer with a specialization in code analysis.
Assistant - --
I'm ready to help with code analysis and software engineering tasks. Here's what I can assist with:
| Area | Examples |
|---|---|
| Code Review | Identifying bugs, anti-patterns, security vulnerabilities, and code smells |
| Architecture Analysis | Evaluating design patterns, system structure, coupling/cohesion, and scalability |
| Performance Analysis | Spotting bottlenecks, memory leaks, algorithmic inefficiencies |
| Security Auditing | Detecting injection flaws, auth issues, data exposure risks (OWASP Top 10) |
| Refactoring Guidance | Improving readability, maintainability, and adherence to SOLID principles |
| Debugging | Root cause analysis, tracing logic errors, diagnosing runtime failures |
| Legacy Code | Understanding undocumented systems, migration strategies, modernization paths |
- 🔍 Thorough — I examine code line-by-line, considering edge cases and failure modes
- 📐 Principled — I reference established patterns, standards, and best practices
- ⚖️ Balanced — I weigh trade-offs (e.g., readability vs. performance) rather than being dogmatic
- 🎯 Actionable — I provide concrete suggestions, not just observations
Go ahead — paste your code, describe your architecture, or ask your question. I'm ready to dive in.
User - --
Today, we're taking a look at the chardet python library. What do you know about this library and its uses?
Assistant - --
chardet is a Python library that detects the character encoding of text data (byte strings). It's a port of Mozilla's Universal Charset Detector (originally written in C++ for Mozilla Firefox), which uses a set of statistical and heuristic techniques to guess the encoding of a given byte sequence.
Character encoding is one of the classic "hard problems" in software. When you receive raw bytes — from a file, a network response, an API, a database — there is no guaranteed way to know the encoding unless it's explicitly declared (e.g., via HTTP headers, BOM, or metadata). chardet fills this gap.
import chardet
raw_data = b'\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c'
result = chardet.detect(raw_data)
# {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
| Feature | Description |
|---|---|
| Multi-encoding support | UTF-8, Shift_JIS, EUC-JP, GB2312, ISO-8859-*, Windows-1252, Big5, KOI8-R, and many more |
| Confidence score | Returns a confidence value between 0.0 and 1.0 alongside the detected encoding |
| Incremental detection | UniversalDetector class allows feeding data in chunks (streaming) |
| CLI tool | Ships with the chardetect command-line utility |
| Language hint | Sometimes returns the detected language alongside the encoding |
Under the hood, chardet uses a layered approach combining multiple detection strategies:
- BOM Detection — Checks for Byte Order Marks (UTF-8 BOM, UTF-16 LE/BE, UTF-32)
- Escape-based Detection — Identifies escape sequences (ISO-2022-JP, HZ-GB-2312)
- Multi-byte Detection — State machine–based verification for multi-byte encodings (UTF-8, Shift_JIS, EUC-KR, etc.)
- Single-byte Detection — Statistical character frequency analysis for single-byte encodings (Latin-1, Windows-1252, KOI8-R, etc.)
Architecture Diagram (Conceptual)
   Raw Bytes
       │
       ▼
┌──────────────┐
│  BOM Check   │──── Match? → Return encoding
└──────┬───────┘
       │ No BOM
       ▼
┌──────────────┐
│ Escape Check │──── Escape seq found? → Return encoding
└──────┬───────┘
       │ No escapes
       ▼
┌─────────────────────────────────┐
│ Parallel Probers (MBCSGroup,    │
│ SBCSGroup, Latin1, UTF-8)       │
│ ┌─────────┐  ┌─────────────┐    │
│ │Multibyte│  │ Single-byte │    │
│ │ Probers │  │   Probers   │    │
│ └─────────┘  └─────────────┘    │
└──────────────┬──────────────────┘
               │
               ▼
       Highest confidence
       encoding returned
Common use cases:
- Web scraping: Decoding pages where Content-Type headers are missing or wrong
- File processing: Reading CSVs, logs, or text files of unknown origin
- Email parsing: Handling MIME messages with mislabeled encodings
- Data pipelines / ETL: Normalizing text data from heterogeneous sources
- Legacy system integration: Interfacing with older systems that use non-UTF-8 encodings
Limitations:
- Not infallible: Encoding detection is inherently probabilistic; short strings and overlapping byte ranges (e.g., Latin-1 vs. Windows-1252) can produce wrong results
- Performance: Can be slow on large files; the incremental UniversalDetector with early termination helps
Alternatives:
- cchardet: A faster C-based alternative (wrapping Mozilla's uchardet) that's API-compatible, though less actively maintained now
- charset-normalizer: A modern pure-Python alternative (used by requests since v2.26) that takes a different statistical approach
Want me to dive into the source code architecture, analyze specific modules, or review how a particular detection technique is implemented?
User - --
How does the chardet achieve this? Explain in detail, with shortened code excerpts from the library itself if helpful to the explanation.
Assistant - --
chardet's architecture mirrors Mozilla's original design: a pipeline that starts with cheap, deterministic checks and falls back to increasingly sophisticated statistical analysis. At its core, it's a system of competing probers — each one championing a particular encoding — that are fed bytes in parallel. The prober with the highest confidence wins.
Everything begins simply:
def detect(byte_str):
    """Detect the encoding of the given byte string."""
    detector = UniversalDetector()
    detector.feed(byte_str)
    return detector.close()
The real work lives in UniversalDetector.
This is the central coordinator. It maintains a state machine and delegates to probers. Its feed() method implements a cascading detection strategy.
The very first thing feed() does is check for a Byte Order Mark — a deterministic, zero-ambiguity signal:
def feed(self, byte_str):
    if self.done:
        return
    # ...
    if not self._got_data:
        # Check for a BOM. UTF-32 is tested before UTF-16 because the
        # UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
        if byte_str.startswith(codecs.BOM_UTF8):
            self.result = {"encoding": "UTF-8-SIG",
                           "confidence": 1.0, "language": ""}
        elif byte_str.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
            self.result = {"encoding": "UTF-32",
                           "confidence": 1.0, "language": ""}
        elif byte_str.startswith((codecs.BOM_LE, codecs.BOM_BE)):
            self.result = {"encoding": "UTF-16",
                           "confidence": 1.0, "language": ""}
If a BOM is found, detection finishes immediately with confidence 1.0.
If no BOM is found, the detector classifies the input by searching the bytes for two patterns:
# Simplified: the library uses two compiled byte regexes rather than a per-byte loop
if self._input_state == InputState.PURE_ASCII:
    if self.HIGH_BYTE_DETECTOR.search(byte_str):  # any byte >= 0x80
        self._input_state = InputState.HIGH_BYTE
    elif self.ESC_DETECTOR.search(self._last_char + byte_str):  # ESC (0x1B) or "~{"
        self._input_state = InputState.ESC_ASCII
This classification determines which family of probers to activate:
| InputState | Meaning | Probers Activated |
|---|---|---|
| PURE_ASCII | All bytes < 0x80 | None needed (it's ASCII) |
| ESC_ASCII | Escape sequences found | EscCharSetProber |
| HIGH_BYTE | Bytes ≥ 0x80 present | MBCSGroupProber (which contains UTF8Prober), SBCSGroupProber, Latin1Prober |
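The first two stages, the BOM check and the input classification, can be sketched with the standard library alone. This is a minimal illustration rather than chardet's own code: the names `sniff_bom` and `classify_input` are hypothetical, and the regular expressions only approximate the patterns the library searches for.

```python
import codecs
import re

# Hypothetical patterns for illustration: any high byte, and ESC or "~{".
HIGH_BYTE = re.compile(rb"[\x80-\xff]")
ESCAPE = re.compile(rb"\x1b|~\{")


def sniff_bom(data: bytes):
    """Return an encoding name if data starts with a known BOM, else None."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8-SIG"
    # UTF-32 must be tested before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
    # begins with the UTF-16 LE BOM (FF FE).
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "UTF-32"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UTF-16"
    return None


def classify_input(data: bytes) -> str:
    """Assign one of the three input states described in the table above."""
    if HIGH_BYTE.search(data):
        return "HIGH_BYTE"
    if ESCAPE.search(data):
        return "ESC_ASCII"
    return "PURE_ASCII"


print(sniff_bom("hi".encode("utf-16")))
print(classify_input(b"plain text"))
print(classify_input("caf\u00e9".encode("utf-8")))
```

Testing for high bytes first means a single non-ASCII byte is enough to send the input to the statistical probers, even if an escape character also appears.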
The probers are created lazily and fed the byte stream:
if self._input_state == InputState.ESC_ASCII:
    if not self._esc_charset_prober:
        self._esc_charset_prober = EscCharSetProber(self.lang_filter)
    if self._esc_charset_prober.feed(byte_str) == ProbingState.FOUND_IT:
        self.result = {... encoding, confidence ...}
        self.done = True
elif self._input_state == InputState.HIGH_BYTE:
    if not self._charset_probers:
        self._charset_probers = [MBCSGroupProber(...),
                                 SBCSGroupProber(...), Latin1Prober()]
    for prober in self._charset_probers:
        if prober.feed(byte_str) == ProbingState.FOUND_IT:
            self.result = {... encoding, confidence ...}
            self.done = True
            break
If no prober returned FOUND_IT during feeding, close() polls all probers for their confidence and picks the winner:
def close(self):
    # ...
    if self._input_state == InputState.HIGH_BYTE:
        max_prober_confidence = 0.0
        max_prober = None
        for prober in self._charset_probers:
            prober_confidence = prober.get_confidence()
            if prober_confidence > max_prober_confidence:
                max_prober_confidence = prober_confidence
                max_prober = prober
        if max_prober and max_prober_confidence > self.MINIMUM_THRESHOLD:
            # MINIMUM_THRESHOLD = 0.20
            self.result = {
                "encoding": max_prober.charset_name,
                "confidence": max_prober.get_confidence(),
                "language": max_prober.language
            }
This is the heart of the system. All probers inherit from CharSetProber:
class CharSetProber:
    def __init__(self, lang_filter=None):
        self._state = ProbingState.DETECTING
    def feed(self, buf):
        raise NotImplementedError  # Subclasses implement
    def get_confidence(self):
        return 0.0  # Subclasses override
    @property
    def charset_name(self):
        raise NotImplementedError
Each prober returns one of three ProbingState values:
- DETECTING: still gathering evidence
- FOUND_IT: this is definitively the encoding (confidence ≥ 0.99)
- NOT_ME: ruled out
UniversalDetector
│
├── EscCharSetProber
│ ├── CodingStateMachine(ISO2022_JP)
│ ├── CodingStateMachine(ISO2022_CN)
│ ├── CodingStateMachine(ISO2022_KR)
│ └── CodingStateMachine(HZ_GB_2312)
│
├── MBCSGroupProber (CharSetGroupProber)
│ ├── UTF8Prober ................... [StateMachine]
│ ├── SJISProber ................... [StateMachine + Distribution + Context]
│ ├── EUCJPProber .................. [StateMachine + Distribution + Context]
│ ├── GB2312Prober ................. [StateMachine + Distribution]
│ ├── EUCKRProber .................. [StateMachine + Distribution]
│ ├── CP949Prober .................. [StateMachine + Distribution]
│ ├── Big5Prober ................... [StateMachine + Distribution]
│ ├── EUCTWProber .................. [StateMachine + Distribution]
│ └── JOHABProber .................. [StateMachine + Distribution]
│
├── SBCSGroupProber (CharSetGroupProber)
│ ├── SingleByteCharSetProber × ~30 [Sequence Analysis per language model]
│ │ (Win1251-Bulgarian, ISO8859_5-Russian, KOI8R, Win1253-Greek,
│ │ Latin5-Turkish, Win1255-Hebrew, TIS620-Thai, etc.)
│ └── HebrewProber ................. [Special visual/logical Hebrew logic]
│
└── Latin1Prober ....................... [Byte frequency heuristic]
For multi-byte and escape encodings, chardet verifies that byte sequences are structurally valid using finite state machines.
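Each encoding has strict rules for valid byte sequences. For example, UTF-8:
| Character length | Byte pattern |
|---|---|
| 1 byte | 0xxxxxxx |
| 2 bytes | 110xxxxx 10xxxxxx |
| 3 bytes | 1110xxxx 10xxxxxx 10xxxxxx |
| 4 bytes | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
A single byte that violates these patterns → MachineState.ERROR → encoding ruled out.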
class CodingStateMachine:
    def __init__(self, sm):
        self._model = sm
        self._curr_state = MachineState.START
    def next_state(self, c):
        # Look up what class this byte belongs to
        byte_class = self._model['class_table'][c]
        if self._curr_state == MachineState.START:
            self._curr_byte_pos = 0
            self._curr_char_len = self._model['char_len_table'][byte_class]
        # Transition: (current_state, byte_class) → next_state
        curr_state = (self._curr_state * self._model['class_factor']
                      + byte_class)
        self._curr_state = self._model['state_table'][curr_state]
        self._curr_byte_pos += 1
        return self._curr_state
UTF8_SM_MODEL = {
    'class_table': (
        # Maps each byte 0x00–0xFF to a byte class. Roughly: ASCII bytes,
        # continuation bytes (0x80–0xBF), and lead bytes for 2-, 3-, and
        # 4-byte sequences, with the continuation and lead ranges split
        # further so the state table can reject invalid combinations
        1,1,1,1,1,1,1,1, ...  # 256 entries
    ),
    'class_factor': 16,  # number of byte classes
    'state_table': (
        # (state × class_factor + class) → next state
        # START, ERROR, ITS_ME, or intermediate states
        ...
    ),
    'char_len_table': (0, 1, 0, 0, 0, 0, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6),  # char length per class
}
The state machine transitions look conceptually like:
State: START
    + ASCII byte    → START (valid single-byte char, reset)
    + 2-byte lead   → S1 (expecting 1 continuation)
    + 3-byte lead   → S2 (expecting 2 continuations)
    + continuation  → ERROR (unexpected here)
State: S1 (expecting 1 continuation)
    + continuation  → START (valid 2-byte char, reset)
    + anything else → ERROR
State: S2 (expecting 2 continuations)
    + continuation  → S1 (got first, expecting one more)
    + anything else → ERROR
If the machine ever reaches ITS_ME, the encoding is definitively confirmed. If ERROR, it's ruled out.
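The conceptual transitions above can be turned into a small runnable check. This is a simplified sketch, not the table-driven machine chardet uses: the function name `utf8_structure_ok` is hypothetical, and it only counts expected continuation bytes, so it does not reject overlong forms or surrogate code points.

```python
def utf8_structure_ok(data: bytes) -> bool:
    """Return True if every lead byte is followed by the right number of
    continuation bytes and the input ends on a character boundary."""
    pending = 0  # continuation bytes still expected (0 means START)
    for b in data:
        if pending:
            if 0x80 <= b <= 0xBF:
                pending -= 1      # one step closer to START
            else:
                return False      # ERROR: expected a continuation byte
        elif b < 0x80:
            pending = 0           # ASCII: complete character, stay in START
        elif 0xC0 <= b <= 0xDF:
            pending = 1           # 2-byte lead
        elif 0xE0 <= b <= 0xEF:
            pending = 2           # 3-byte lead
        elif 0xF0 <= b <= 0xF7:
            pending = 3           # 4-byte lead
        else:
            return False          # ERROR: stray continuation or invalid lead
    return pending == 0


print(utf8_structure_ok("\u4f60\u597d".encode("utf-8")))
print(utf8_structure_ok("\u3053\u3093".encode("shift_jis")))
```

A single structural violation is enough to rule the encoding out, which is why this kind of check is cheap compared with the statistical analysis that follows.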
Structural validity alone isn't enough. Valid Shift_JIS bytes could also be valid EUC-JP bytes. So multi-byte probers combine two signals:
- Coding state machine: As described above, it rules out impossible encodings.
- Character distribution analysis: Checks whether the decoded characters follow the expected frequency distribution for a given language.
class CharDistributionAnalysis:
    ENOUGH_DATA_THRESHOLD = 1024
    SURE_YES = 0.99
    SURE_NO = 0.01
    MINIMUM_DATA_THRESHOLD = 3
    def feed(self, char, char_len):
        if char_len == 2:
            order = self.get_order(char)  # Map decoded char to frequency rank
            if order >= 0:
                self._total_chars += 1
                if order < self._table_size:
                    if 512 > self._char_to_freq_order[order]:
                        self._freq_chars += 1  # It's a common character
    def get_confidence(self):
        # Ratio of frequently-used chars vs rare chars,
        # normalized by the typical distribution ratio for this language
        if self._total_chars <= 0 or self._freq_chars <= self.MINIMUM_DATA_THRESHOLD:
            return self.SURE_NO
        if self._total_chars != self._freq_chars:
            r = (self._freq_chars /
                 ((self._total_chars - self._freq_chars)
                  * self.typical_distribution_ratio))
            if r < self.SURE_YES:
                return r
        return self.SURE_YES
The key insight: every language has a skewed character frequency distribution. In Japanese, a small set of characters accounts for the vast majority of text. If the byte stream, when decoded as Shift_JIS, yields characters that are overwhelmingly in the "top 512 most common" set → high confidence.
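To make the arithmetic concrete, the confidence formula can be lifted into a standalone function. This is a sketch under stated assumptions: `distribution_confidence` is a hypothetical name, and the ratio of 3.0 used in the examples is an arbitrary illustrative value, not a constant quoted from the library.

```python
SURE_YES = 0.99
SURE_NO = 0.01
MINIMUM_DATA_THRESHOLD = 3


def distribution_confidence(freq_chars, total_chars, typical_ratio):
    """Frequent-to-rare character ratio, normalized by the ratio that is
    typical for the language, and capped at SURE_YES."""
    if total_chars <= 0 or freq_chars <= MINIMUM_DATA_THRESHOLD:
        return SURE_NO  # too little evidence to say anything
    if total_chars != freq_chars:
        rare_chars = total_chars - freq_chars
        r = freq_chars / (rare_chars * typical_ratio)
        if r < SURE_YES:
            return r
    return SURE_YES


# 90 of 100 characters are common: 90 / (10 * 3.0) = 3.0, capped at 0.99
print(distribution_confidence(90, 100, 3.0))
# 40 of 100 characters are common: 40 / (60 * 3.0), roughly 0.22
print(distribution_confidence(40, 100, 3.0))
```

The cap matters because the raw ratio is unbounded: text made up almost entirely of common characters would otherwise produce values far above 1.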
Each encoding subclass provides its own frequency table:
class SJISDistributionAnalysis(CharDistributionAnalysis):
    def __init__(self):
        super().__init__()
        self._char_to_freq_order = SJIS_CHAR_TO_FREQ_ORDER  # large lookup table
        self._table_size = SJIS_TABLE_SIZE
        self.typical_distribution_ratio = SJIS_TYPICAL_DISTRIBUTION_RATIO
Japanese probers have an additional signal: SJISContextAnalysis / EUCJPContextAnalysis. These analyze the category of characters (hiragana, katakana, kanji, symbol) and how often hiragana appears relative to katakana, a distinctly Japanese text pattern.
class MultiByteCharSetProber(CharSetProber):
    def feed(self, byte_str):
        for i, byte in enumerate(byte_str):
            coding_state = self.coding_sm.next_state(byte)
            if coding_state == MachineState.ERROR:
                self._state = ProbingState.NOT_ME
                break
            elif coding_state == MachineState.ITS_ME:
                self._state = ProbingState.FOUND_IT
                break
            elif coding_state == MachineState.START:
                # Completed a valid character: feed it to the distribution analyzer
                char_len = self.coding_sm.get_current_charlen()
                if char_len >= 2:
                    self.distribution_analyzer.feed(byte_str[i+1-char_len:i+1],
                                                    char_len)
        # ...
        return self._state
    def get_confidence(self):
        return self.distribution_analyzer.get_confidence()
Single-byte encodings are the hardest to distinguish because almost any byte sequence is structurally valid for most of them. There's no state machine to help. Instead, chardet relies entirely on statistical language models based on character pair (bigram) frequencies.
Each model contains:
# Example: Windows-1251 Bulgarian
Win1251BulgarianModel = {
    'char_to_order_map': (
        # Maps each byte 0x00–0xFF → frequency order (lower = more common);
        # 252–255 are reserved for digits (252), symbols (253),
        # line breaks (254), and control characters (255)
        255,255,255,255,..., 30, 16, 32, 7, ...  # 256 entries
    ),
    'precedence_matrix': (
        # 64×64 matrix flattened: how likely is char_class[i] followed by
        # char_class[j]? Values 0–3:
        #   0 = NEGATIVE (very unlikely pair)
        #   1 = UNLIKELY
        #   2 = LIKELY
        #   3 = POSITIVE (very common pair)
        0,3,3,3,3,3,3,3,...  # 4096 entries (64×64)
    ),
    'typical_positive_ratio': 0.969,
    'keep_english_letter': False,
    'charset_name': "windows-1251",
    'language': "Bulgarian",
}
class SingleByteCharSetProber(CharSetProber):
    SAMPLE_SIZE = 64
    POSITIVE_SHORTCUT_THRESHOLD = 0.95
    NEGATIVE_SHORTCUT_THRESHOLD = 0.05
    def feed(self, byte_str):
        char_to_order_map = self._model['char_to_order_map']
        for char in byte_str:
            order = char_to_order_map[char]
            if order < CharacterCategory.CONTROL:  # Count everything except control characters
                self._total_char += 1
            if order < self.SAMPLE_SIZE:  # It's a frequent-enough character
                self._freq_char += 1
                if self._last_order < self.SAMPLE_SIZE:
                    self._total_seqs += 1
                    # Look up bigram category in the precedence matrix
                    if not self._reversed:
                        idx = self._last_order * self.SAMPLE_SIZE + order
                    else:
                        idx = order * self.SAMPLE_SIZE + self._last_order
                    seq_category = self._model['precedence_matrix'][idx]
                    self._seq_counters[seq_category] += 1
            self._last_order = order
    def get_confidence(self):
        r = 0.01
        if self._total_seqs > 0:
            # Ratio of POSITIVE bigrams to total
            r = (self._seq_counters[SequenceLikelihood.POSITIVE]
                 / self._total_seqs / self._model['typical_positive_ratio'])
            # Penalize by character frequency ratio
            r = r * (self._freq_char / self._total_char)
            if r >= 1.0:
                r = 0.99
        return r
The intuition:
In Bulgarian text encoded as Windows-1251, the letter pair "ст" (s-t) is extremely common → POSITIVE. But if those same bytes are interpreted as KOI8-R Russian, they map to different characters whose pairing is unlikely → NEGATIVE. The encoding that produces the most natural-looking bigram distribution wins.
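The same scoring idea can be tried on a toy model. This sketch is an assumption-laden simplification: `bigram_confidence` is a hypothetical function, the "model" is a hand-written set of frequent letters and positive pairs rather than a 64×64 precedence matrix, and every character counts toward the total.

```python
def bigram_confidence(text, frequent_letters, positive_pairs,
                      typical_positive_ratio):
    """Score text by the share of adjacent frequent-letter pairs that the
    toy model marks as positive, scaled by how many letters are frequent."""
    frequent = set(frequent_letters)
    total_chars = freq_chars = total_seqs = positive = 0
    prev = None
    for ch in text:
        total_chars += 1
        if ch in frequent:
            freq_chars += 1
            if prev in frequent:
                total_seqs += 1
                if prev + ch in positive_pairs:
                    positive += 1
        prev = ch
    if total_seqs == 0:
        return 0.01
    r = positive / total_seqs / typical_positive_ratio
    r *= freq_chars / total_chars   # penalize text with few frequent letters
    return min(r, 0.99)


# Toy model: four frequent letters, three "natural" pairs.
letters = "then"
pairs = {"th", "he", "en"}
print(bigram_confidence("then", letters, pairs, 1.0))  # natural ordering
print(bigram_confidence("nhte", letters, pairs, 1.0))  # same letters, scrambled
```

Both inputs contain exactly the same letters; only the ordering differs, which is the property that lets bigram analysis separate encodings sharing the same byte range.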
Group probers (MBCSGroupProber, SBCSGroupProber) are containers that run multiple probers simultaneously:
class CharSetGroupProber(CharSetProber):
    def __init__(self, lang_filter=None):
        self.probers = []
        # subclasses populate self.probers
    def feed(self, byte_str):
        for prober in self.probers:
            if not prober or prober.state == ProbingState.NOT_ME:
                continue  # Skip already-eliminated probers
            state = prober.feed(byte_str)
            if state == ProbingState.FOUND_IT:
                self._best_guess_prober = prober
                self._state = ProbingState.FOUND_IT
                return self._state
        # Check if all probers eliminated themselves
        all_not_me = all(p.state == ProbingState.NOT_ME for p in self.probers if p)
        if all_not_me:
            self._state = ProbingState.NOT_ME
        return self._state
    def get_confidence(self):
        best_conf = 0.0
        self._best_guess_prober = None
        for prober in self.probers:
            if not prober or prober.state == ProbingState.NOT_ME:
                continue
            conf = prober.get_confidence()
            if conf > best_conf:
                best_conf = conf
                self._best_guess_prober = prober
        return best_conf
This is the competitive elimination pattern: probers that encounter invalid bytes drop out (NOT_ME), and the remaining ones compete on confidence.
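The elimination half of this pattern can be imitated with Python's built-in codecs. This swaps in a different technique from chardet's state machines: strict decoding stands in for structural validation, and the function name `surviving_encodings` is hypothetical.

```python
def surviving_encodings(data: bytes, candidates):
    """Return the candidate encodings that decode the data without error."""
    survivors = []
    for name in candidates:
        try:
            data.decode(name)      # strict mode raises on invalid sequences
        except UnicodeDecodeError:
            continue               # this candidate drops out (NOT_ME)
        survivors.append(name)
    return survivors


sample = "\u3053\u3093\u306b\u3061\u306f".encode("shift_jis")
print(surviving_encodings(sample, ["utf-8", "euc_jp", "shift_jis"]))
```

Structural checks alone often leave several survivors, especially for single-byte encodings where almost every byte is valid, which is why the remaining probers then compete on statistical confidence.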
Latin1Prober — The Catch-All
ISO-8859-1 / Windows-1252 accepts nearly any byte, so it uses a unique heuristic: classify bytes into frequency classes and look for typical Western European patterns.
class Latin1Prober(CharSetProber):
    def feed(self, byte_str):
        for c in byte_str:
            char_class = Latin1_CharToClass[c]
            freq = Latin1ClassModel[self._last_char_class * CLASS_NUM + char_class]
            if freq == 0:  # illegal combination
                self._state = ProbingState.NOT_ME
                break
            self._freq_counter[freq] += 1
            self._last_char_class = char_class
        return self.state
    def get_confidence(self):
        total = sum(self._freq_counter)
        if total < 0.01:
            confidence = 0.0
        else:
            confidence = (self._freq_counter[3] - self._freq_counter[1] * 20.0) / total
        confidence = max(confidence, 0.0)
        # Scaled down so that more specific probers take priority
        return confidence * 0.73
It essentially measures "how many character transitions look like normal Western European text vs. how many look suspicious."
HebrewProber — Visual vs. Logical
Hebrew is unique because the same text can be stored in visual order (characters stored in the left-to-right order in which they appear on screen) or logical order (characters stored in the order in which they are read and typed). The HebrewProber wraps two SingleByteCharSetProber instances (one for visual, one for logical) and uses heuristics about final letter forms to decide which is correct.
EscCharSetProber — Escape Sequences
ISO-2022 family encodings use escape sequences (e.g., ESC $ B for ISO-2022-JP) to switch between character sets. The prober runs state machines that look for these specific sequences:
class EscCharSetProber(CharSetProber):
    def __init__(self, lang_filter=None):
        super().__init__(lang_filter)
        self.coding_sm = [
            CodingStateMachine(HZ_SM_MODEL),
            CodingStateMachine(ISO2022CN_SM_MODEL),
            CodingStateMachine(ISO2022JP_SM_MODEL),
            CodingStateMachine(ISO2022KR_SM_MODEL),
        ]
If any state machine reaches ITS_ME, the encoding is confirmed with confidence 0.99.
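A much cruder version of the same idea is a lookup of marker sequences. This is a sketch, not the library's approach: chardet runs full state machines, whereas the hypothetical `detect_escape_encoding` below only searches for a hand-picked subset of designator sequences and reports whichever appears first.

```python
# Hand-picked marker sequences, for illustration only.
ESCAPE_MARKERS = {
    "ISO-2022-JP": (b"\x1b$@", b"\x1b$B", b"\x1b(J"),
    "ISO-2022-KR": (b"\x1b$)C",),
    "ISO-2022-CN": (b"\x1b$)A", b"\x1b$)G"),
    "HZ-GB-2312": (b"~{",),
}


def detect_escape_encoding(data: bytes):
    """Return the encoding whose marker occurs earliest in data, or None."""
    best = None  # (position, encoding name)
    for name, markers in ESCAPE_MARKERS.items():
        for marker in markers:
            pos = data.find(marker)
            if pos != -1 and (best is None or pos < best[0]):
                best = (pos, name)
    return best[1] if best else None


print(detect_escape_encoding("\u3053\u3093\u306b\u3061\u306f".encode("iso2022_jp")))
print(detect_escape_encoding(b"plain ascii"))
```

Because these encodings announce themselves with explicit byte sequences, no statistics are needed: a match is close to deterministic.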
    Input bytes
         │
         ▼
┌─────────────────┐   Match    ┌────────────────────────────┐
│  BOM Detection  │───────────▶│ Return encoding (conf=1.0) │
└────────┬────────┘            └────────────────────────────┘
         │ No BOM
         ▼
┌─────────────────┐
│   Byte Scan:    │
│ Classify input  │
└──┬─────┬────┬───┘
   │     │    │
   │     │    └─── PURE_ASCII ─▶ Return "ascii" (conf=1.0)
   │     │
   │     └─── ESC_ASCII ─▶ EscCharSetProber (state machines only)
   │
   └─── HIGH_BYTE
           │
           ▼
┌──────────────────────────────────────────────────┐
│           Parallel Prober Competition            │
│                                                  │
│  ┌──────────────┐  ┌──────────────┐              │
│  │ MBCS Group   │  │ SBCS Group   │              │
│  │ UTF8Prober   │  │ ~30 language │              │
│  │ SJISProber   │  │ models       │              │
│  │ EUCJPProber  │  │ (bigram      │              │
│  │ GB2312...    │  │ analysis)    │              │
│  │              │  │              │  Latin1      │
│  │ [StateMach + │  │ [Sequence    │  Prober      │
│  │  Distribution│  │  Analysis]   │  [Frequency] │
│  │  Analysis]   │  │              │              │
│  └──────────────┘  └──────────────┘              │
│                                                  │
│  Probers return: FOUND_IT / NOT_ME / DETECTING   │
└──────────────────────┬───────────────────────────┘
                       │
                       ▼
              ┌─────────────────┐
              │  Pick highest   │
              │   confidence    │
              │ above threshold │  (MINIMUM_THRESHOLD = 0.20)
              │   (0.0 → 1.0)   │
              └────────┬────────┘
                       │
                       ▼
              {"encoding": "...",
               "confidence": 0.XX,
               "language": "..."}
- Layered strategy: Cheap deterministic checks (BOM, escape sequences) first, expensive statistics last
- Competitive elimination: Invalid byte sequences immediately kill probers via state machines, narrowing the field
- Two statistical signals: Character frequency distribution (multi-byte) and bigram sequence analysis (single-byte)
- Pre-computed language models: The accuracy depends entirely on the quality of the frequency/precedence tables baked into the library
- Inherent uncertainty: This is fundamentally a probabilistic guessing game — the confidence score reflects this honestly, and short texts with overlapping encoding ranges will always be ambiguous
User - --
Which file is the MultiByteCharSetProber in?
Assistant - --
The MultiByteCharSetProber class is defined in:
chardet/mbcharsetprober.py
It serves as the base class for the individual multi-byte encoding probers, which each live in their own files. UTF8Prober is the exception: it runs in the same group but subclasses CharSetProber directly.
| Prober Class | File |
|---|---|
| MultiByteCharSetProber (base) | chardet/mbcharsetprober.py |
| UTF8Prober | chardet/utf8prober.py |
| SJISProber | chardet/sjisprober.py |
| EUCJPProber | chardet/eucjpprober.py |
| GB2312Prober | chardet/gb2312prober.py |
| EUCKRProber | chardet/euckrprober.py |
| CP949Prober | chardet/cp949prober.py |
| Big5Prober | chardet/big5prober.py |
| EUCTWProber | chardet/euctwprober.py |
| JOHABProber | chardet/johabprober.py |
The naming convention is straightforward: the mb prefix stands for multi-byte, and the concrete probers inherit from the class defined there:
# chardet/sjisprober.py
from .mbcharsetprober import MultiByteCharSetProber
class SJISProber(MultiByteCharSetProber):
    # ...
User - --
Can you show me the full code for the MultiByteCharSetProber class?
Assistant - --
Here is the full code for the MultiByteCharSetProber class as it appears in the library:
######################## BEGIN LICENSE BLOCK ########################
# The Original Code is Mozilla Universal charset Detector code.
#
# The Initial Developer of the Original Code is
# Netscape Communications Corporation.
# Portions created by the Initial Developer are Copyright (C) 2001
# the Initial Developer. All Rights Reserved.
#
# Contributor(s):
# Mark Pilgrim - port to Python
# Shy Shalom - original C code
# Dan Blanchard <dblanchard@ets.org> - maintenance
#
# This library is free software; you can redistribute it and/or
# modify it under the terms of the GNU Lesser General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
from .charsetprober import CharSetProber
from .enums import MachineState, ProbingState
class MultiByteCharSetProber(CharSetProber):
    """
    MultiByteCharSetProber
    """
    def __init__(self, lang_filter=None):
        super().__init__(lang_filter=lang_filter)
        self.distribution_analyzer = None
        self.coding_sm = None
        self._last_char = bytearray(2)
    def reset(self):
        super().reset()
        if self.coding_sm:
            self.coding_sm.reset()
        if self.distribution_analyzer:
            self.distribution_analyzer.reset()
        self._last_char = bytearray(2)
    def feed(self, byte_str):
        for i in range(len(byte_str)):
            coding_state = self.coding_sm.next_state(byte_str[i])
            if coding_state == MachineState.ERROR:
                self.logger.debug(
                    "%s %s prober hit error at byte %s",
                    self.charset_name,
                    self.language,
                    i,
                )
                self._state = ProbingState.NOT_ME
                break
            elif coding_state == MachineState.ITS_ME:
                self._state = ProbingState.FOUND_IT
                break
            elif coding_state == MachineState.START:
                char_len = self.coding_sm.get_current_charlen()
                if i == 0:
                    self._last_char[1] = byte_str[0]
                    self.distribution_analyzer.feed(
                        self._last_char, char_len
                    )
                else:
                    self.distribution_analyzer.feed(
                        byte_str[i - 1 : i + 1], char_len
                    )
        self._last_char[0] = byte_str[-1]
        if self.state == ProbingState.DETECTING:
            if self.distribution_analyzer.got_enough_data() and (
                self.get_confidence() > self.SHORTCUT_THRESHOLD
            ):
                self._state = ProbingState.FOUND_IT
        return self.state
    def get_confidence(self):
        return self.distribution_analyzer.get_confidence()
| Aspect | Detail |
|---|---|
| _last_char | A 2-byte bytearray buffer that bridges chunk boundaries: it stores the last byte from the previous feed() call so that a multi-byte character split across chunks is handled correctly |
| State machine loop | Feeds each byte to coding_sm.next_state(); an ERROR immediately eliminates this encoding, while ITS_ME definitively confirms it |
| Distribution feeding | Only triggers when the state machine returns to START (meaning a complete valid character was just consumed). The last two bytes and the character length are passed to distribution_analyzer |
| Early termination | If enough data has been gathered (got_enough_data()) and confidence exceeds SHORTCUT_THRESHOLD, the prober moves to FOUND_IT without needing more input |
| get_confidence() | Delegates entirely to the distribution analyzer: the state machine provides pass/fail structural validation, while the statistics provide the confidence score |
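The chunk-boundary carry described in the `_last_char` row can be shown in isolation. This sketch is a generic sliding window, not the prober's logic: the class name `PairWindow` is hypothetical, and it records every adjacent byte pair rather than only the pairs that complete a character.

```python
class PairWindow:
    """Collects every adjacent 2-byte window across successive chunks."""

    def __init__(self):
        self._last = None   # last byte of the previous chunk, if any
        self.pairs = []

    def feed(self, chunk: bytes):
        if not chunk:
            return
        if self._last is not None:
            # Window that straddles the boundary between two chunks
            self.pairs.append(bytes([self._last, chunk[0]]))
        for i in range(1, len(chunk)):
            self.pairs.append(chunk[i - 1:i + 1])
        self._last = chunk[-1]


chunked = PairWindow()
chunked.feed(b"ab")
chunked.feed(b"cd")

whole = PairWindow()
whole.feed(b"abcd")

print(chunked.pairs == whole.pairs)
```

Remembering one byte between calls is what makes the result independent of where the stream happens to be split.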
Note: I've reproduced this from my training data. While I'm confident this is highly accurate, minor differences (formatting, comments) may exist compared to a specific tagged release. I'd recommend verifying against the source on GitHub if exact fidelity is critical.
Compare to: https://github.com/chardet/chardet/blob/5.0.0/chardet/mbcharsetprober.py