Whoosh's default analyzer does not handle CJK characters (in particular Chinese and Japanese) well. If you pass it typical Chinese or Japanese paragraphs, you'll often find that an entire sentence is treated as one token.
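To see why this happens, here is a small sketch that uses only Python's `re` module. The pattern is an assumption: it is meant to mirror the word regex that Whoosh's RegexTokenizer uses by default, and `regex_tokens` is a hypothetical helper, not part of Whoosh. Because `\w` matches CJK ideographs and Chinese text has no spaces between words, the whole clause comes back as a single match.

```python
import re

# Assumption: this mirrors the default word pattern of Whoosh's RegexTokenizer.
WORD_PATTERN = re.compile(r"\w+(\.?\w+)*")


def regex_tokens(text):
    """Return the substrings a word-regex tokenizer would emit for text."""
    return [m.group(0) for m in WORD_PATTERN.finditer(text)]


english = regex_tokens("Whoosh is a search library.")
chinese = regex_tokens("我们在北京学习中文。")

print(english)  # ['Whoosh', 'is', 'a', 'search', 'library']
print(chinese)  # ['我们在北京学习中文']
```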
A Whoosh analyzer consists of one tokenizer and zero or more filters. Because of that, we can easily borrow this recipe from Lucene's CJKAnalyzer:
An Analyzer that tokenizes text with StandardTokenizer, normalizes content with CJKWidthFilter, folds case with LowerCaseFilter, forms bigrams of CJK with CJKBigramFilter, and filters stopwords with StopFilter.
That inspired me to make this first take:
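from whoosh.analysis import Filter, NgramTokenizer

class CJKFilter(Filter):
    def __call__(self, tokens):
        # Unigrams and bigrams for every CJK token.
        ngt = NgramTokenizer(minsize=1, maxsize=2)
        for t in tokens:
            # Crude CJK test: only the first character of the token is checked.
            if len(t.text) > 0 and ord(t.text[0]) >= 0x2e80:
                # positions=True gives every n-gram a pos attribute,
                # which positional fields read.
                for gram in ngt(t.text, positions=True):
                    yield gram
            else:
                yield t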
This is a flawed way of testing whether a token contains CJK characters: I only check whether the first code point of the token's text is at or above U+2E80, which is the first code point of the CJK radicals block. But as a first take, this already works quite well.
Once we have this filter, we can then create our own analyzer:
from whoosh.analysis import RegexTokenizer, LowercaseFilter
my_analyzer = RegexTokenizer() | LowercaseFilter() | CJKFilter()
You can pipe the entire thing into StopFilter() if you need to remove stop words.
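To get a feel for what such an analyzer would emit, here is a Whoosh-free approximation of the tokenize, lowercase, expand-CJK pipeline. It is only an illustration under stated assumptions: `ngrams` and `approx_cjk_analyze` are hypothetical helpers, the `\w+` pattern is a simplification of a word-regex tokenizer, and the exact tokens and their order in a real Whoosh analyzer may differ.

```python
import re

# Assumption: a simplified stand-in for a word-regex tokenizer.
WORD = re.compile(r"\w+")


def ngrams(text, minsize=1, maxsize=2):
    """Yield every substring of length minsize..maxsize, left to right."""
    for start in range(len(text) - minsize + 1):
        for size in range(minsize, maxsize + 1):
            if start + size <= len(text):
                yield text[start:start + size]


def approx_cjk_analyze(text):
    """Lowercase each word; expand CJK-looking words into 1- and 2-grams."""
    out = []
    for match in WORD.finditer(text):
        token = match.group(0).lower()
        # Same crude test as the filter: look at the first character only.
        if ord(token[0]) >= 0x2E80:
            out.extend(ngrams(token))
        else:
            out.append(token)
    return out


print(approx_cjk_analyze("Whoosh 搜索引擎"))
# ['whoosh', '搜', '搜索', '索', '索引', '引', '引擎', '擎']
```

Keeping the unigrams alongside the bigrams means a single-character query can still match, at the cost of a larger index.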