Created September 18, 2019 17:05
Demonstration of the `char_wb` tokenization strategy in scikit-learn
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How does `scikit-learn`'s `char_wb` n-gram strategy handle CJK characters, as opposed to accented Latin characters?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-09-17T23:54:32.497757Z",
     "start_time": "2019-09-17T23:54:31.962664Z"
    }
   },
   "outputs": [],
   "source": [
"from sklearn.feature_extraction.text import CountVectorizer, VectorizerMixin\n", | |
"import unicodedata" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": { | |
"ExecuteTime": { | |
"end_time": "2019-09-17T23:54:32.507018Z", | |
"start_time": "2019-09-17T23:54:32.500679Z" | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"v = CountVectorizer(\n", | |
" strip_accents='unicode',\n", | |
" lowercase=True,\n", | |
" analyzer=\"char_wb\",\n", | |
" max_features=20_000,\n", | |
" ngram_range=(3, 3),\n", | |
")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The vectorizer provides functions for *preprocessing* the input strings, *tokenizing* strings, and then *analyzing* tokens. I believe that by default, the analyzer is built automatically by composing the preprocessor and tokenizer, and then directly constructing n-grams from the tokens.\n", | |
"\n", | |
"It looks like the only difference for the `char_wb` analyzer is that it constructs n-grams from the token contents, rather than n-grams of the tokens." | |
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
"The implementation of the `strip_unicode` feature is in `sklearn/feature_extraction/text.py`; all it's doing is using the `unicodedata` module to apply NFKD normalization (which breaks out a single \"composed\" code point into multiple code points, for example an accented a is normalized to an a followed by a combining accent), and then filtering the resulting string using `unicodedata.combining`." | |
   ]
  },
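  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of that logic (a hand-written approximation, not the actual scikit-learn code; `strip_accents_unicode_sketch` is a hypothetical helper), the accent stripping can be reproduced with `unicodedata` alone:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def strip_accents_unicode_sketch(s):\n",
    "    # NFKD decomposes composed code points ('í' -> 'i' + combining acute),\n",
    "    # then combining marks are dropped; CJK characters pass through unchanged.\n",
    "    return ''.join(\n",
    "        c for c in unicodedata.normalize('NFKD', s) if not unicodedata.combining(c)\n",
    "    )\n",
    "\n",
    "strip_accents_unicode_sketch('vivía'), strip_accents_unicode_sketch('孔子')"
   ]
  },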
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-09-17T23:54:32.598912Z",
     "start_time": "2019-09-17T23:54:32.510644Z"
    }
   },
   "outputs": [],
   "source": [
    "docs = ['vivía', 'rocín flaco', '孔子', '老子', '毛泽东']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
"This will be displayed the same way by the browser, although they contain different code point sequences." | |
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-09-17T23:54:32.615509Z",
     "start_time": "2019-09-17T23:54:32.600982Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['vivía', 'rocín flaco', '孔子', '老子', '毛泽东']"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nfkd_docs = [unicodedata.normalize('NFKD', s) for s in docs]\n",
    "nfkd_docs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-09-17T23:54:32.625548Z",
     "start_time": "2019-09-17T23:54:32.617236Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['LATIN SMALL LETTER V',\n",
       " 'LATIN SMALL LETTER I',\n",
       " 'LATIN SMALL LETTER V',\n",
       " 'LATIN SMALL LETTER I WITH ACUTE',\n",
       " 'LATIN SMALL LETTER A']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[unicodedata.name(c) for c in docs[0]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-09-17T23:54:32.638173Z",
     "start_time": "2019-09-17T23:54:32.627365Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['LATIN SMALL LETTER V',\n",
       " 'LATIN SMALL LETTER I',\n",
       " 'LATIN SMALL LETTER V',\n",
       " 'LATIN SMALL LETTER I',\n",
       " 'COMBINING ACUTE ACCENT',\n",
       " 'LATIN SMALL LETTER A']"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[unicodedata.name(c) for c in nfkd_docs[0]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-09-17T23:54:32.646182Z",
     "start_time": "2019-09-17T23:54:32.640229Z"
    }
   },
   "outputs": [],
   "source": [
    "p = v.build_preprocessor()\n",
    "t = v.build_tokenizer()\n",
    "a = v.build_analyzer()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-09-17T23:54:32.659726Z",
     "start_time": "2019-09-17T23:54:32.649383Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['vivia', 'rocin flaco', '孔子', '老子', '毛泽东']"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(map(p, docs))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-09-17T23:54:32.670165Z",
     "start_time": "2019-09-17T23:54:32.662398Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['vivía'], ['rocín', 'flaco'], ['孔子'], ['老子'], ['毛泽东']]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(map(t, docs))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-09-17T23:54:32.682723Z",
     "start_time": "2019-09-17T23:54:32.674658Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['vivia'], ['rocin', 'flaco'], ['孔子'], ['老子'], ['毛泽东']]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(map(t, map(p, docs)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-09-17T23:54:32.695601Z",
     "start_time": "2019-09-17T23:54:32.684889Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[' vi', 'viv', 'ivi', 'via', 'ia '],\n",
       " [' ro', 'roc', 'oci', 'cin', 'in ', ' fl', 'fla', 'lac', 'aco', 'co '],\n",
       " [' 孔子', '孔子 '],\n",
       " [' 老子', '老子 '],\n",
       " [' 毛泽', '毛泽东', '泽东 ']]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(map(a, docs))"
   ]
  },
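  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `char_wb` behavior above can be approximated by padding each token with a space on either side and sliding an n-character window over it (`char_wb_ngrams_sketch` is a hypothetical helper written for illustration, not the scikit-learn implementation):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def char_wb_ngrams_sketch(doc, n=3):\n",
    "    # Split on whitespace after preprocessing, pad each token with spaces,\n",
    "    # then take every n-length window; tokens shorter than the window after\n",
    "    # padding, like '孔子', contribute their padded windows only.\n",
    "    grams = []\n",
    "    for token in p(doc).split():\n",
    "        padded = ' ' + token + ' '\n",
    "        if len(padded) <= n:\n",
    "            grams.append(padded)\n",
    "        else:\n",
    "            grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))\n",
    "    return grams\n",
    "\n",
    "char_wb_ngrams_sketch('vivía'), char_wb_ngrams_sketch('孔子')"
   ]
  },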
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-09-17T23:54:38.804198Z",
     "start_time": "2019-09-17T23:54:38.796762Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "CountVectorizer(analyzer='char_wb', binary=False, decode_error='strict',\n",
       "                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
       "                lowercase=True, max_df=1.0, max_features=20000, min_df=1,\n",
       "                ngram_range=(3, 3), preprocessor=None, stop_words=None,\n",
       "                strip_accents='unicode', token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
       "                tokenizer=None, vocabulary=None)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "v.fit(docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-09-17T23:54:44.725189Z",
     "start_time": "2019-09-17T23:54:44.720566Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{' vi': 2,\n",
       " 'viv': 17,\n",
       " 'ivi': 12,\n",
       " 'via': 16,\n",
       " 'ia ': 10,\n",
       " ' ro': 1,\n",
       " 'roc': 15,\n",
       " 'oci': 14,\n",
       " 'cin': 7,\n",
       " 'in ': 11,\n",
       " ' fl': 0,\n",
       " 'fla': 9,\n",
       " 'lac': 13,\n",
       " 'aco': 6,\n",
       " 'co ': 8,\n",
       " ' 孔子': 3,\n",
       " '孔子 ': 18,\n",
       " ' 老子': 5,\n",
       " '老子 ': 21,\n",
       " ' 毛泽': 4,\n",
       " '毛泽东': 19,\n",
       " '泽东 ': 20}"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "v.vocabulary_"
   ]
  },
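  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the vocabulary fitted, `transform` gives the trigram counts as a sparse matrix: one row per document, one column per entry in `v.vocabulary_`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = v.transform(docs)\n",
    "# Row sums recover the number of trigrams the analyzer produced per document.\n",
    "X.shape, X.toarray().sum(axis=1)"
   ]
  },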
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}