@DGrady
Created September 18, 2019 17:05
Demonstration of the `char_wb` tokenization strategy in scikit-learn
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How does `scikit-learn`'s `char_wb` n-gram strategy handle CJK characters, as opposed to accented Latin characters?"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:32.497757Z",
"start_time": "2019-09-17T23:54:31.962664Z"
}
},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer, VectorizerMixin\n",
"import unicodedata"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:32.507018Z",
"start_time": "2019-09-17T23:54:32.500679Z"
}
},
"outputs": [],
"source": [
"v = CountVectorizer(\n",
" strip_accents='unicode',\n",
" lowercase=True,\n",
" analyzer=\"char_wb\",\n",
" max_features=20_000,\n",
" ngram_range=(3, 3),\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The vectorizer provides functions for *preprocessing* the input strings, *tokenizing* strings, and then *analyzing* tokens. I believe that by default, the analyzer is built automatically by composing the preprocessor and tokenizer, and then directly constructing n-grams from the tokens.\n",
"\n",
"It looks like the only difference for the `char_wb` analyzer is that it constructs n-grams from the token contents, rather than n-grams of the tokens."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The implementation of the `strip_unicode` feature is in `sklearn/feature_extraction/text.py`; all it's doing is using the `unicodedata` module to apply NFKD normalization (which breaks out a single \"composed\" code point into multiple code points, for example an accented a is normalized to an a followed by a combining accent), and then filtering the resulting string using `unicodedata.combining`."
]
},
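{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal sketch of that logic, just to make it concrete. The helper name `strip_accents_sketch` is made up for illustration; it is not the actual scikit-learn function, only the same NFKD-plus-filter idea expressed with `unicodedata`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def strip_accents_sketch(s):\n",
"    # Decompose composed code points, then drop the combining marks\n",
"    decomposed = unicodedata.normalize('NFKD', s)\n",
"    return ''.join(c for c in decomposed if not unicodedata.combining(c))\n",
"\n",
"strip_accents_sketch('vivía')  # expected: 'vivia'"
]
},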
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:32.598912Z",
"start_time": "2019-09-17T23:54:32.510644Z"
}
},
"outputs": [],
"source": [
"docs = ['vivía', 'rocín flaco', '孔子', '老子', '毛泽东']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This will be displayed the same way by the browser, although they contain different code point sequences."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:32.615509Z",
"start_time": "2019-09-17T23:54:32.600982Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['vivía', 'rocín flaco', '孔子', '老子', '毛泽东']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nfkd_docs = list(map(lambda s: unicodedata.normalize('NFKD', s), docs))\n",
"nfkd_docs"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:32.625548Z",
"start_time": "2019-09-17T23:54:32.617236Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['LATIN SMALL LETTER V',\n",
" 'LATIN SMALL LETTER I',\n",
" 'LATIN SMALL LETTER V',\n",
" 'LATIN SMALL LETTER I WITH ACUTE',\n",
" 'LATIN SMALL LETTER A']"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[unicodedata.name(c) for c in docs[0]]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:32.638173Z",
"start_time": "2019-09-17T23:54:32.627365Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['LATIN SMALL LETTER V',\n",
" 'LATIN SMALL LETTER I',\n",
" 'LATIN SMALL LETTER V',\n",
" 'LATIN SMALL LETTER I',\n",
" 'COMBINING ACUTE ACCENT',\n",
" 'LATIN SMALL LETTER A']"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[unicodedata.name(c) for c in nfkd_docs[0]]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:32.646182Z",
"start_time": "2019-09-17T23:54:32.640229Z"
}
},
"outputs": [],
"source": [
"p = v.build_preprocessor()\n",
"t = v.build_tokenizer()\n",
"a = v.build_analyzer()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:32.659726Z",
"start_time": "2019-09-17T23:54:32.649383Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['vivia', 'rocin flaco', '孔子', '老子', '毛泽东']"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(map(p, docs))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:32.670165Z",
"start_time": "2019-09-17T23:54:32.662398Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[['vivía'], ['rocín', 'flaco'], ['孔子'], ['老子'], ['毛泽东']]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(map(t, docs))"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:32.682723Z",
"start_time": "2019-09-17T23:54:32.674658Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[['vivia'], ['rocin', 'flaco'], ['孔子'], ['老子'], ['毛泽东']]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(map(t, map(p, docs)))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:32.695601Z",
"start_time": "2019-09-17T23:54:32.684889Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[[' vi', 'viv', 'ivi', 'via', 'ia '],\n",
" [' ro', 'roc', 'oci', 'cin', 'in ', ' fl', 'fla', 'lac', 'aco', 'co '],\n",
" [' 孔子', '孔子 '],\n",
" [' 老子', '老子 '],\n",
" [' 毛泽', '毛泽东', '泽东 ']]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(map(a, docs))"
]
},
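{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on the output above, here is a rough sketch of how the `char_wb` analyzer appears to build its n-grams: each whitespace-separated token is padded with a space on either side, and character 3-grams are taken within it. This is only an illustration (the helper `char_wb_trigrams_sketch` is made up, not scikit-learn's code); it reuses `p` and `docs` from the cells above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def char_wb_trigrams_sketch(doc, n=3):\n",
"    # Pad each whitespace-separated token with spaces and slide a\n",
"    # character window of width n across it; short tokens are kept whole.\n",
"    ngrams = []\n",
"    for token in doc.split():\n",
"        token = ' ' + token + ' '\n",
"        if len(token) <= n:\n",
"            ngrams.append(token)\n",
"            continue\n",
"        ngrams.extend(token[i:i + n] for i in range(len(token) - n + 1))\n",
"    return ngrams\n",
"\n",
"# Run the preprocessor first, mirroring the pipeline above\n",
"[char_wb_trigrams_sketch(p(doc)) for doc in docs]"
]
},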
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:38.804198Z",
"start_time": "2019-09-17T23:54:38.796762Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"CountVectorizer(analyzer='char_wb', binary=False, decode_error='strict',\n",
" dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
" lowercase=True, max_df=1.0, max_features=20000, min_df=1,\n",
" ngram_range=(3, 3), preprocessor=None, stop_words=None,\n",
" strip_accents='unicode', token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
" tokenizer=None, vocabulary=None)"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"v.fit(docs)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:44.725189Z",
"start_time": "2019-09-17T23:54:44.720566Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{' vi': 2,\n",
" 'viv': 17,\n",
" 'ivi': 12,\n",
" 'via': 16,\n",
" 'ia ': 10,\n",
" ' ro': 1,\n",
" 'roc': 15,\n",
" 'oci': 14,\n",
" 'cin': 7,\n",
" 'in ': 11,\n",
" ' fl': 0,\n",
" 'fla': 9,\n",
" 'lac': 13,\n",
" 'aco': 6,\n",
" 'co ': 8,\n",
" ' 孔子': 3,\n",
" '孔子 ': 18,\n",
" ' 老子': 5,\n",
" '老子 ': 21,\n",
" ' 毛泽': 4,\n",
" '毛泽东': 19,\n",
" '泽东 ': 20}"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"v.vocabulary_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}