@DGrady
Created September 18, 2019 17:05
Demonstration of the `char_wb` tokenization strategy in scikit-learn
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How does `scikit-learn`'s `char_wb` n-gram strategy handle CJK characters, as opposed to accented Latin characters?"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:32.497757Z",
"start_time": "2019-09-17T23:54:31.962664Z"
}
},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer, VectorizerMixin\n",
"import unicodedata"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:32.507018Z",
"start_time": "2019-09-17T23:54:32.500679Z"
}
},
"outputs": [],
"source": [
"v = CountVectorizer(\n",
" strip_accents='unicode',\n",
" lowercase=True,\n",
" analyzer=\"char_wb\",\n",
" max_features=20_000,\n",
" ngram_range=(3, 3),\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The vectorizer provides functions for *preprocessing* the input strings, *tokenizing* strings, and then *analyzing* tokens. I believe that by default, the analyzer is built automatically by composing the preprocessor and tokenizer, and then directly constructing n-grams from the tokens.\n",
"\n",
"It looks like the only difference for the `char_wb` analyzer is that it constructs n-grams from the token contents, rather than n-grams of the tokens."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The implementation of the `strip_unicode` feature is in `sklearn/feature_extraction/text.py`; all it's doing is using the `unicodedata` module to apply NFKD normalization (which breaks out a single \"composed\" code point into multiple code points, for example an accented a is normalized to an a followed by a combining accent), and then filtering the resulting string using `unicodedata.combining`."
]
},
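{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal sketch of that logic, just to make it concrete. The helper name `strip_accents_sketch` is made up for illustration; it is not the actual scikit-learn function, only the same NFKD-plus-filter idea expressed with `unicodedata`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def strip_accents_sketch(s):\n",
"    # Decompose composed code points, then drop the combining marks\n",
"    decomposed = unicodedata.normalize('NFKD', s)\n",
"    return ''.join(c for c in decomposed if not unicodedata.combining(c))\n",
"\n",
"strip_accents_sketch('vivía')  # expected: 'vivia'"
]
},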
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:32.598912Z",
"start_time": "2019-09-17T23:54:32.510644Z"
}
},
"outputs": [],
"source": [
"docs = ['vivía', 'rocín flaco', '孔子', '老子', '毛泽东']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This will be displayed the same way by the browser, although they contain different code point sequences."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:32.615509Z",
"start_time": "2019-09-17T23:54:32.600982Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['vivía', 'rocín flaco', '孔子', '老子', '毛泽东']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nfkd_docs = list(map(lambda s: unicodedata.normalize('NFKD', s), docs))\n",
"nfkd_docs"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:32.625548Z",
"start_time": "2019-09-17T23:54:32.617236Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['LATIN SMALL LETTER V',\n",
" 'LATIN SMALL LETTER I',\n",
" 'LATIN SMALL LETTER V',\n",
" 'LATIN SMALL LETTER I WITH ACUTE',\n",
" 'LATIN SMALL LETTER A']"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[unicodedata.name(c) for c in docs[0]]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:32.638173Z",
"start_time": "2019-09-17T23:54:32.627365Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['LATIN SMALL LETTER V',\n",
" 'LATIN SMALL LETTER I',\n",
" 'LATIN SMALL LETTER V',\n",
" 'LATIN SMALL LETTER I',\n",
" 'COMBINING ACUTE ACCENT',\n",
" 'LATIN SMALL LETTER A']"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[unicodedata.name(c) for c in nfkd_docs[0]]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:32.646182Z",
"start_time": "2019-09-17T23:54:32.640229Z"
}
},
"outputs": [],
"source": [
"p = v.build_preprocessor()\n",
"t = v.build_tokenizer()\n",
"a = v.build_analyzer()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:32.659726Z",
"start_time": "2019-09-17T23:54:32.649383Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['vivia', 'rocin flaco', '孔子', '老子', '毛泽东']"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(map(p, docs))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:32.670165Z",
"start_time": "2019-09-17T23:54:32.662398Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[['vivía'], ['rocín', 'flaco'], ['孔子'], ['老子'], ['毛泽东']]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(map(t, docs))"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:32.682723Z",
"start_time": "2019-09-17T23:54:32.674658Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[['vivia'], ['rocin', 'flaco'], ['孔子'], ['老子'], ['毛泽东']]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(map(t, map(p, docs)))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:32.695601Z",
"start_time": "2019-09-17T23:54:32.684889Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[[' vi', 'viv', 'ivi', 'via', 'ia '],\n",
" [' ro', 'roc', 'oci', 'cin', 'in ', ' fl', 'fla', 'lac', 'aco', 'co '],\n",
" [' 孔子', '孔子 '],\n",
" [' 老子', '老子 '],\n",
" [' 毛泽', '毛泽东', '泽东 ']]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(map(a, docs))"
]
},
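{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on the output above, here is a rough sketch of how the `char_wb` analyzer appears to build its n-grams: each whitespace-separated token is padded with a space on either side, and character 3-grams are taken within it. This is only an illustration (the helper `char_wb_trigrams_sketch` is made up, not scikit-learn's code); it reuses `p` and `docs` from the cells above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def char_wb_trigrams_sketch(doc, n=3):\n",
"    # Pad each whitespace-separated token with spaces and slide a\n",
"    # character window of width n across it; short tokens are kept whole.\n",
"    ngrams = []\n",
"    for token in doc.split():\n",
"        token = ' ' + token + ' '\n",
"        if len(token) <= n:\n",
"            ngrams.append(token)\n",
"            continue\n",
"        ngrams.extend(token[i:i + n] for i in range(len(token) - n + 1))\n",
"    return ngrams\n",
"\n",
"# Run the preprocessor first, mirroring the pipeline above\n",
"[char_wb_trigrams_sketch(p(doc)) for doc in docs]"
]
},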
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:38.804198Z",
"start_time": "2019-09-17T23:54:38.796762Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"CountVectorizer(analyzer='char_wb', binary=False, decode_error='strict',\n",
" dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
" lowercase=True, max_df=1.0, max_features=20000, min_df=1,\n",
" ngram_range=(3, 3), preprocessor=None, stop_words=None,\n",
" strip_accents='unicode', token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
" tokenizer=None, vocabulary=None)"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"v.fit(docs)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-17T23:54:44.725189Z",
"start_time": "2019-09-17T23:54:44.720566Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{' vi': 2,\n",
" 'viv': 17,\n",
" 'ivi': 12,\n",
" 'via': 16,\n",
" 'ia ': 10,\n",
" ' ro': 1,\n",
" 'roc': 15,\n",
" 'oci': 14,\n",
" 'cin': 7,\n",
" 'in ': 11,\n",
" ' fl': 0,\n",
" 'fla': 9,\n",
" 'lac': 13,\n",
" 'aco': 6,\n",
" 'co ': 8,\n",
" ' 孔子': 3,\n",
" '孔子 ': 18,\n",
" ' 老子': 5,\n",
" '老子 ': 21,\n",
" ' 毛泽': 4,\n",
" '毛泽东': 19,\n",
" '泽东 ': 20}"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"v.vocabulary_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}