@DaiZack
Created February 24, 2019 22:08
brock-university-tutorial-textmining-with-python
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0",
"_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a",
"collapsed": true
},
"source": [
"![](https://www.python.org/static/img/python-logo@2x.png) ![](https://brocku.ca/goodman/wp-content/uploads/primary-site/sites/6/centre-for-business-analytics-logo.png?x59852) \n",
"\n",
"# Introduction\n",
"This is an entry-level tutorial on text mining with Python, created for Business Analysis students at Brock University's [Goodman School of Business](https://brocku.ca/goodman/). It covers the basics of Python (Anaconda), how to run Python code, how to work with data in Python, and how to use Python for text mining. (I assume readers have no prior knowledge of any programming language.)\n",
"\n",
"Thanks to Professor [Anteneh Ayanso](https://www.linkedin.com/in/aayanso/) for giving me the chance to create this tutorial.\n",
"\n",
"If you have any questions or concerns about this tutorial, you can reach me on LinkedIn ([click here](https://www.linkedin.com/in/zhengang-dai/)).\n",
"\n",
"# About Python\n",
"Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant whitespace. It provides constructs that enable clear programming on both small and large scales. Van Rossum led the language community until stepping down as the leader in July 2018. (Source: Wikipedia)\n",
"\n",
"# Why Python\n",
"<img src=\"https://cdn-images-1.medium.com/max/1000/1*CExT2OJfdOpfI72dEYX6Mg.jpeg\" height=400 width=400>\n",
"\n",
"## Ranking of programming languages \n",
"[tiobe.com](https://www.tiobe.com/tiobe-index/) \n",
"\n",
"### 1. Beginner Friendliness\n",
"Python was designed to be easy to understand and fun to use (its name comes from Monty Python, so many beginner tutorials reference it). Fun is a great motivator, and since you'll be able to build prototypes and tools quickly with Python, many find coding in Python a satisfying experience. Thus, Python has gained popularity for being a beginner-friendly language, and it has replaced Java as the most popular introductory language at top U.S. universities.\n",
"\n",
"### 2. Easy to Understand\n",
"Being a very high-level language, Python reads like English, which takes a lot of syntax-learning stress off coding beginners. Python handles a lot of complexity for you, so it is very beginner-friendly in that it allows beginners to focus on learning programming concepts and not have to worry about too many details.\n",
"\n",
"### 3. Very Flexible\n",
"As a dynamically typed language, Python is really flexible. This means there are no hard rules on how to build features, and you'll have more flexibility solving problems using different methods (though the Python philosophy encourages using the obvious way to solve things). Furthermore, Python reports most errors only at run time, so your program still runs until it reaches the problematic line (syntax errors are the exception: they stop the script before it starts).\n",
"\n",
"### 4. Community\n",
"As you step into the programming world, you'll soon understand how vital support is, as the developer community is all about giving and receiving help. The larger a community, the more likely you are to get help and the more people will be building useful tools to ease the process of development.\n",
"\n",
"### 5. Multipurpose\n",
"With Python, you can do almost anything you want: web development, database operations (with most major databases), game design, commercial applications, information systems, machine learning, text mining, and deep learning.\n",
"\n",
"If you only want to learn one programming language, no doubt, you should choose Python!\n",
"\n",
"# Python 2 vs Python 3\n",
"Python 3.x is the future, and with Python 2.x support dwindling, you should put your time into learning the version that will serve you going forward. So please use Python 3: I am not offering a choice here, just telling you to avoid Python 2. (Although both are called Python, their syntax differs in places.)\n",
"\n",
"# Install Python\n",
"<img src=\"https://www.anaconda.com/wp-content/uploads/2018/06/cropped-Anaconda_horizontal_RGB-1-600x102.png\" height=200>\n",
"Brock University's labs already have Python installed (Anaconda 2 and 3), so I will skip installation in class and only introduce some software here.\n",
"### If you want to install Python on your own machine, I recommend Anaconda 3. Go to the [Anaconda official website](https://www.anaconda.com/distribution/), download the Python 3.x **distribution** for your machine type, and install it. (I suggest selecting the environment option.)\n",
"<img src=\"https://cdn-images-1.medium.com/max/1250/1*7a9zVyGP3iMXu9aB4e_Vhw.png\" height=500 width=500>\n",
"\n",
"## Additional options\n",
"### Jupyter Notebook\n",
"<img src=\"https://jupyter.org/assets/main-logo.svg\" height=200 width=200>\n",
"One of the most popular environments for writing and running Python. If you are using Anaconda, it is already included.\n",
"\n",
"### PyCharm\n",
"![](https://upload.wikimedia.org/wikipedia/commons/thumb/a/a1/PyCharm_Logo.svg/192px-PyCharm_Logo.svg.png)\n",
"Download the free Community edition from the [PyCharm website](https://www.jetbrains.com/pycharm/download/#section=windows) and install it on your machine.\n",
"### PS: How to fix an empty Interpreter field in PyCharm\n",
"[YouTube video](https://www.youtube.com/watch?v=ypSSGgKAjhc)\n",
"\n",
"### Kaggle\n",
"![](https://upload.wikimedia.org/wikipedia/commons/thumb/7/7c/Kaggle_logo.png/200px-Kaggle_logo.png)\n",
"You can run your script on the Kaggle website by creating a kernel (a Jupyter Notebook environment). [Kaggle website](https://www.kaggle.com/)\n",
"\n",
"### PyPI\n",
"![](https://pypi.org/static/images/logo-large.72ad8bf1.svg)\n",
"The official repository for Python libraries. [PyPI website](https://pypi.org/)\n",
"\n",
"### GitHub\n",
"![](https://avatars1.githubusercontent.com/u/9919?s=200&v=4)\n",
"The world's leading software development platform. [GitHub website](https://github.com/)\n",
"\n",
"### Stack Overflow\n",
"<img src=\"https://i0.wp.com/wptavern.com/wp-content/uploads/2016/07/stack-overflow.png?resize=768%2C301&ssl=1\" height=200 width=500>\n",
"Stack Overflow is a question and answer site for professional and enthusiast programmers, and one of the largest developer communities. [Stack Overflow website](https://stackoverflow.com/) \n",
"\n",
"### Spyder\n",
"<img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/7/7e/Spyder_logo.svg/1024px-Spyder_logo.svg.png\" width=\"200\" height=\"200\">\n",
"Another IDE included in the Anaconda distribution."
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "480ce9379306609ddb4a4e96a9104b9145e2eb03"
},
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "30a5b1415cb938dee37f1fb6f28669c27a0cd293"
},
"source": [
"# First Code \"Hello Python!\"\n",
"Jupyter Notebook and Spyder provide a \"console\" for running your Python code, which means you can run it line by line or block by block (you do not have to run your whole script at once).\n",
"Once you run part of your code, the variables you defined stay in the console's memory and can be referenced by later code.\n",
"\n",
"Let's run our first code:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"_uuid": "df0e80c278ff6b824e53350ade12651059ab2ffe"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Hello Python!\n"
]
}
],
"source": [
"print('Hello Python!')"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "59e7929d5ebbd72301930f58143f5eca82268c8f"
},
"source": [
"<h3 style=\"color:red;\">Note that Python is case-sensitive, which means uppercase and lowercase are different!<br>\"print\" is different from \"Print\"</h3>\n",
"<p>Try the following code; you will get an error.</p>"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"_uuid": "ff9294b47e76d3177c92a2c7ab463530d7042cb0"
},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'Print' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-2-3630f070b056>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mPrint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Hello Python!'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mNameError\u001b[0m: name 'Print' is not defined"
]
}
],
"source": [
"Print('Hello Python!')"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "f885354d1b24a0b49360c7ac11eb2b62e6890e3e"
},
"source": [
"# Variable\n",
"You can temporarily store your data in variables and use them later.\n",
"### Variable Names\n",
"A variable can have a short name (like x and y) or a more descriptive name (age, carname, total_volume). Rules for Python variable names:\n",
"\n",
"* A variable name must start with a letter or the underscore character\n",
"* A variable name cannot start with a number\n",
"* A variable name can only contain alphanumeric characters and underscores (A-Z, a-z, 0-9, and _)\n",
"* Variable names are case-sensitive (age, Age and AGE are three different variables)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"_uuid": "e86778981ee094c01d5e3f1cdeaab04b7b2ee41d"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"hello python\n"
]
}
],
"source": [
"p = 'hello python'\n",
"print(p)"
]
},
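{
"cell_type": "markdown",
"metadata": {},
"source": [
"The naming rules above can be checked from Python itself. This is a minimal sketch, not part of the original class material: the candidate names are made up for illustration, and the reserved-word check (`keyword.iskeyword`) is an extra rule that the list above does not mention.\n",
"\n",
"```python\n",
"import keyword\n",
"\n",
"# Hypothetical candidate names, chosen only for illustration\n",
"candidates = ['age', '_total', 'total_volume', '2nd_place', 'car-name', 'class']\n",
"\n",
"for name in candidates:\n",
"    # isidentifier() applies the character rules; iskeyword() catches reserved words\n",
"    ok = name.isidentifier() and not keyword.iskeyword(name)\n",
"    print(name, 'valid' if ok else 'invalid')\n",
"\n",
"# Names are case-sensitive: these are two different variables\n",
"age = 30\n",
"Age = 31\n",
"print(age, Age)\n",
"```\n",
"\n",
"The first three names pass; `2nd_place` starts with a digit, `car-name` contains a hyphen, and `class` is reserved by the language."
]
},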
{
"cell_type": "markdown",
"metadata": {
"_uuid": "09c0eba495bddf8a93ae5195530b98b6bb7292ae"
},
"source": [
"# Operator\n",
"The same operator can mean different things for different data types.\n",
"More information about operators can be found [here](https://www.w3schools.com/python/python_operators.asp)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"_uuid": "bac2c92935378feb39fed6bec9fcffcd3353b3ee"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2\n"
]
}
],
"source": [
"a = 1 + 1\n",
"print(a)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"_uuid": "ef233ca0477a84ddab86d48ea72385330acfdc22"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"11\n"
]
}
],
"source": [
"b = '1' + '1'\n",
"print(b)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"_uuid": "d6f0f1c6b56af255196b7d871c6503bad8bb4541"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1111\n"
]
}
],
"source": [
"c = b*2 # equal \"11\" * 2\n",
"print(c)\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"_uuid": "3ee0c8f3fce2f37036a212ed71d34f0f37804030"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"abcabc\n"
]
}
],
"source": [
"print('abc'*2)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"_uuid": "b1c2b6d1390785dc40b64a42a1e9cf562ca6e740"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"hellopython\n",
"hellopython\n",
"\n"
]
}
],
"source": [
"d = 'hello'\n",
"e = 'python'\n",
"print((d+e+'\\n')*2)"
]
},
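{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cells above show that the same operator behaves differently depending on the data type. The sketch below collects those cases and adds one the tutorial does not run: mixing a string with a number, which Python rejects with a `TypeError`. The variable names are illustrative only.\n",
"\n",
"```python\n",
"number = 1 + 1        # int + int adds\n",
"text = '1' + '1'      # str + str concatenates\n",
"repeated = 'ab' * 3   # str * int repeats the string\n",
"\n",
"try:\n",
"    '1' + 1           # mixing str and int is not allowed\n",
"except TypeError as err:\n",
"    mixed_error = type(err).__name__\n",
"\n",
"converted = int('1') + 1   # convert explicitly before adding\n",
"print(number, text, repeated, mixed_error, converted)\n",
"```\n",
"\n",
"When data read from a file looks numeric but behaves like text, an explicit conversion such as `int(...)` or `float(...)` is usually the missing step."
]
},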
{
"cell_type": "markdown",
"metadata": {
"_uuid": "b3b59e0835a7413fe41d169353344c55c0aaca8c"
},
"source": [
"# Open data\n",
"With Python you can easily read a file into a variable.\n",
"\n",
"Because I am using a Kaggle server, the file 'WeatherAnimalsSports.csv' is stored in the '../input/' folder.\n",
"\n",
"### On your computer you can use a path like \"c:/files/data.csv\" to open your local data."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"_kg_hide-input": false,
"_kg_hide-output": false,
"_uuid": "98c2ee59e4b143dd3beb6ca9635d30af61b6d221"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Target_Subject,TextField\n",
"A,Bob has two dogs and one cat. The cat is bigger than either of the dogs.\n",
"S,Carmelo Anthony scored 42 points to lead the NY Knicks basketball team to a win over the Florida Pelicans.\n",
"S,Come play baseball with us.\n",
"S,\"Derek Jeter, the captain of the New York Yankees baseball team, said 2014 will be his last season playing.\"\n",
"S,Do you have a baseball or a football that we could play with? You can be on my team.\n",
"A,Do you like big dogs or little dogs? Dogs are such wonderful animals.\n",
"W,\"During the winter, the sun is lower in the sky than it is during the summer. That's why winter days are colder than summer days.\"\n",
"A,\"House cats behave very much like their big cousins, lions, tigers and leopards. They are all all efficient predators.\"\n",
"A,I have a friend who had 5 cats in her hourse. She's a true animal lover.\n",
"W,I like the springtime when the weather is not too hot nor too cold.\n",
"A,\"I think animals with spots and stripes, like tigers, leopards and zebras, are especially beautiful.\"\n",
"W,I think I prefer very hot weather to very cold weather. I like to go to the beach when it is hot and sunny.\n",
"S,I used to play Little League baseball and basketball when I was a kid.\n",
"W,\"If it rains tomorrow, let's not go outside. It is also supposed to be pretty cold.\"\n",
"W,\"If there is rain or snow, I am still going out. I will not let the weather stop me.\"\n",
"A,\"If we only have 30 minutes, should we visit the monkeys, or look at the elephants? My preference is the monkeys.\"\n",
"S,\"In the National Basketball Association, three All-Stars are among several sons of former players.\"\n",
"W,Jack and Mary could not go to the picnic because of bad weather. They rescheduled next Sunday when it should be a warm day.\n",
"W,Jack likes the snow and ice of winter. He does not like the hot weather of summer.\n",
"A,\"John went to the zoo and saw a lion, a tiger, elephants and zebras.\"\n",
"A,\"Lions are usually a little smaller than tigers. Cheetahs, jaguars and leopards are big cats but are all smaller than lions and tigers.\"\n",
"A,Mary likes to watch animal documentaries on television. She is especially fond of watching shows about big cats.\n",
"W,More snow is predicted for the Northeast.\n",
"S,My favorite baseball player of all times is Willie Mays.\n",
"A,My favorite zoo is the Bronx Zoo. I usually go see the polar bears first and then I go to the lions and tigers.\n",
"A,My favorite zoo is the San Diego Zoo. I love to watch the monkeys and gorillas.\n",
"A,\"Orca whales prey on seals and lions prey on zebras. Bears prey on deer, antelope and other ungulates, but they are omnivorous animals.\"\n",
"W,\"Phoenix, Arizona, had it's fourth hottest day on record in June, 2013, when the temperature reached 119 degrees. Summers are brutal there.\"\n",
"S,Second-half goals gave the Bayern Munich soccer team a victory in the first game of their Champion League series.\n",
"S,Ted Williams is the last person to have a batting average higher than .400 among all major league baseball players.\n",
"A,The biggest animal in the world is the blue whale.\n",
"A,The honey badger is a small but ferocious animal that can defend itself against bears and cougars.\n",
"S,\"The Japanese pitcher, Masahiro Tanaka, has signed a big contract to play baseball with the New York Yankees team.\"\n",
"A,The lions are near the zebras in the Bronx Zoo.\n",
"S,The number one rated Syracuse University basketball team lost to Boston College by a score of 62 to 59.\n",
"W,The snow fell heavily but then the rain came and washed it away.\n",
"S,The U.S. soccer team beat the Russian team in hockey at Sochi by a score of 4-3.\n",
"W,\"The weather in NYC is hot in the summer and cold in the winter, but we do not get as much snow as in Chicago.\"\n",
"W,\"There was so much rain, we could not go out. It was also very cold.\"\n",
"W,\"This has been a very difficult winter, much colder than usual with lots of snow, ice and rain.\"\n",
"S,UCLA has won the NCAA basketball championship many times.\n",
"W,\"We have had snow, snow and more snow for 10 days in a row.\"\n",
"A,\"We went to the zoo and saw a lion, a tiger, elephants and zebras, but we did not get a chance to see the monkeys.\"\n",
"A,\" If you like hot weather, the Arizona desert is very interesting to visit. There are lots of unusual animals living there, including wildcats and boars.\"\n",
"S,\"Which sport do you like more: soccer, baseball, football or basketball?\"\n",
"W,\"Winter days are often so pretty, even if it is cold. \"\n",
"W,Winter is my favorite season. I love the cold and the snow.\n",
"\n"
]
}
],
"source": [
"with open('../input/WeatherAnimalsSports.csv') as file:\n",
" f = file.read()\n",
"print(f)"
]
},
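{
"cell_type": "markdown",
"metadata": {},
"source": [
"On your own computer the same pattern works with a local path. This is a minimal sketch under stated assumptions: the temporary path and the two sample rows are stand-ins so the example runs anywhere, and the standard-library `csv` module is used here as one possible way to split each row into fields (the tutorial itself switches to pandas later).\n",
"\n",
"```python\n",
"import csv\n",
"import os\n",
"import tempfile\n",
"\n",
"# Hypothetical stand-in file; replace the path with your own, e.g. 'c:/files/data.csv'\n",
"path = os.path.join(tempfile.mkdtemp(), 'sample.csv')\n",
"rows = [\n",
"    ['Target_Subject', 'TextField'],\n",
"    ['S', 'Come play baseball with us.'],\n",
"    ['W', 'More snow is predicted for the Northeast.'],\n",
"]\n",
"with open(path, 'w', encoding='utf-8', newline='') as out:\n",
"    csv.writer(out).writerows(rows)\n",
"\n",
"# Read it back; the with-block closes the file automatically\n",
"with open(path, encoding='utf-8', newline='') as file:\n",
"    records = list(csv.reader(file))\n",
"\n",
"header, body = records[0], records[1:]\n",
"print(header)\n",
"print(len(body), 'data rows')\n",
"```\n",
"\n",
"Stating the encoding explicitly avoids surprises when the same file is opened on a different operating system."
]
},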
{
"cell_type": "markdown",
"metadata": {
"_uuid": "de2b725be165116f364d10156daedae64e18a678"
},
"source": [
"# Data Type\n",
"Here the variable \"f\" is a string. We can check its data type and its length (the number of characters it contains).\n",
"\n",
"Python has many different data types; some common ones are shown below:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"_uuid": "e308a920ec52251751aabc6ec74bff0813fd0ccb"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The type of f is: <class 'str'>\n",
"The length of f is: 4428\n"
]
}
],
"source": [
"print(\"The type of f is: \",type(f))\n",
"print(\"The length of f is: \", len(f) )"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"_uuid": "ddf2bb0fec65149b0259aa3bac8e8d20de4eab16"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'int'>\n",
"<class 'float'>\n",
"<class 'str'>\n",
"<class 'list'>\n",
"<class 'dict'>\n",
"<class 'builtin_function_or_method'>\n"
]
}
],
"source": [
"print(type(1))\n",
"print(type(1.1))\n",
"print(type('abc'))\n",
"print(type([1,2,3]))\n",
"print(type({\"name\":\"Jack\"}))\n",
"print(type(print))"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "0f8e783eb668e407fe80e81cf8795040595c3362"
},
"source": [
"# Python Library\n",
"Python’s standard library is very extensive, offering a wide range of facilities. The library contains built-in modules (written in C) that provide access to system functionality such as file I/O that would otherwise be inaccessible to Python programmers, as well as modules written in Python that provide standardized solutions for many problems that occur in everyday programming. Some of these modules are explicitly designed to encourage and enhance the portability of Python programs by abstracting away platform-specifics into platform-neutral APIs."
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "c3c728b89670040a5577925f7208157a06838496"
},
"source": [
"## Pandas\n",
"<p style=\"color:red\">The most important data management library in Python!</p>\n",
"\n",
"See the official pandas [tutorial](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html)."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"_uuid": "2117e8945d8d02ce65598e41f07c191826e5033d"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Target_Subject</th>\n",
" <th>TextField</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A</td>\n",
" <td>Bob has two dogs and one cat. The cat is bigg...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>S</td>\n",
" <td>Carmelo Anthony scored 42 points to lead the N...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>S</td>\n",
" <td>Come play baseball with us.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>S</td>\n",
" <td>Derek Jeter, the captain of the New York Yanke...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>S</td>\n",
" <td>Do you have a baseball or a football that we c...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Target_Subject TextField\n",
"0 A Bob has two dogs and one cat. The cat is bigg...\n",
"1 S Carmelo Anthony scored 42 points to lead the N...\n",
"2 S Come play baseball with us.\n",
"3 S Derek Jeter, the captain of the New York Yanke...\n",
"4 S Do you have a baseball or a football that we c..."
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"data=pd.read_csv('../input/WeatherAnimalsSports.csv')\n",
"data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "052eb7c1a67d35a41651e7b4b7e20eb15ff35f3f"
},
"source": [
"When pandas reads a file, we get a DataFrame, which organizes data in rows and columns, much like an Excel sheet or a SQL table.\n",
"We can select columns or rows from a pandas DataFrame.\n",
"<p style=\"color:red\">In Python, indexing starts at 0, so the first row is row 0, the second row is row 1, and so on.</p>"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"_uuid": "92e44a0abc296697bdba00f984cd36b8dc3832a3"
},
"outputs": [
{
"data": {
"text/plain": [
"0 A\n",
"1 S\n",
"2 S\n",
"3 S\n",
"4 S\n",
"5 A\n",
"6 W\n",
"7 A\n",
"8 A\n",
"9 W\n",
"10 A\n",
"11 W\n",
"12 S\n",
"13 W\n",
"14 W\n",
"15 A\n",
"16 S\n",
"17 W\n",
"18 W\n",
"19 A\n",
"20 A\n",
"21 A\n",
"22 W\n",
"23 S\n",
"24 A\n",
"25 A\n",
"26 A\n",
"27 W\n",
"28 S\n",
"29 S\n",
"30 A\n",
"31 A\n",
"32 S\n",
"33 A\n",
"34 S\n",
"35 W\n",
"36 S\n",
"37 W\n",
"38 W\n",
"39 W\n",
"40 S\n",
"41 W\n",
"42 A\n",
"43 A\n",
"44 S\n",
"45 W\n",
"46 W\n",
"Name: Target_Subject, dtype: object"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['Target_Subject']"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"_uuid": "70bbea2bf6cdd918c370e1b11bd7912ff71f1846"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Target_Subject</th>\n",
" <th>TextField</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A</td>\n",
" <td>Bob has two dogs and one cat. The cat is bigg...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>S</td>\n",
" <td>Carmelo Anthony scored 42 points to lead the N...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>S</td>\n",
" <td>Come play baseball with us.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>S</td>\n",
" <td>Derek Jeter, the captain of the New York Yanke...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>S</td>\n",
" <td>Do you have a baseball or a football that we c...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Target_Subject TextField\n",
"0 A Bob has two dogs and one cat. The cat is bigg...\n",
"1 S Carmelo Anthony scored 42 points to lead the N...\n",
"2 S Come play baseball with us.\n",
"3 S Derek Jeter, the captain of the New York Yanke...\n",
"4 S Do you have a baseball or a football that we c..."
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[0:5]"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"_uuid": "1b4962bc6f89739f031f0a2312c6d327ca93c1e6"
},
"outputs": [
{
"data": {
"text/plain": [
"'Bob has two dogs and one cat. The cat is bigger than either of the dogs.'"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.loc[0,'TextField']"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"_uuid": "2a9be8e4f332ad29dd70811d6e7617822767fbd8"
},
"outputs": [
{
"data": {
"text/plain": [
"'Bob has two dogs and one cat. The cat is bigger than either of the dogs.'"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.iloc[0,1]"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "77d74b69a623395e2d225033554f434f8ae8d2c2"
},
"source": [
"# For loop\n",
"A for loop is used for iterating over a sequence (that is either a list, a tuple, a dictionary, a set, or a string).\n",
"\n",
"This is less like the for keyword in other programming languages, and works more like an iterator method as found in other object-oriented programming languages.\n",
"\n",
"With the for loop we can execute a set of statements once for each item in a list, tuple, set, etc.\n",
"\n",
"<p style=\"color:red\">Automatically do the same thing many times on different targets</p>"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"_uuid": "51fff56b0cf98e280427d3c5bd6b5cbff4ea1d4f"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0\n",
"1\n",
"2\n"
]
}
],
"source": [
"j = 0\n",
"for i in range(3):\n",
" print(j)\n",
" j += 1"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"_uuid": "291bf20e5d0a0a42cc0b45fe751b4b348d0115e2"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Bob has two dogs and one cat. The cat is bigger than either of the dogs.\n",
"Carmelo Anthony scored 42 points to lead the NY Knicks basketball team to a win over the Florida Pelicans.\n",
"Come play baseball with us.\n"
]
}
],
"source": [
"for i in data['TextField'][:3]:\n",
" print(i)"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "e9cbe79406f36a16538ca0141993de6bf5a8169d"
},
"source": [
"# List Comprehensions\n",
"A compact loop syntax that builds a list in a single expression."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"_uuid": "25ec4e06f368fb570a3bf44f3d5d26bc5823236c"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Bob has two dogs and one cat. The cat is bigger than either of the dogs.', 'Carmelo Anthony scored 42 points to lead the NY Knicks basketball team to a win over the Florida Pelicans.', 'Come play baseball with us.']\n"
]
}
],
"source": [
"subtext = [i for i in data['TextField'][:3]]\n",
"print(subtext)"
]
},
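{
"cell_type": "markdown",
"metadata": {},
"source": [
"A list comprehension is shorthand for a for loop that appends to a list. The sketch below shows both forms side by side and adds an optional filter; the sample texts and variable names are made up for illustration and are not the tutorial's dataset.\n",
"\n",
"```python\n",
"# Hypothetical sample texts\n",
"texts = ['Come play baseball with us.', 'More snow is predicted.', 'I love cats.']\n",
"\n",
"# Explicit for loop: start with an empty list and append one result at a time\n",
"lengths_loop = []\n",
"for t in texts:\n",
"    lengths_loop.append(len(t))\n",
"\n",
"# The same result written as a list comprehension\n",
"lengths_comp = [len(t) for t in texts]\n",
"\n",
"# A comprehension can also filter items with a trailing if\n",
"short_texts = [t for t in texts if len(t) < 20]\n",
"\n",
"print(lengths_loop == lengths_comp)\n",
"print(short_texts)\n",
"```\n",
"\n",
"Both forms produce the same list; the comprehension is preferred when the loop body is a single expression."
]
},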
{
"cell_type": "markdown",
"metadata": {
"_uuid": "6b833ae49058f70462f8a6dc94c087b0cd202e55"
},
"source": [
"# Text Parsing\n",
"Breaking text into smaller components, e.g. words or sentences.\n",
"Here I only show single-word parsing; I do not cover sentence parsing or bigram (multi-word) parsing."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"_uuid": "aa733fbea332d38781ca373fca36fac6ade893cd"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Bob has two dogs and one cat. The cat is bigger than either of the dogs.\n"
]
}
],
"source": [
"text1 = data.loc[0,'TextField']\n",
"print(text1)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"_uuid": "71cb2c8b171a8c6a618549740c69bb64da66d7cd"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Bob', 'has', 'two', 'dogs', 'and', 'one', 'cat.', '', 'The', 'cat', 'is', 'bigger', 'than', 'either', 'of', 'the', 'dogs.']\n"
]
}
],
"source": [
"token1 = text1.split(' ') # split text by spaces\n",
"print(token1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "645676f8511f344d8dbc68cc1255db2c175789f9"
},
"source": [
"Now we have the words in the first record.\n",
"We need to do the same thing for all the records."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"_uuid": "3ebb03c9a2262043d8607ad9067d115afd5e71a9"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[['Bob', 'has', 'two', 'dogs', 'and', 'one', 'cat.', '', 'The', 'cat', 'is', 'bigger', 'than', 'either', 'of', 'the', 'dogs.'], ['Carmelo', 'Anthony', 'scored', '42', 'points', 'to', 'lead', 'the', 'NY', 'Knicks', 'basketball', 'team', 'to', 'a', 'win', 'over', 'the', 'Florida', 'Pelicans.'], ['Come', 'play', 'baseball', 'with', 'us.'], ['Derek', 'Jeter,', 'the', 'captain', 'of', 'the', 'New', 'York', 'Yankees', 'baseball', 'team,', 'said', '2014', 'will', 'be', 'his', 'last', 'season', 'playing.'], ['Do', 'you', 'have', 'a', 'baseball', 'or', 'a', 'football', 'that', 'we', 'could', 'play', 'with?', '', 'You', 'can', 'be', 'on', 'my', 'team.']]\n"
]
}
],
"source": [
"tokens = [tx.split(' ') for tx in data.loc[:, 'TextField']] # use a list comprehension to tokenize every text in the file\n",
"print(tokens[:5]) # print the token lists of the first 5 records"
]
},
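{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output above contains empty strings (`''`) because `split(' ')` cuts at every single space, so two spaces in a row produce an empty token. A minimal sketch of the difference, assuming the first record contains a double space after its first sentence (which is what the empty token suggests): calling `split()` with no argument splits on runs of whitespace instead.\n",
"\n",
"```python\n",
"# Assumed reconstruction of the first record, with two spaces after 'cat.'\n",
"text = 'Bob has two dogs and one cat.  The cat is bigger than either of the dogs.'\n",
"\n",
"by_space = text.split(' ')     # cuts at every single space\n",
"by_whitespace = text.split()   # cuts at runs of whitespace\n",
"\n",
"print('' in by_space)          # the double space leaves an empty token\n",
"print('' in by_whitespace)\n",
"print(len(by_space), len(by_whitespace))\n",
"```\n",
"\n",
"In this tutorial the empty tokens are harmless because the cleaning step removes them later, but `split()` avoids creating them in the first place."
]
},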
{
"cell_type": "markdown",
"metadata": {
"_uuid": "1b6c42c31daf40c3f79d4d7974fba265bd500607"
},
"source": [
"# Data cleaning\n",
"We need to clean the text data before text mining.\n",
"1. Unify the case (turn capitals into lowercase)\n",
"2. Delete stopwords\n",
"3. Delete punctuation and numbers\n",
"4. Stemming and lemmatization (not covered here)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"_uuid": "8d6581579d24ea351342ff2e8b9ba1b4704d4d39"
},
"outputs": [],
"source": [
"from nltk.corpus import stopwords # use stop words from the nltk library\n",
"stopword = stopwords.words('english') # English stop word list (needs the NLTK 'stopwords' corpus, e.g. via nltk.download('stopwords'))"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"_uuid": "cdb51be70b32908373ce624f713476a477502bb5"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[['bob', 'two', 'dogs', 'one', 'cat', 'bigger', 'either'], ['carmelo', 'anthony', 'scored', 'points', 'lead', 'ny', 'knicks', 'basketball', 'team', 'win', 'florida'], ['come', 'play', 'baseball'], ['derek', 'captain', 'new', 'york', 'yankees', 'baseball', 'said', 'last', 'season'], ['baseball', 'football', 'could', 'play']]\n"
]
}
],
"source": [
"cleaned_tokens = [] # create a new list to store the result\n",
"for token in tokens: # loop over the token list of each record\n",
"    cleaned_token = [word.lower() for word in token] # lowercase every word\n",
"    cleaned_token = [word for word in cleaned_token if word not in stopword] # remove stop words\n",
"    cleaned_token = [word for word in cleaned_token if word.isalpha()] # keep purely alphabetic words (this also drops words with attached punctuation, e.g. 'dogs.')\n",
"    cleaned_tokens.append(cleaned_token) # add this record's result to the new list\n",
"print(cleaned_tokens[:5]) # check the first 5 results"
]
},
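{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the output above, words with attached punctuation disappear entirely: `isalpha()` rejects 'pelicans.' and 'dogs.', so the second record loses 'pelicans'. One possible refinement, sketched here as an alternative and not as the tutorial's method, is to trim punctuation from the edges of each word before filtering. The small stop word set and the token list are stand-ins so the example runs without NLTK.\n",
"\n",
"```python\n",
"import string\n",
"\n",
"# Tiny stand-in stop word set; the tutorial uses the NLTK English list instead\n",
"stop_words = {'has', 'and', 'the', 'is', 'than', 'of'}\n",
"\n",
"# Hypothetical tokens in the same shape as one entry of tokens\n",
"raw = ['Bob', 'has', 'two', 'dogs', 'and', 'one', 'cat.', '', 'The', 'dogs.']\n",
"\n",
"cleaned = []\n",
"for word in raw:\n",
"    word = word.lower().strip(string.punctuation)  # lowercase, trim edge punctuation\n",
"    if word and word.isalpha() and word not in stop_words:\n",
"        cleaned.append(word)\n",
"\n",
"print(cleaned)\n",
"```\n",
"\n",
"With this change 'cat.' and 'dogs.' are kept as 'cat' and 'dogs', so word counts for the records would differ from the outputs shown in this notebook."
]
},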
{
"cell_type": "markdown",
"metadata": {
"_uuid": "66d083512ec0125470efa782af3696603403660d"
},
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "2f51d1286ce6db13ae5c487762e3d1e8c0383e47"
},
"source": [
"# Vectorization (the manual way)\n",
"This step converts the tokens into numbers.\n",
"\n",
"First, we create a list containing every word in every document.\n",
"\n",
"Then, we count how often each word appears in each document (or use a binary indicator of whether it appears at all).\n",
"\n",
"Last, we use the TF-IDF technique to reweight the matrix. (See the notes from your 5P12 class.)\n",
"\n"
]
},
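{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three steps above can be sketched on a toy example before applying them to the real data. Everything here is an assumption made for illustration: the three short documents are made up, and the weighting uses one common TF-IDF variant (term count times log of N over document frequency), which may differ from the formula in your class notes.\n",
"\n",
"```python\n",
"import math\n",
"from collections import Counter\n",
"\n",
"# Hypothetical cleaned documents, in the same shape as cleaned_tokens\n",
"docs = [['dogs', 'cat', 'cat'], ['snow', 'cold'], ['cat', 'snow']]\n",
"\n",
"# Step 1: the word list (sorted so the column order is stable)\n",
"vocab = sorted({word for doc in docs for word in doc})\n",
"\n",
"# Step 2: how often each word appears in each document\n",
"counts = [Counter(doc) for doc in docs]\n",
"\n",
"# Step 3: TF-IDF as tf * log(N / df); other variants add smoothing\n",
"n_docs = len(docs)\n",
"doc_freq = {w: sum(1 for doc in docs if w in doc) for w in vocab}\n",
"tfidf = [[c[w] * math.log(n_docs / doc_freq[w]) for w in vocab] for c in counts]\n",
"\n",
"print(vocab)\n",
"print([round(x, 3) for x in tfidf[0]])\n",
"```\n",
"\n",
"Each row of `tfidf` is one document and each column is one word of `vocab`; words that appear in fewer documents receive a larger weight."
]
},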
{
"cell_type": "markdown",
"metadata": {
"_uuid": "5bf53b1dbed5183c7d7c051701f44ad1ab1025f6"
},
"source": [
"## Create wordlist"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"_uuid": "2e807256f72725ab04d1727360ceba0855eeb242"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['bob', 'two', 'dogs', 'one', 'cat', 'bigger', 'either', 'carmelo', 'anthony', 'scored', 'points', 'lead', 'ny', 'knicks', 'basketball', 'team', 'win', 'florida', 'come', 'play', 'baseball', 'derek', 'captain', 'new', 'york', 'yankees', 'baseball', 'said', 'last', 'season', 'baseball', 'football', 'could', 'play', 'like', 'big', 'dogs', 'little', 'dogs', 'wonderful', 'sun', 'lower', 'sky', 'winter', 'days', 'colder', 'summer', 'house', 'cats', 'behave', 'much', 'like', 'big', 'tigers', 'efficient', 'friend', 'cats', 'true', 'animal', 'like', 'springtime', 'weather', 'hot', 'think', 'animals', 'spots', 'like', 'leopards', 'especially', 'think', 'prefer', 'hot', 'weather', 'cold', 'like', 'go', 'beach', 'hot', 'used', 'play', 'little', 'league', 'baseball', 'basketball', 'rains', 'go', 'also', 'supposed', 'pretty', 'rain', 'still', 'going', 'let', 'weather', 'stop', 'visit', 'look', 'preference', 'national', 'basketball', 'three', 'among', 'several', 'sons', 'former', 'jack', 'mary', 'could', 'go', 'picnic', 'bad', 'rescheduled', 'next', 'sunday', 'warm', 'jack', 'likes', 'snow', 'ice', 'like', 'hot', 'weather', 'john', 'went', 'zoo', 'saw', 'elephants', 'lions', 'usually', 'little', 'smaller', 'jaguars', 'leopards', 'big', 'cats', 'smaller', 'lions', 'mary', 'likes', 'watch', 'animal', 'documentaries', 'especially', 'fond', 'watching', 'shows', 'big', 'snow', 'predicted', 'favorite', 'baseball', 'player', 'times', 'willie', 'favorite', 'zoo', 'bronx', 'usually', 'go', 'see', 'polar', 'bears', 'first', 'go', 'lions', 'favorite', 'zoo', 'san', 'diego', 'love', 'watch', 'monkeys', 'orca', 'whales', 'prey', 'seals', 'lions', 'prey', 'bears', 'prey', 'antelope', 'omnivorous', 'fourth', 'hottest', 'day', 'record', 'temperature', 'reached', 'summers', 'brutal', 'goals', 'gave', 'bayern', 'munich', 'soccer', 'team', 'victory', 'first', 'game', 'champion', 'league', 'ted', 'williams', 'last', 'person', 'batting', 'average', 'higher', 'among', 'major', 'league', 'baseball', 
'biggest', 'animal', 'world', 'blue', 'honey', 'badger', 'small', 'ferocious', 'animal', 'defend', 'bears', 'japanese', 'masahiro', 'signed', 'big', 'contract', 'play', 'baseball', 'new', 'york', 'yankees', 'lions', 'near', 'zebras', 'bronx', 'number', 'one', 'rated', 'syracuse', 'university', 'basketball', 'team', 'lost', 'boston', 'college', 'score', 'snow', 'fell', 'heavily', 'rain', 'came', 'washed', 'soccer', 'team', 'beat', 'russian', 'team', 'hockey', 'sochi', 'score', 'weather', 'nyc', 'hot', 'summer', 'cold', 'get', 'much', 'snow', 'much', 'could', 'go', 'also', 'difficult', 'much', 'colder', 'usual', 'lots', 'ice', 'ucla', 'ncaa', 'basketball', 'championship', 'many', 'snow', 'snow', 'days', 'went', 'zoo', 'saw', 'elephants', 'get', 'chance', 'see', 'like', 'hot', 'arizona', 'desert', 'interesting', 'lots', 'unusual', 'animals', 'living', 'including', 'wildcats', 'sport', 'like', 'football', 'winter', 'days', 'often', 'even', 'winter', 'favorite', 'love', 'cold']\n"
]
}
],
"source": [
"wordlists = []  # create an empty list to hold every token from every document\n",
"for t in cleaned_tokens:  # loop over each document's list of cleaned tokens\n",
"    wordlists += t  # append all of this document's tokens to the list\n",
"print(wordlists)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"_uuid": "db53264945b8eb5b26de80f9020d7b98f936877d"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"317\n"
]
}
],
"source": [
"print(len(wordlists))  # total number of tokens; the list still contains duplicates, which are removed next"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"_uuid": "5127bd4d673ac717a0715108d58e7e1ade1fc8d3"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"201\n"
]
}
],
"source": [
"wordlist = list(set(wordlists))  # remove duplicate words; note that a set does not keep the original order\n",
"print(len(wordlist))  # number of unique words after removing duplicates"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "55740d925d961056b0c95638b8d0dcbccbbf135b"
},
"source": [
"## Count frequency of each word in each document"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"_uuid": "6c34702fd53317688164259f75cabd34d4b1bf43"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"47\n",
"201\n",
"47\n",
"0\n",
"['one', 'go', 'honey', 'come', 'john', 'football', 'summers', 'dogs', 'sky', 'mary', 'sun', 'soccer', 'temperature', 'play', 'fourth', 'points', 'champion', 'new', 'came', 'hot', 'anthony', 'national', 'york', 'often', 'beach', 'prey', 'lots', 'williams', 'japanese', 'bayern', 'could', 'usually', 'contract', 'many', 'going', 'love', 'bears', 'orca', 'visit', 'badger', 'predicted', 'average', 'usual', 'house', 'saw', 'championship', 'leopards', 'still', 'florida', 'season', 'rescheduled', 'ted', 'near', 'summer', 'gave', 'small', 'game', 'watch', 'brutal', 'baseball', 'friend', 'ny', 'batting', 'bad', 'cold', 'hottest', 'cats', 'used', 'syracuse', 'snow', 'spots', 'monkeys', 'former', 'antelope', 'person', 'times', 'zebras', 'scored', 'also', 'including', 'interesting', 'win', 'sport', 'rated', 'washed', 'jaguars', 'among', 'supposed', 'player', 'sons', 'animals', 'preference', 'prefer', 'even', 'ucla', 'elephants', 'springtime', 'get', 'several', 'fell', 'much', 'tigers', 'documentaries', 'said', 'rain', 'look', 'bigger', 'pretty', 'two', 'basketball', 'unusual', 'last', 'omnivorous', 'nyc', 'team', 'college', 'heavily', 'either', 'winter', 'goals', 'number', 'lower', 'signed', 'higher', 'see', 'stop', 'willie', 'likes', 'watching', 'score', 'record', 'wonderful', 'russian', 'derek', 'next', 'biggest', 'defend', 'diego', 'think', 'san', 'yankees', 'blue', 'university', 'efficient', 'behave', 'warm', 'wildcats', 'bob', 'days', 'favorite', 'fond', 'went', 'boston', 'living', 'chance', 'carmelo', 'ncaa', 'sunday', 'animal', 'zoo', 'major', 'world', 'shows', 'victory', 'sochi', 'league', 'reached', 'lead', 'picnic', 'lions', 'difficult', 'especially', 'desert', 'jack', 'little', 'seals', 'captain', 'whales', 'cat', 'knicks', 'lost', 'three', 'munich', 'weather', 'rains', 'ferocious', 'day', 'first', 'beat', 'let', 'true', 'hockey', 'big', 'polar', 'arizona', 'smaller', 'like', 'ice', 'masahiro', 'bronx', 'colder']\n",
"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n",
"['winter', 'favorite', 'love', 'cold']\n"
]
}
],
"source": [
"wordcounts = []  # create an empty list to hold one row of counts per document\n",
"for token in cleaned_tokens:  # loop over each document's list of cleaned tokens\n",
"    wordcount = []  # temporary list for this document's counts; it is reset on every iteration\n",
"    for word in wordlist:  # loop over every unique word\n",
"        count = token.count(word)  # how many times the word occurs in this document\n",
"        wordcount.append(count)\n",
"    wordcounts.append(wordcount)\n",
"print(len(wordcounts))  # number of rows (documents)\n",
"print(len(wordcounts[0]))  # number of columns (unique words)\n",
"print(len(cleaned_tokens))  # should equal the number of rows\n",
"print(count)  # count of the last word in the last document\n",
"print(wordlist)\n",
"print(wordcount)  # counts for the last document\n",
"print(cleaned_tokens[-1])  # tokens of the last document\n"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"_uuid": "e5ac55e101a6ea42e89f5d6c589b2ea04d7e2d75"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>go</th>\n",
" <th>honey</th>\n",
" <th>come</th>\n",
" <th>john</th>\n",
" <th>football</th>\n",
" <th>summers</th>\n",
" <th>dogs</th>\n",
" <th>sky</th>\n",
" <th>mary</th>\n",
" <th>sun</th>\n",
" <th>soccer</th>\n",
" <th>temperature</th>\n",
" <th>play</th>\n",
" <th>fourth</th>\n",
" <th>points</th>\n",
" <th>champion</th>\n",
" <th>new</th>\n",
" <th>came</th>\n",
" <th>hot</th>\n",
" <th>anthony</th>\n",
" <th>national</th>\n",
" <th>york</th>\n",
" <th>often</th>\n",
" <th>beach</th>\n",
" <th>prey</th>\n",
" <th>lots</th>\n",
" <th>williams</th>\n",
" <th>japanese</th>\n",
" <th>bayern</th>\n",
" <th>could</th>\n",
" <th>usually</th>\n",
" <th>contract</th>\n",
" <th>many</th>\n",
" <th>going</th>\n",
" <th>love</th>\n",
" <th>bears</th>\n",
" <th>orca</th>\n",
" <th>visit</th>\n",
" <th>badger</th>\n",
" <th>...</th>\n",
" <th>world</th>\n",
" <th>shows</th>\n",
" <th>victory</th>\n",
" <th>sochi</th>\n",
" <th>league</th>\n",
" <th>reached</th>\n",
" <th>lead</th>\n",
" <th>picnic</th>\n",
" <th>lions</th>\n",
" <th>difficult</th>\n",
" <th>especially</th>\n",
" <th>desert</th>\n",
" <th>jack</th>\n",
" <th>little</th>\n",
" <th>seals</th>\n",
" <th>captain</th>\n",
" <th>whales</th>\n",
" <th>cat</th>\n",
" <th>knicks</th>\n",
" <th>lost</th>\n",
" <th>three</th>\n",
" <th>munich</th>\n",
" <th>weather</th>\n",
" <th>rains</th>\n",
" <th>ferocious</th>\n",
" <th>day</th>\n",
" <th>first</th>\n",
" <th>beat</th>\n",
" <th>let</th>\n",
" <th>true</th>\n",
" <th>hockey</th>\n",
" <th>big</th>\n",
" <th>polar</th>\n",
" <th>arizona</th>\n",
" <th>smaller</th>\n",
" <th>like</th>\n",
" <th>ice</th>\n",
" <th>masahiro</th>\n",
" <th>bronx</th>\n",
" <th>colder</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one go honey come john ... like ice masahiro bronx colder\n",
"0 1 0 0 0 0 ... 0 0 0 0 0\n",
"1 0 0 0 0 0 ... 0 0 0 0 0\n",
"2 0 0 0 1 0 ... 0 0 0 0 0\n",
"3 0 0 0 0 0 ... 0 0 0 0 0\n",
"4 0 0 0 0 0 ... 0 0 0 0 0\n",
"\n",
"[5 rows x 201 columns]"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wordmatrix = pd.DataFrame(data=wordcounts, columns=wordlist)  # create a DataFrame so the result is easier to inspect\n",
"wordmatrix.head() # show first 5 documents"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "9681f481bf23f8551b05869e38904488abd85194"
},
"source": [
"## Calculate TF-IDF\n",
"\n",
"For a detailed explanation, see [Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).\n",
"\n",
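"As a sketch of the weighting used here (the symbol names are assumptions, chosen to match the variables in the code cells that follow), the weight of word $i$ in document $j$ can be written as:\n",
"\n",
"$$w_{i,j} = \\mathrm{tf}_{i,j} \\times \\log_{10}\\left(\\frac{N}{n_i}\\right), \\qquad \\mathrm{tf}_{i,j} = \\frac{\\text{count of word } i \\text{ in document } j}{\\text{total words in document } j}$$\n",
"\n",
"Here $N$ is the number of documents and $n_i$ is the number of documents that contain word $i$. For example, `one` appears once in document 0 (which has 7 words) and in 2 of the 47 documents, so its weight would be $(1/7) \\times \\log_{10}(47/2) \\approx 0.1959$.\n",
"\n",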
"![](https://www.researchgate.net/profile/Heloisa_Rocha/publication/221228354/figure/fig2/AS:650816818003985@1532178229971/TF-IDF-formula-2.png)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"_uuid": "8553c74d9389d4582edc92ff2a6221e19a68f428"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>go</th>\n",
" <th>honey</th>\n",
" <th>come</th>\n",
" <th>john</th>\n",
" <th>football</th>\n",
" <th>summers</th>\n",
" <th>dogs</th>\n",
" <th>sky</th>\n",
" <th>mary</th>\n",
" <th>sun</th>\n",
" <th>soccer</th>\n",
" <th>temperature</th>\n",
" <th>play</th>\n",
" <th>fourth</th>\n",
" <th>points</th>\n",
" <th>champion</th>\n",
" <th>new</th>\n",
" <th>came</th>\n",
" <th>hot</th>\n",
" <th>anthony</th>\n",
" <th>national</th>\n",
" <th>york</th>\n",
" <th>often</th>\n",
" <th>beach</th>\n",
" <th>prey</th>\n",
" <th>lots</th>\n",
" <th>williams</th>\n",
" <th>japanese</th>\n",
" <th>bayern</th>\n",
" <th>could</th>\n",
" <th>usually</th>\n",
" <th>contract</th>\n",
" <th>many</th>\n",
" <th>going</th>\n",
" <th>love</th>\n",
" <th>bears</th>\n",
" <th>orca</th>\n",
" <th>visit</th>\n",
" <th>badger</th>\n",
" <th>...</th>\n",
" <th>shows</th>\n",
" <th>victory</th>\n",
" <th>sochi</th>\n",
" <th>league</th>\n",
" <th>reached</th>\n",
" <th>lead</th>\n",
" <th>picnic</th>\n",
" <th>lions</th>\n",
" <th>difficult</th>\n",
" <th>especially</th>\n",
" <th>desert</th>\n",
" <th>jack</th>\n",
" <th>little</th>\n",
" <th>seals</th>\n",
" <th>captain</th>\n",
" <th>whales</th>\n",
" <th>cat</th>\n",
" <th>knicks</th>\n",
" <th>lost</th>\n",
" <th>three</th>\n",
" <th>munich</th>\n",
" <th>weather</th>\n",
" <th>rains</th>\n",
" <th>ferocious</th>\n",
" <th>day</th>\n",
" <th>first</th>\n",
" <th>beat</th>\n",
" <th>let</th>\n",
" <th>true</th>\n",
" <th>hockey</th>\n",
" <th>big</th>\n",
" <th>polar</th>\n",
" <th>arizona</th>\n",
" <th>smaller</th>\n",
" <th>like</th>\n",
" <th>ice</th>\n",
" <th>masahiro</th>\n",
" <th>bronx</th>\n",
" <th>colder</th>\n",
" <th>row_total</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one go honey come ... masahiro bronx colder row_total\n",
"0 1 0 0 0 ... 0 0 0 7\n",
"1 0 0 0 0 ... 0 0 0 11\n",
"2 0 0 0 1 ... 0 0 0 3\n",
"3 0 0 0 0 ... 0 0 0 9\n",
"4 0 0 0 0 ... 0 0 0 4\n",
"\n",
"[5 rows x 202 columns]"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wordmatrix['row_total'] = wordmatrix.aggregate('sum',axis=1) # add a sum column (total number of words in each document)\n",
"wordmatrix.head()"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"_uuid": "9e6de1b8ce57c12e01dd14ca4981fb842863781c"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"one 2\n",
"go 5\n",
"honey 1\n",
"come 1\n",
"john 1\n",
"football 2\n",
"summers 1\n",
"dogs 2\n",
"sky 1\n",
"mary 2\n",
"sun 1\n",
"soccer 2\n",
"temperature 1\n",
"play 4\n",
"fourth 1\n",
"points 1\n",
"champion 1\n",
"new 2\n",
"came 1\n",
"hot 5\n",
"anthony 1\n",
"national 1\n",
"york 2\n",
"often 1\n",
"beach 1\n",
"prey 1\n",
"lots 2\n",
"williams 1\n",
"japanese 1\n",
"bayern 1\n",
" ..\n",
"desert 1\n",
"jack 2\n",
"little 3\n",
"seals 1\n",
"captain 1\n",
"whales 1\n",
"cat 1\n",
"knicks 1\n",
"lost 1\n",
"three 1\n",
"munich 1\n",
"weather 5\n",
"rains 1\n",
"ferocious 1\n",
"day 1\n",
"first 2\n",
"beat 1\n",
"let 1\n",
"true 1\n",
"hockey 1\n",
"big 5\n",
"polar 1\n",
"arizona 1\n",
"smaller 1\n",
"like 8\n",
"ice 2\n",
"masahiro 1\n",
"bronx 2\n",
"colder 2\n",
"row_total 47\n",
"Length: 202, dtype: int64\n"
]
}
],
"source": [
"N = len(wordmatrix)  # total number of documents\n",
"n = wordmatrix.astype('bool').sum()  # number of documents containing each word (non-zero counts)\n",
"print(n)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"_uuid": "665c6b8dec2442fea2b1f96a110193c0b5535c2a"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>go</th>\n",
" <th>honey</th>\n",
" <th>come</th>\n",
" <th>john</th>\n",
" <th>football</th>\n",
" <th>summers</th>\n",
" <th>dogs</th>\n",
" <th>sky</th>\n",
" <th>mary</th>\n",
" <th>sun</th>\n",
" <th>soccer</th>\n",
" <th>temperature</th>\n",
" <th>play</th>\n",
" <th>fourth</th>\n",
" <th>points</th>\n",
" <th>champion</th>\n",
" <th>new</th>\n",
" <th>came</th>\n",
" <th>hot</th>\n",
" <th>anthony</th>\n",
" <th>national</th>\n",
" <th>york</th>\n",
" <th>often</th>\n",
" <th>beach</th>\n",
" <th>prey</th>\n",
" <th>lots</th>\n",
" <th>williams</th>\n",
" <th>japanese</th>\n",
" <th>bayern</th>\n",
" <th>could</th>\n",
" <th>usually</th>\n",
" <th>contract</th>\n",
" <th>many</th>\n",
" <th>going</th>\n",
" <th>love</th>\n",
" <th>bears</th>\n",
" <th>orca</th>\n",
" <th>visit</th>\n",
" <th>badger</th>\n",
" <th>...</th>\n",
" <th>shows</th>\n",
" <th>victory</th>\n",
" <th>sochi</th>\n",
" <th>league</th>\n",
" <th>reached</th>\n",
" <th>lead</th>\n",
" <th>picnic</th>\n",
" <th>lions</th>\n",
" <th>difficult</th>\n",
" <th>especially</th>\n",
" <th>desert</th>\n",
" <th>jack</th>\n",
" <th>little</th>\n",
" <th>seals</th>\n",
" <th>captain</th>\n",
" <th>whales</th>\n",
" <th>cat</th>\n",
" <th>knicks</th>\n",
" <th>lost</th>\n",
" <th>three</th>\n",
" <th>munich</th>\n",
" <th>weather</th>\n",
" <th>rains</th>\n",
" <th>ferocious</th>\n",
" <th>day</th>\n",
" <th>first</th>\n",
" <th>beat</th>\n",
" <th>let</th>\n",
" <th>true</th>\n",
" <th>hockey</th>\n",
" <th>big</th>\n",
" <th>polar</th>\n",
" <th>arizona</th>\n",
" <th>smaller</th>\n",
" <th>like</th>\n",
" <th>ice</th>\n",
" <th>masahiro</th>\n",
" <th>bronx</th>\n",
" <th>colder</th>\n",
" <th>row_total</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.195867</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.195867</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.238871</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.152009</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.152009</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.152009</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.152009</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.557366</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.356679</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.152341</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.152341</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.185789</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.342767</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.267509</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.298744</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one go honey come ... masahiro bronx colder row_total\n",
"0 0.195867 0.0 0.0 0.000000 ... 0.0 0.0 0.0 7\n",
"1 0.000000 0.0 0.0 0.000000 ... 0.0 0.0 0.0 11\n",
"2 0.000000 0.0 0.0 0.557366 ... 0.0 0.0 0.0 3\n",
"3 0.000000 0.0 0.0 0.000000 ... 0.0 0.0 0.0 9\n",
"4 0.000000 0.0 0.0 0.000000 ... 0.0 0.0 0.0 4\n",
"\n",
"[5 rows x 202 columns]"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import math\n",
"\n",
"# TF-IDF weight = (word count / total words in the document) * log10(N / n[word])\n",
"for row in range(len(wordmatrix)): # go through every row (one row per document)\n",
"    for col in wordmatrix.columns[:-1]: # go through every column except 'row_total'\n",
"        wordmatrix.loc[row,col] = wordmatrix.loc[row,col]/wordmatrix.loc[row,'row_total']*math.log10(N/n[col])\n",
" \n",
"wordmatrix.head()"
]
},
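{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loop above can be written as a single formula. This is only a sketch: it assumes, as the variable names suggest, that `N` is the total number of documents and `n[col]` is the number of documents containing the word. The symbols $c_{d,t}$ and $T_d$ are notation introduced here for illustration.\n",
"\n",
"$$w_{d,t} = \\frac{c_{d,t}}{T_d} \\cdot \\log_{10}\\frac{N}{n_t}$$\n",
"\n",
"Here $c_{d,t}$ is the count of word $t$ in document $d$, and $T_d$ is that document's `row_total`.\n",
"\n",
"For example, if *one* occurs once in document 0 (`row_total` = 7) and appears in 2 of the 47 documents, then\n",
"\n",
"$$w_{0,\\text{one}} = \\frac{1}{7} \\cdot \\log_{10}\\frac{47}{2} \\approx 0.1959$$\n",
"\n",
"which agrees with the value shown for `one` in row 0 of the output above."
]
},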
{
"cell_type": "markdown",
"metadata": {
"_uuid": "a9c0e9b2fbc106989d150b9b95c8f9c36cbe436e"
},
"source": [
"# SVD decomposition\n",
"![](https://intoli.com/blog/pca-and-svd/img/svd-matrices.png)\n",
"\n",
"See details [here](https://intoli.com/blog/pca-and-svd/)\n"
]
},
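{
"cell_type": "markdown",
"metadata": {},
"source": [
"In symbols (a sketch; $A$, $U_k$, $\\Sigma_k$ and $V_k$ are notation introduced here, not variables in this notebook), a truncated SVD keeps only the $k$ largest singular values of the document-word matrix $A$:\n",
"\n",
"$$A \\approx U_k \\Sigma_k V_k^{T}$$\n",
"\n",
"With $m$ documents and $p$ words, $U_k$ is $m \\times k$, $\\Sigma_k$ is a $k \\times k$ diagonal matrix, and $V_k^{T}$ is $k \\times p$. scikit-learn's `TruncatedSVD.fit_transform` returns the reduced document coordinates\n",
"\n",
"$$X = U_k \\Sigma_k = A V_k$$\n",
"\n",
"so with `n_components=15` each of the 47 documents is described by 15 numbers instead of 201 word weights."
]
},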
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"_uuid": "e9320ed0f6dcf82f3cee38751893de9b6a15e21b"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 4.79734445e-04 4.19718318e-17 2.25537917e-02 -6.35396284e-04\n",
" 6.67179918e-03 -3.14859129e-02 3.31740791e-03 2.49238897e-02\n",
" 5.52374086e-04 1.32667892e-01 -6.10798188e-02 3.40478534e-02\n",
" 6.90012430e-03 3.90459107e-01 7.63479822e-02]\n",
" [ 1.23071776e-05 -6.45510807e-17 6.27306555e-03 -1.07304571e-03\n",
" -5.16765618e-04 9.92244253e-03 1.77652577e-04 2.29876942e-03\n",
" -4.06045579e-04 4.23227228e-02 4.20798427e-02 -2.93015268e-04\n",
" 2.76891379e-03 1.64908620e-03 1.74574722e-03]\n",
" [ 1.11170821e-03 -2.22183200e-18 4.09446438e-01 -6.23984802e-02\n",
" -4.15143837e-02 4.68492157e-01 -3.10438782e-02 2.95796547e-02\n",
" -7.72699665e-03 3.22160500e-02 -1.03672890e-01 -3.47733310e-02\n",
" -1.24971296e-01 -6.83408578e-03 -5.44631707e-02]\n",
" [ 1.87371786e-04 4.44273841e-18 6.10497093e-02 -7.33084300e-03\n",
" -5.79224381e-03 7.13838454e-02 -2.82280330e-03 5.82949980e-03\n",
" 1.22134889e-03 1.08355540e-02 -2.03278591e-02 -4.53271197e-03\n",
" -3.29125689e-03 4.24911671e-03 -1.44324067e-03]\n",
" [ 3.39538555e-03 9.82897390e-17 4.21252441e-01 -5.54303822e-02\n",
" -2.01675899e-02 8.20761882e-02 -1.97373246e-02 -1.81705505e-02\n",
" -1.99620614e-02 -8.68637451e-02 4.64243990e-02 2.00011915e-02\n",
" -3.59486588e-02 -2.59901558e-02 1.31090103e-02]\n",
" [ 4.33288559e-03 -2.33115532e-17 9.56751234e-02 -1.81855913e-03\n",
" 2.17746398e-02 -9.00107325e-02 8.14354768e-03 5.46757526e-02\n",
" 1.20680687e-03 1.93048192e-01 -1.00053801e-01 4.46485574e-02\n",
" 7.70877488e-03 4.10973570e-01 4.30640137e-02]\n",
" [ 6.50521963e-02 -1.32792350e-17 2.11267258e-02 2.61948930e-01\n",
" -4.44314095e-02 9.44903631e-04 -2.93710479e-02 -2.96450104e-03\n",
" -3.10775559e-02 -7.28361928e-03 1.25058528e-02 1.03713585e-02\n",
" -3.20719737e-02 1.03526045e-02 -4.69694474e-02]\n",
" [ 5.66161933e-03 2.07236325e-17 5.94845876e-02 1.45159918e-02\n",
" 5.43701218e-02 -6.28140051e-02 3.45673591e-02 -1.90641445e-02\n",
" -9.83576036e-03 7.30583073e-02 -5.68972454e-02 9.02227527e-02\n",
" -4.37736754e-03 1.81974048e-02 -3.63949391e-02]\n",
" [ 6.39782309e-04 -7.65710956e-19 2.45619304e-02 6.56620464e-02\n",
" 4.13248053e-01 1.29615085e-02 4.03213652e-03 1.31520960e-02\n",
" -1.39623905e-02 1.25964193e-01 -1.28414444e-01 4.95251674e-01\n",
" -2.37792497e-02 -1.71796966e-01 -2.43499294e-02]\n",
" [ 2.58921427e-02 4.59616194e-17 1.12892492e-01 2.43979599e-02\n",
" 1.06510534e-02 -2.01213166e-01 6.50024727e-03 -8.99787015e-03\n",
" 3.06557708e-02 2.06491687e-01 -1.75894171e-01 -1.44686042e-01\n",
" -5.00537360e-03 -6.50627033e-02 -1.10259868e-01]\n",
" [ 4.80637501e-03 8.80034870e-17 6.50249321e-02 6.06410801e-03\n",
" 1.75003515e-02 -1.02512078e-01 6.64518738e-03 2.08165477e-02\n",
" 7.17964482e-03 7.96597964e-02 -6.17472691e-02 -1.70367683e-02\n",
" 2.00061728e-02 2.34172674e-02 -5.59116649e-02]\n",
" [ 1.89380939e-02 8.49878291e-17 7.57203502e-02 4.36133565e-02\n",
" 3.95908798e-03 -1.14187104e-01 5.00165204e-02 -4.94474917e-02\n",
" 1.70882884e-02 1.16653274e-01 -1.02895305e-01 -8.33319701e-02\n",
" 4.16598802e-02 -3.83964456e-02 -6.49465485e-02]\n",
" [ 6.81950036e-04 -1.64078111e-17 1.71080949e-01 -2.29857022e-02\n",
" -1.09651748e-02 1.64285091e-01 -2.27630533e-03 2.44499574e-02\n",
" -4.90975826e-03 9.52605618e-02 1.62127419e-02 1.17631782e-03\n",
" -2.79945251e-02 5.48503399e-02 -2.50436219e-03]\n",
" [ 2.28846076e-03 2.75472236e-16 8.26856678e-02 2.69811104e-02\n",
" 7.74061714e-03 -4.89979581e-02 3.25475884e-01 -4.22586400e-01\n",
" -1.18044652e-01 -6.42084582e-02 7.87866761e-02 2.52684858e-02\n",
" 3.92045694e-03 4.88569704e-02 5.37758950e-02]\n",
" [ 1.98154089e-02 3.39091984e-17 2.24635577e-02 1.59537549e-02\n",
" 2.01932333e-03 -6.55504661e-02 1.07563355e-02 -1.65208969e-02\n",
" 1.41240247e-02 1.72328069e-01 -1.60849559e-01 -1.71973675e-01\n",
" -1.07090272e-01 -1.74325463e-01 1.26391785e-01]\n",
" [ 1.50883679e-18 9.65386148e-01 -1.90906504e-16 -2.14482908e-17\n",
" -9.56328021e-18 1.03311002e-16 -2.81217342e-16 1.31966764e-16\n",
" 6.59744017e-17 1.76247594e-17 4.97939138e-18 6.07041853e-18\n",
" -1.19805913e-17 2.90961643e-18 -4.69708318e-18]\n",
" [ 2.43453534e-05 -1.41562841e-17 1.41747778e-02 -2.54853288e-03\n",
" -1.38704714e-03 2.43540008e-02 -4.45614215e-04 5.03025312e-03\n",
" -9.43301541e-04 9.83609379e-02 9.94251589e-02 -1.77340729e-03\n",
" 6.58321590e-03 -4.69576581e-03 6.23702813e-04]\n",
" [ 4.61566074e-03 1.01985743e-16 5.55138592e-02 4.79338808e-03\n",
" 4.26355607e-03 -1.20067341e-02 6.90508559e-02 -8.60624028e-02\n",
" -2.18931236e-02 -6.30514989e-03 4.70648638e-03 -1.46946407e-03\n",
" -6.22854592e-03 -2.98463990e-03 -4.59695851e-03]\n",
" [ 1.27138111e-01 4.26794087e-17 6.31101389e-02 2.30145296e-02\n",
" 8.61182648e-03 -1.08363399e-01 1.06945316e-02 -1.84503109e-02\n",
" 8.52880786e-03 1.16490987e-01 -9.73746245e-02 -7.86936006e-02\n",
" -3.58523809e-02 -4.09475906e-02 -5.60018381e-02]\n",
" [ 1.37477733e-03 1.15052982e-18 1.02993594e-02 4.12131716e-02\n",
" 6.68844246e-04 -2.30789854e-03 1.44315055e-01 5.49440323e-03\n",
" 5.22108399e-01 -4.80109142e-02 6.70591385e-02 3.77655150e-02\n",
" -1.59468099e-01 3.11728829e-02 -1.07398300e-02]\n",
" [ 7.28056372e-04 3.44092554e-17 3.35860442e-02 1.54189363e-02\n",
" 5.53105851e-02 -1.32288632e-02 1.22806262e-01 1.08736036e-01\n",
" -3.47591985e-02 6.29145529e-02 -4.16246629e-02 7.90085729e-02\n",
" -6.14392737e-03 5.50622756e-02 1.48788067e-02]\n",
" [ 4.07133276e-03 3.26250252e-17 1.90517789e-02 1.97741632e-02\n",
" 7.84027285e-02 -1.15300493e-02 7.47143841e-03 -1.00804231e-04\n",
" 1.15779468e-02 3.71396465e-02 -3.21780259e-02 1.83017322e-02\n",
" 2.00852215e-02 1.40654713e-02 -5.93581156e-03]\n",
" [ 8.78802458e-01 -2.85138037e-17 -4.37832846e-02 -2.53432194e-01\n",
" 3.89806140e-02 4.13699305e-02 3.10917341e-02 5.56434533e-03\n",
" 3.49348540e-02 -5.61061789e-02 3.56835412e-02 5.01282858e-02\n",
" 1.71882059e-01 4.55619171e-02 -4.95116153e-02]\n",
" [ 2.35741894e-03 1.31619439e-17 1.30645313e-01 6.14893009e-02\n",
" -2.34083770e-02 1.64888277e-01 2.81073969e-02 1.73740128e-02\n",
" 7.45512320e-02 2.43107064e-02 -9.28690000e-02 -4.31289087e-02\n",
" 3.21914795e-01 -6.54799207e-02 8.77039323e-02]\n",
" [ 1.91111880e-03 1.00614500e-16 3.43738854e-02 4.12821174e-02\n",
" 1.40892005e-02 -5.19973092e-03 1.79874980e-01 1.99984408e-02\n",
" 3.50400733e-02 -1.37836883e-02 6.90042793e-03 -4.68809335e-03\n",
" 3.95394448e-02 -4.87109435e-03 2.93799542e-02]\n",
" [ 3.57279059e-03 3.56531982e-17 2.66313841e-02 1.03352491e-01\n",
" -9.08708565e-03 2.50555062e-02 5.83212007e-02 8.00165445e-03\n",
" 1.60828228e-01 4.98084944e-03 -3.48377955e-02 -1.29758535e-02\n",
" 2.37401638e-01 -3.27854931e-02 3.90591953e-02]\n",
" [ 9.64770678e-05 4.21069308e-17 6.97889166e-03 1.14727754e-02\n",
" 2.42895809e-02 -3.57338091e-03 1.27475779e-01 1.11106474e-01\n",
" -3.84181320e-02 -7.91693558e-03 2.72391640e-03 7.66554663e-03\n",
" -2.98983674e-03 -7.70430438e-03 4.92796797e-01]\n",
" [-4.55398219e-17 -4.55540511e-17 1.18614785e-15 -6.33915994e-14\n",
" -3.42625006e-14 9.24508662e-15 -1.04873261e-12 -6.08922839e-12\n",
" -3.15862151e-12 1.52676962e-10 -3.18901955e-10 3.95876986e-10\n",
" -2.40383434e-09 5.81940232e-09 1.33688218e-07]\n",
" [ 4.87289331e-05 -1.20364334e-17 9.35113172e-03 1.10335332e-04\n",
" -1.88357642e-04 1.12004406e-02 7.53211684e-03 2.94992145e-03\n",
" 1.36918517e-03 1.38604846e-02 7.05076168e-03 -1.54967539e-04\n",
" 5.15106831e-04 9.13733566e-03 4.17644328e-03]\n",
" [ 1.41424822e-04 -4.50080293e-18 4.60403090e-02 -5.47406182e-03\n",
" -4.58553910e-03 5.35570123e-02 -1.82291777e-03 4.84835937e-03\n",
" 9.35970848e-04 1.93923335e-02 -9.65518612e-04 -3.90373142e-03\n",
" -3.35980085e-04 2.65904958e-03 3.61021918e-04]\n",
" [ 2.93941281e-04 -2.00125284e-17 1.67786657e-02 9.82434102e-02\n",
" 6.52292468e-01 6.30281521e-02 -8.76881135e-02 -3.69953930e-02\n",
" 1.25527509e-02 -1.09594788e-01 1.03060645e-01 -3.31709366e-01\n",
" 8.32871762e-03 9.45490495e-02 -1.52217151e-02]\n",
" [ 1.70499409e-04 9.40300127e-18 6.75700616e-03 2.49274913e-02\n",
" 1.44422761e-01 1.00352991e-02 1.06661382e-02 7.39925120e-03\n",
" -1.05353086e-05 5.23971524e-03 -8.82162589e-03 4.91720685e-02\n",
" 2.71271318e-03 -2.62823678e-02 1.13728050e-01]\n",
" [ 4.82651923e-04 2.54078873e-18 1.05707224e-01 -1.35211992e-02\n",
" -5.00266786e-03 1.02277932e-01 -2.35970253e-03 1.10880971e-02\n",
" -2.13218560e-03 2.27841807e-02 -3.13866713e-02 2.07471188e-03\n",
" -2.06732652e-02 2.60511830e-02 -6.19679344e-03]\n",
" [ 2.77671412e-04 1.59557995e-16 2.12604727e-02 3.34024841e-02\n",
" 4.19714480e-02 -1.66164068e-02 4.95042824e-01 4.54590193e-01\n",
" -1.69132459e-01 -6.62376155e-02 4.12784532e-02 -7.72388813e-02\n",
" -3.52471449e-02 -3.25089392e-02 -1.38444903e-01]\n",
" [ 2.41758900e-05 -3.16647950e-18 7.17319219e-03 -1.08100326e-03\n",
" -1.50238235e-04 7.92268071e-03 4.51719681e-04 3.93841973e-03\n",
" -3.40086384e-04 5.22146823e-02 3.61960326e-02 2.65008043e-03\n",
" 3.33378576e-03 3.86191282e-02 1.06269680e-02]\n",
" [ 1.72492851e-01 -2.76688167e-17 1.12589322e-03 1.39650558e-02\n",
" -1.66411632e-03 -1.85819605e-02 -2.78396150e-03 -5.59873633e-03\n",
" -6.03207163e-03 9.12453595e-02 -8.25101785e-02 -1.06530862e-01\n",
" -1.66658880e-01 -1.33140420e-01 2.12521193e-01]\n",
" [ 4.35537320e-06 -2.51126219e-17 2.12888144e-03 -2.08387607e-04\n",
" -8.46057156e-05 3.79589543e-03 1.36740367e-03 1.50159958e-03\n",
" 1.60817899e-04 2.15010141e-02 1.66236924e-02 6.27508868e-04\n",
" 1.50746230e-03 1.50109985e-02 6.73739251e-03]\n",
" [ 1.15537182e-01 4.62361720e-17 4.20969365e-02 6.79795357e-02\n",
" -3.19155522e-03 -6.26779126e-02 4.45002191e-02 -4.70794585e-02\n",
" 3.95276832e-02 9.31977853e-02 -8.32485353e-02 -6.77576718e-02\n",
" -4.43001612e-03 -5.04808760e-02 -4.90188439e-02]\n",
" [ 7.80507357e-03 2.61134403e-16 1.48966223e-01 2.31193165e-02\n",
" 6.63598775e-03 -3.67004433e-02 2.53828247e-01 -3.25840266e-01\n",
" -8.66403865e-02 -3.73820877e-02 3.84320487e-02 1.95974706e-02\n",
" -1.43009915e-02 1.07866786e-02 -9.30255953e-03]\n",
" [ 1.45455798e-02 1.17724666e-17 3.55234505e-02 5.08049223e-02\n",
" 1.84724380e-04 -4.26797013e-02 5.49387737e-02 -8.32614105e-02\n",
" -2.14320512e-02 5.51743312e-02 -4.42648700e-02 -1.80396968e-02\n",
" -2.95235951e-02 -1.52347151e-02 -1.20052216e-01]\n",
" [ 3.32302382e-05 -2.62913554e-17 2.27598479e-02 -4.76468450e-03\n",
" -2.34642305e-03 4.86091607e-02 -7.31719485e-04 1.31805175e-02\n",
" -3.37201785e-03 4.37818887e-01 5.07414077e-01 -7.59553645e-03\n",
" 5.26934495e-02 -8.64865179e-02 -2.17631521e-03]\n",
" [ 5.65063653e-01 2.69891681e-17 4.12850726e-03 2.00391235e-01\n",
" -3.56669485e-02 -2.83645551e-03 -4.44758701e-02 7.59732786e-03\n",
" -5.12428388e-02 1.18624156e-02 7.18997690e-03 -1.59627040e-02\n",
" -2.01330957e-01 -1.73341364e-02 4.95951230e-02]\n",
" [ 4.78425874e-03 8.37943590e-18 1.12768605e-02 3.69212002e-02\n",
" 9.53501473e-04 -6.39144813e-03 1.21136224e-01 1.57665653e-03\n",
" 3.98299131e-01 -2.87536750e-02 4.35303096e-02 2.18049568e-02\n",
" -1.18857287e-01 1.81886696e-02 -1.15582426e-02]\n",
" [ 6.24014613e-03 2.37868399e-17 3.90765137e-02 8.00323882e-03\n",
" 4.32029879e-03 -6.62163676e-02 3.41837513e-03 -3.81463034e-03\n",
" 6.11360740e-03 5.80347113e-02 -4.73851515e-02 -3.13801462e-02\n",
" 3.27010349e-03 -4.39536954e-03 -5.50643919e-02]\n",
" [ 1.08345216e-02 7.75348174e-17 5.01772624e-01 -7.02707496e-02\n",
" 5.88236903e-03 -4.39562166e-01 -1.67229040e-01 1.28382704e-01\n",
" 2.01351238e-02 -1.70423455e-01 1.53317458e-01 5.26379841e-02\n",
" 4.59042372e-02 -4.24190171e-02 5.12941282e-02]\n",
" [ 1.18265588e-01 -4.01551765e-17 3.32228875e-02 5.82798258e-01\n",
" -1.03156608e-01 2.90763664e-02 -1.02920870e-01 3.59853540e-02\n",
" -8.63318292e-02 -8.55234634e-02 9.50055103e-02 7.30506478e-02\n",
" -8.12885487e-02 7.30693373e-02 -2.29889975e-02]\n",
" [ 2.53048747e-02 2.14786803e-17 5.28183900e-02 2.93948057e-01\n",
" -4.68856897e-02 3.00035921e-02 2.70133241e-02 7.07549500e-03\n",
" 8.97378869e-02 1.75141930e-02 -5.98529841e-02 -2.87731531e-02\n",
" 3.54420870e-01 -4.54786132e-02 1.64249224e-02]]\n"
]
}
],
"source": [
"from sklearn.decomposition import TruncatedSVD\n",
"\n",
"svd = TruncatedSVD(n_components=15, n_iter=30, random_state=0)\n",
"X = svd.fit_transform(wordmatrix.drop('row_total',axis=1))\n",
"print(X)"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "9a25fb11e63b4f6cbdedea03e492d4cbc2005ce9"
},
"source": [
"# Clustering\n",
"Because everyone learned clustering from 5P11 and 5P2, I won't explain clustering here.\n",
"\n",
"For a comparison of different clustering algorithms, see [this scikit-learn example](https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html).\n",
"\n",
"## You can also watch\n",
"\n",
"<a style='color:blue' href='http://www.rel8ed.to/2019/coldest_capital_hottest_women_owned_business/'>My Tableau Clustering Tutorial</a>"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"_uuid": "dbc81ddcf2e32f084860e793181a7040b2b6ee4e"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n",
" 1 1 1 1 0 1 1 1 0 1]\n"
]
}
],
"source": [
"from sklearn.mixture import GaussianMixture\n",
"gmm = GaussianMixture(n_components=3)\n",
"gmm.fit(X)\n",
"result = gmm.predict(X)\n",
"print(result)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"_uuid": "e4b464434b72fd02403ee7514e2ac835ccb7e216"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0 0 2 0 2 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 1 0 0 2 0 0]\n"
]
}
],
"source": [
"from sklearn.cluster import SpectralClustering\n",
"clustering = SpectralClustering(n_clusters=3, random_state=0)\n",
"clustering.fit(X)\n",
"result0 = clustering.labels_\n",
"print(result0)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"_kg_hide-output": false,
"_uuid": "9254c2f6829cfff94be163dc3b44ffc462a08d1a"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Target_Subject</th>\n",
" <th>TextField</th>\n",
" <th>cluster</th>\n",
" <th>cluster0</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A</td>\n",
" <td>Bob has two dogs and one cat. The cat is bigg...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>S</td>\n",
" <td>Carmelo Anthony scored 42 points to lead the N...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>S</td>\n",
" <td>Come play baseball with us.</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>S</td>\n",
" <td>Derek Jeter, the captain of the New York Yanke...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>S</td>\n",
" <td>Do you have a baseball or a football that we c...</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>A</td>\n",
" <td>Do you like big dogs or little dogs? Dogs a...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>W</td>\n",
" <td>During the winter, the sun is lower in the sky...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>A</td>\n",
" <td>House cats behave very much like their big cou...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>A</td>\n",
" <td>I have a friend who had 5 cats in her hourse....</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>W</td>\n",
" <td>I like the springtime when the weather is not ...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>A</td>\n",
" <td>I think animals with spots and stripes, like t...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>W</td>\n",
" <td>I think I prefer very hot weather to very cold...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>S</td>\n",
" <td>I used to play Little League baseball and bask...</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>W</td>\n",
" <td>If it rains tomorrow, let's not go outside. I...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>W</td>\n",
" <td>If there is rain or snow, I am still going out...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>A</td>\n",
" <td>If we only have 30 minutes, should we visit th...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>S</td>\n",
" <td>In the National Basketball Association, three ...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>W</td>\n",
" <td>Jack and Mary could not go to the picnic becau...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>W</td>\n",
" <td>Jack likes the snow and ice of winter. He doe...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>A</td>\n",
" <td>John went to the zoo and saw a lion, a tiger, ...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>A</td>\n",
" <td>Lions are usually a little smaller than tigers...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>A</td>\n",
" <td>Mary likes to watch animal documentaries on te...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>W</td>\n",
" <td>More snow is predicted for the Northeast.</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>S</td>\n",
" <td>My favorite baseball player of all times is Wi...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>A</td>\n",
" <td>My favorite zoo is the Bronx Zoo. I usually g...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>A</td>\n",
" <td>My favorite zoo is the San Diego Zoo. I love...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td>A</td>\n",
" <td>Orca whales prey on seals and lions prey on ze...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>W</td>\n",
" <td>Phoenix, Arizona, had it's fourth hottest day ...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>S</td>\n",
" <td>Second-half goals gave the Bayern Munich socce...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>S</td>\n",
" <td>Ted Williams is the last person to have a batt...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>30</th>\n",
" <td>A</td>\n",
" <td>The biggest animal in the world is the blue wh...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31</th>\n",
" <td>A</td>\n",
" <td>The honey badger is a small but ferocious anim...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32</th>\n",
" <td>S</td>\n",
" <td>The Japanese pitcher, Masahiro Tanaka, has sig...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>33</th>\n",
" <td>A</td>\n",
" <td>The lions are near the zebras in the Bronx Zoo.</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34</th>\n",
" <td>S</td>\n",
" <td>The number one rated Syracuse University bask...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>35</th>\n",
" <td>W</td>\n",
" <td>The snow fell heavily but then the rain came a...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36</th>\n",
" <td>S</td>\n",
" <td>The U.S. soccer team beat the Russian team in ...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>37</th>\n",
" <td>W</td>\n",
" <td>The weather in NYC is hot in the summer and co...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>38</th>\n",
" <td>W</td>\n",
" <td>There was so much rain, we could not go out. ...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>39</th>\n",
" <td>W</td>\n",
" <td>This has been a very difficult winter, much co...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>40</th>\n",
" <td>S</td>\n",
" <td>UCLA has won the NCAA basketball championship ...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>41</th>\n",
" <td>W</td>\n",
" <td>We have had snow, snow and more snow for 10 da...</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>42</th>\n",
" <td>A</td>\n",
" <td>We went to the zoo and saw a lion, a tiger, el...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>43</th>\n",
" <td>A</td>\n",
" <td>If you like hot weather, the Arizona desert i...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44</th>\n",
" <td>S</td>\n",
" <td>Which sport do you like more: soccer, basebal...</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>45</th>\n",
" <td>W</td>\n",
" <td>Winter days are often so pretty, even if it is...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>46</th>\n",
" <td>W</td>\n",
" <td>Winter is my favorite season. I love the cold...</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Target_Subject ... cluster0\n",
"0 A ... 0\n",
"1 S ... 0\n",
"2 S ... 2\n",
"3 S ... 0\n",
"4 S ... 2\n",
"5 A ... 0\n",
"6 W ... 0\n",
"7 A ... 0\n",
"8 A ... 0\n",
"9 W ... 0\n",
"10 A ... 0\n",
"11 W ... 0\n",
"12 S ... 2\n",
"13 W ... 0\n",
"14 W ... 0\n",
"15 A ... 0\n",
"16 S ... 0\n",
"17 W ... 0\n",
"18 W ... 0\n",
"19 A ... 0\n",
"20 A ... 0\n",
"21 A ... 0\n",
"22 W ... 1\n",
"23 S ... 0\n",
"24 A ... 0\n",
"25 A ... 0\n",
"26 A ... 0\n",
"27 W ... 0\n",
"28 S ... 0\n",
"29 S ... 0\n",
"30 A ... 0\n",
"31 A ... 0\n",
"32 S ... 0\n",
"33 A ... 0\n",
"34 S ... 0\n",
"35 W ... 0\n",
"36 S ... 0\n",
"37 W ... 0\n",
"38 W ... 0\n",
"39 W ... 0\n",
"40 S ... 0\n",
"41 W ... 1\n",
"42 A ... 0\n",
"43 A ... 0\n",
"44 S ... 2\n",
"45 W ... 0\n",
"46 W ... 0\n",
"\n",
"[47 rows x 4 columns]"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['cluster'] = result\n",
"data['cluster0'] = result0\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "50791e37ae813d3a1595827cab5e8a99737a9523"
},
"source": [
"# Word Topics\n",
"Simply put, building word topics means transposing the text matrix and clustering the keywords (originally, clustering works on documents).\n",
"Here we use the transposed SVD to make word topics:\n",
"\n",
"1. Transpose the SVD data\n",
"2. Cluster the transposed matrix\n",
"3. Refer to the keywords\n",
"\n",
"Let's do it step by step:"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "8f7ca4b5ca98136380a3a1d6580dceeb5fe66db2"
},
"source": [
"## Transpose SVDs"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"_uuid": "d85774d4bf9cf8e1c54138dccf11fb8375b6c970"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>10</th>\n",
" <th>11</th>\n",
" <th>12</th>\n",
" <th>13</th>\n",
" <th>14</th>\n",
" <th>15</th>\n",
" <th>16</th>\n",
" <th>17</th>\n",
" <th>18</th>\n",
" <th>19</th>\n",
" <th>20</th>\n",
" <th>21</th>\n",
" <th>22</th>\n",
" <th>23</th>\n",
" <th>24</th>\n",
" <th>25</th>\n",
" <th>26</th>\n",
" <th>27</th>\n",
" <th>28</th>\n",
" <th>29</th>\n",
" <th>30</th>\n",
" <th>31</th>\n",
" <th>32</th>\n",
" <th>33</th>\n",
" <th>34</th>\n",
" <th>35</th>\n",
" <th>36</th>\n",
" <th>37</th>\n",
" <th>38</th>\n",
" <th>39</th>\n",
" <th>40</th>\n",
" <th>41</th>\n",
" <th>42</th>\n",
" <th>43</th>\n",
" <th>44</th>\n",
" <th>45</th>\n",
" <th>46</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>one</th>\n",
" <td>0.195867</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.124643</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>go</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.108125</td>\n",
" <td>0.00000</td>\n",
" <td>0.194626</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.097313</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.176932</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.243282</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>honey</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.238871</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>come</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.557366</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>john</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.33442</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>football</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.342767</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.457023</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>summers</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.209012</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>dogs</th>\n",
" <td>0.195867</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.457023</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>sky</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.238871</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mary</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.137107</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.137107</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>sun</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.238871</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>soccer</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.124643</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.171383</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>temperature</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.209012</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>play</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.356679</td>\n",
" <td>0.000000</td>\n",
" <td>0.267509</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.17834</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.107004</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>fourth</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.209012</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>points</th>\n",
" <td>0.000000</td>\n",
" <td>0.152009</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>champion</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.152009</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>new</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.152341</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.137107</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>came</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.278683</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>hot</th>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.243282</td>\n",
" <td>0.0</td>\n",
" <td>0.216251</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.139018</td>\n",
" <td>0.00000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.121641</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.088466</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 ... 44 45 46\n",
"one 0.195867 0.000000 0.000000 ... 0.000000 0.0 0.0\n",
"go 0.000000 0.000000 0.000000 ... 0.000000 0.0 0.0\n",
"honey 0.000000 0.000000 0.000000 ... 0.000000 0.0 0.0\n",
"come 0.000000 0.000000 0.557366 ... 0.000000 0.0 0.0\n",
"john 0.000000 0.000000 0.000000 ... 0.000000 0.0 0.0\n",
"football 0.000000 0.000000 0.000000 ... 0.457023 0.0 0.0\n",
"summers 0.000000 0.000000 0.000000 ... 0.000000 0.0 0.0\n",
"dogs 0.195867 0.000000 0.000000 ... 0.000000 0.0 0.0\n",
"sky 0.000000 0.000000 0.000000 ... 0.000000 0.0 0.0\n",
"mary 0.000000 0.000000 0.000000 ... 0.000000 0.0 0.0\n",
"sun 0.000000 0.000000 0.000000 ... 0.000000 0.0 0.0\n",
"soccer 0.000000 0.000000 0.000000 ... 0.000000 0.0 0.0\n",
"temperature 0.000000 0.000000 0.000000 ... 0.000000 0.0 0.0\n",
"play 0.000000 0.000000 0.356679 ... 0.000000 0.0 0.0\n",
"fourth 0.000000 0.000000 0.000000 ... 0.000000 0.0 0.0\n",
"points 0.000000 0.152009 0.000000 ... 0.000000 0.0 0.0\n",
"champion 0.000000 0.000000 0.000000 ... 0.000000 0.0 0.0\n",
"new 0.000000 0.000000 0.000000 ... 0.000000 0.0 0.0\n",
"came 0.000000 0.000000 0.000000 ... 0.000000 0.0 0.0\n",
"hot 0.000000 0.000000 0.000000 ... 0.000000 0.0 0.0\n",
"\n",
"[20 rows x 47 columns]"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docmatrix = wordmatrix.drop('row_total',axis=1).transpose()\n",
"docmatrix.head(20)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"_uuid": "b0faf9f72dba6c6d8818a138ad63afa855d0500b"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 8.95888464e-05 2.76529704e-17 6.13529096e-03 ... 2.56707478e-03\n",
" 1.20033102e-01 2.52412431e-02]\n",
" [ 4.78461873e-03 3.11884218e-16 8.31705855e-02 ... 1.18805761e-02\n",
" 1.00836739e-02 9.19754643e-03]\n",
" [ 3.76244258e-05 3.66114798e-18 1.86434271e-03 ... 9.41357344e-04\n",
" -9.27008638e-03 4.21242977e-02]\n",
" ...\n",
" [ 7.45553963e-05 -4.97149341e-19 2.04161638e-02 ... -5.02181538e-03\n",
" 6.43200159e-03 -1.60633851e-03]\n",
" [ 3.07983294e-04 1.07233950e-16 1.33662644e-02 ... -1.03918531e-02\n",
" -1.73499623e-02 -6.79046585e-02]\n",
" [ 1.48413985e-02 -1.62528810e-17 1.41559840e-02 ... -1.89267967e-02\n",
" -2.14631483e-03 -5.68034601e-02]]\n"
]
}
],
"source": [
"svdT = TruncatedSVD(n_components=5, n_iter=30, random_state=0)\n",
"XT = svd.fit_transform(docmatrix)\n",
"print(XT)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"_uuid": "6d4bed234080a54b8367c7aa7fa75bea18bf1fa7"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(47, 15)\n",
"(201, 15)\n"
]
}
],
"source": [
"print(X.shape)\n",
"print(XT.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "546398237a264979de6b51d6353427885bffbdb7"
},
"source": [
"## Topic Creating"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"_uuid": "0b7661ba2d1c134f168c8bae6b2f519476ab2f51"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0 0 0 5 0 0 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 1 0 2 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 2 0 0 0 0\n",
" 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 4 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 4 0 0 4 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0]\n"
]
}
],
"source": [
"topic = GaussianMixture(n_components=6)\n",
"topic.fit(XT)\n",
"topic_label = topic.predict(XT)\n",
"print(topic_label)"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "1748472c5745a3c046ccd085abca21001de2b07e"
},
"source": [
"## Mark keywords"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"_uuid": "a40c23a1930c2b481c620614afa638cc4cbd6e8b"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>one</th>\n",
" <th>stop</th>\n",
" <th>willie</th>\n",
" <th>likes</th>\n",
" <th>watching</th>\n",
" <th>score</th>\n",
" <th>record</th>\n",
" <th>wonderful</th>\n",
" <th>russian</th>\n",
" <th>derek</th>\n",
" <th>see</th>\n",
" <th>next</th>\n",
" <th>diego</th>\n",
" <th>think</th>\n",
" <th>san</th>\n",
" <th>yankees</th>\n",
" <th>university</th>\n",
" <th>efficient</th>\n",
" <th>behave</th>\n",
" <th>warm</th>\n",
" <th>wildcats</th>\n",
" <th>defend</th>\n",
" <th>bob</th>\n",
" <th>higher</th>\n",
" <th>lower</th>\n",
" <th>fell</th>\n",
" <th>tigers</th>\n",
" <th>documentaries</th>\n",
" <th>said</th>\n",
" <th>rain</th>\n",
" <th>bigger</th>\n",
" <th>pretty</th>\n",
" <th>two</th>\n",
" <th>basketball</th>\n",
" <th>signed</th>\n",
" <th>unusual</th>\n",
" <th>omnivorous</th>\n",
" <th>nyc</th>\n",
" <th>team</th>\n",
" <th>college</th>\n",
" <th>...</th>\n",
" <th>player</th>\n",
" <th>sons</th>\n",
" <th>animals</th>\n",
" <th>prefer</th>\n",
" <th>even</th>\n",
" <th>ucla</th>\n",
" <th>washed</th>\n",
" <th>spots</th>\n",
" <th>antelope</th>\n",
" <th>friend</th>\n",
" <th>gave</th>\n",
" <th>small</th>\n",
" <th>game</th>\n",
" <th>watch</th>\n",
" <th>brutal</th>\n",
" <th>ny</th>\n",
" <th>batting</th>\n",
" <th>colder</th>\n",
" <th>hottest</th>\n",
" <th>cold</th>\n",
" <th>syracuse</th>\n",
" <th>used</th>\n",
" <th>cats</th>\n",
" <th>bad</th>\n",
" <th>visit</th>\n",
" <th>preference</th>\n",
" <th>look</th>\n",
" <th>snow</th>\n",
" <th>predicted</th>\n",
" <th>lions</th>\n",
" <th>near</th>\n",
" <th>zebras</th>\n",
" <th>bronx</th>\n",
" <th>biggest</th>\n",
" <th>blue</th>\n",
" <th>animal</th>\n",
" <th>world</th>\n",
" <th>play</th>\n",
" <th>come</th>\n",
" <th>baseball</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>topic</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>5</td>\n",
" <td>5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" one stop willie likes ... world play come baseball\n",
"topic 0 0 0 0 ... 4 5 5 5\n",
"\n",
"[1 rows x 201 columns]"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"topic_table = pd.DataFrame([topic_label], columns=docmatrix.index)\n",
"topic_table.index = ['topic']\n",
"topic_table = topic_table.sort_values(by='topic',axis=1)\n",
"topic_table"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "18e591344de20f356b0cbe988f1bff6a3f469607"
},
"source": [
"\n",
"# Visualization T-SNE\n",
"We can transform the topicmatrix into a two dimension matrix(by T-SNE), so we can use visualization to see the distribution of the words\n",
"\n",
"### For more about visualization check [here](https://seaborn.pydata.org/examples/index.html)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"_uuid": "c6d58353327ecedf4f4aa7ae48e4295d737d0344"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[-6.76369238e+00 -5.23719358e+00]\n",
" [-1.67009659e+01 4.06276655e+00]\n",
" [ 9.69947052e+00 4.90797329e+00]\n",
" [ 2.69338369e+00 9.73977661e+00]\n",
" [-1.46976109e+01 8.20248276e-02]\n",
" [-1.10127535e+01 -1.03571663e+01]\n",
" [ 4.10937220e-01 -8.36442280e+00]\n",
" [-1.14654980e+01 -3.54592824e+00]\n",
" [-7.25076723e+00 9.18633556e+00]\n",
" [-2.76946354e+00 -4.80147314e+00]\n",
" [-7.31669664e+00 9.18250370e+00]\n",
" [ 7.31961203e+00 1.11965094e+01]\n",
" [-3.75270277e-01 -8.73570251e+00]\n",
" [ 2.20890927e+00 1.02454376e+01]\n",
" [ 9.14071202e-01 -8.89315796e+00]\n",
" [ 3.30642247e+00 -1.17555773e+00]\n",
" [ 5.34020901e-01 8.16595495e-01]\n",
" [ 1.36187887e+00 2.85489774e+00]\n",
" [-8.84150887e+00 -1.39988155e+01]\n",
" [-2.80860472e+00 -1.74235344e+01]\n",
" [ 3.32997537e+00 -1.70275831e+00]\n",
" [-7.76730490e+00 1.57236481e+01]\n",
" [ 1.62287664e+00 2.76791954e+00]\n",
" [-5.68997955e+00 7.00568151e+00]\n",
" [-2.16177678e+00 2.78121781e+00]\n",
" [ 1.22636814e+01 1.26761093e+01]\n",
" [ 5.72865868e+00 -1.46843576e+01]\n",
" [-5.23921251e+00 2.72362804e+00]\n",
" [ 3.38269901e+00 2.69616437e+00]\n",
" [ 9.26695883e-01 8.32612932e-01]\n",
" [-7.21816492e+00 -7.85041714e+00]\n",
" [ 6.32113743e+00 -2.22701859e+00]\n",
" [ 3.42581129e+00 2.32126689e+00]\n",
" [-9.15271664e+00 2.77823186e+00]\n",
" [-3.17294097e+00 -1.31882982e+01]\n",
" [ 1.21698685e+01 -4.63682604e+00]\n",
" [-3.77047801e+00 8.55199051e+00]\n",
" [-1.94692171e+00 -1.09491110e+00]\n",
" [-8.01215363e+00 -1.04986782e+01]\n",
" [ 9.99329090e+00 5.59132481e+00]\n",
" [-1.06247787e+01 6.81042671e+00]\n",
" [-5.65561485e+00 2.74425292e+00]\n",
" [ 6.06679678e+00 -1.50778990e+01]\n",
" [ 6.78572035e+00 2.09338164e+00]\n",
" [-1.38848057e+01 3.25072408e-01]\n",
" [-9.11798763e+00 3.37251043e+00]\n",
" [-2.69640899e+00 1.06007576e+01]\n",
" [-3.00288391e+00 -1.32994747e+01]\n",
" [ 2.83526564e+00 -2.72858381e+00]\n",
" [-4.88417768e+00 2.37430021e-01]\n",
" [-3.82611084e+00 -4.47712755e+00]\n",
" [-6.38326645e+00 3.52848172e+00]\n",
" [-1.69757423e+01 -5.32381916e+00]\n",
" [ 9.17365456e+00 -4.94543123e+00]\n",
" [ 1.13256431e+00 3.73961896e-01]\n",
" [ 9.24816132e+00 5.58956575e+00]\n",
" [ 1.97273344e-02 6.55266166e-01]\n",
" [ 1.04740744e+01 9.35600185e+00]\n",
" [-3.58551323e-01 -9.41965485e+00]\n",
" [ 2.42659235e+00 1.04883881e+01]\n",
" [ 2.27327394e+00 1.71218929e+01]\n",
" [ 2.45221210e+00 -2.02946544e+00]\n",
" [-6.47739697e+00 2.62399864e+00]\n",
" [-4.28759718e+00 -4.16109228e+00]\n",
" [ 1.22859869e+01 -4.79285908e+00]\n",
" [ 1.15511978e+00 -9.38829708e+00]\n",
" [ 2.53705001e+00 1.73369102e+01]\n",
" [ 5.03878784e+00 -6.29178858e+00]\n",
" [ 1.39305186e+00 5.33637142e+00]\n",
" [-1.00780745e+01 6.80247593e+00]\n",
" [-2.22521687e+00 1.04381666e+01]\n",
" [-1.17351627e+01 -6.49035025e+00]\n",
" [-6.94979286e+00 1.56634007e+01]\n",
" [-2.53020406e+00 -9.01576281e-01]\n",
" [-6.27615070e+00 2.73640800e+00]\n",
" [ 3.42472291e+00 -9.45588684e+00]\n",
" [-1.69640770e+01 -5.27505159e+00]\n",
" [ 3.01991868e+00 -2.34384823e+00]\n",
" [-1.67197151e+01 4.05641174e+00]\n",
" [-1.63477957e-02 -4.69600248e+00]\n",
" [ 3.35871100e-01 -3.59765244e+00]\n",
" [ 3.54437184e+00 -2.71062398e+00]\n",
" [ 1.34254837e+01 -2.71529317e-01]\n",
" [ 2.24320388e+00 5.54694223e+00]\n",
" [-8.70687389e+00 -1.42778625e+01]\n",
" [ 6.62417603e+00 -2.22884536e+00]\n",
" [ 7.20978451e+00 -2.45873466e-01]\n",
" [ 1.59380598e+01 5.25363731e+00]\n",
" [ 3.34709239e+00 -9.42370605e+00]\n",
" [-7.66405535e+00 1.51872244e+01]\n",
" [-2.07075810e+00 1.08273439e+01]\n",
" [-9.59236908e+00 -7.88743305e+00]\n",
" [-2.13446355e+00 2.86787629e+00]\n",
" [-5.17639494e+00 6.98901367e+00]\n",
" [ 9.73122978e+00 1.52654290e-01]\n",
" [-1.39989414e+01 1.05925426e-01]\n",
" [-3.79200292e+00 -1.74596214e+01]\n",
" [ 9.70897579e+00 -6.85149479e+00]\n",
" [-7.31533861e+00 1.58900242e+01]\n",
" [-8.90886974e+00 -1.44815464e+01]\n",
" [ 6.75021744e+00 -1.33337049e+01]\n",
" [ 6.63654280e+00 2.23209786e+00]\n",
" [-1.12692368e+00 5.21087646e+00]\n",
" [-4.64081907e+00 5.83946928e-02]\n",
" [-6.85378361e+00 -1.40862694e+01]\n",
" [-9.68304920e+00 -7.97464609e+00]\n",
" [-5.72583771e+00 -5.54998112e+00]\n",
" [ 1.58530302e+01 5.28470325e+00]\n",
" [ 5.36959124e+00 5.46039200e+00]\n",
" [ 1.02024851e+01 -8.94726336e-01]\n",
" [ 9.61873055e-01 -4.21893406e+00]\n",
" [-5.86169481e+00 9.82300043e-01]\n",
" [-1.91152930e+00 -1.19391167e+00]\n",
" [ 9.34802532e+00 -5.74689388e+00]\n",
" [ 6.37762737e+00 1.20903530e+01]\n",
" [ 1.43955934e+00 6.23657703e+00]\n",
" [-9.13607121e+00 -1.41936016e+01]\n",
" [-6.14869022e+00 -6.10750198e+00]\n",
" [ 1.39690580e+01 -3.96857810e+00]\n",
" [ 7.17538238e-01 -6.91330731e-02]\n",
" [ 1.18261445e+00 5.87704372e+00]\n",
" [-7.33197117e+00 9.18039131e+00]\n",
" [ 3.46225572e+00 2.36977673e+00]\n",
" [-5.54567194e+00 3.21173453e+00]\n",
" [ 9.96276855e+00 -8.01652241e+00]\n",
" [-2.75596762e+00 -1.31205482e+01]\n",
" [-6.67024469e+00 -1.96669734e+00]\n",
" [-1.36015749e+00 6.75465727e+00]\n",
" [-1.23532498e+00 5.22279024e+00]\n",
" [ 5.47984505e+00 1.00690508e+01]\n",
" [ 3.35373461e-01 -9.37738132e+00]\n",
" [ 5.45475674e+00 -4.43892908e+00]\n",
" [ 6.57593298e+00 1.04888430e+01]\n",
" [-4.42287874e+00 4.51249748e-01]\n",
" [-3.69386196e+00 -4.16496897e+00]\n",
" [-7.91220546e-01 1.50092058e+01]\n",
" [ 9.48171902e+00 5.92798805e+00]\n",
" [-1.17670832e+01 -6.63310385e+00]\n",
" [-1.50458169e+00 1.02150860e+01]\n",
" [ 1.11499968e+01 8.60981178e+00]\n",
" [-2.88192415e+00 1.37214017e+00]\n",
" [-7.81617343e-01 1.49781218e+01]\n",
" [ 2.24863076e+00 6.13130569e+00]\n",
" [ 7.00972223e+00 2.09101796e+00]\n",
" [ 7.14752102e+00 2.32446480e+00]\n",
" [-4.24760294e+00 -3.85383940e+00]\n",
" [-8.04824382e-03 -4.15565968e+00]\n",
" [ 4.87340355e+00 5.16068983e+00]\n",
" [-8.76339531e+00 6.96477699e+00]\n",
" [-1.19139969e+00 -3.51160240e+00]\n",
" [-1.47762418e+00 4.88696432e+00]\n",
" [-1.40584879e+01 2.26603776e-01]\n",
" [ 1.83750582e+00 5.31320381e+00]\n",
" [ 4.57757652e-01 -4.28394890e+00]\n",
" [ 1.00681086e+01 -7.80259657e+00]\n",
" [ 2.65676689e+00 -1.51940906e+00]\n",
" [ 9.75273418e+00 5.72063267e-01]\n",
" [-3.88228893e+00 -3.41816449e+00]\n",
" [ 5.54670274e-01 1.60484390e+01]\n",
" [-1.31385164e+01 2.94271380e-01]\n",
" [-5.81082010e+00 3.61163592e+00]\n",
" [-7.63309777e-01 1.49923639e+01]\n",
" [-1.37693286e+00 5.18780947e+00]\n",
" [ 3.45786065e-01 -8.22964460e-02]\n",
" [ 6.44907856e+00 1.03619080e+01]\n",
" [ 4.87351036e+00 -6.43435383e+00]\n",
" [ 1.96394086e-01 -8.54265690e+00]\n",
" [ 3.90752959e+00 -2.28260159e+00]\n",
" [-3.47586489e+00 -3.71008039e+00]\n",
" [-1.75634441e+01 -4.30430508e+00]\n",
" [ 5.87243748e+00 -1.51246157e+01]\n",
" [-2.30654788e+00 9.84119701e+00]\n",
" [ 2.17870817e-01 -5.07623053e+00]\n",
" [-4.77789927e+00 -2.72243571e+00]\n",
" [ 5.54232025e+00 -5.10954571e+00]\n",
" [-2.36396241e+00 -1.20529199e+00]\n",
" [-4.30674887e+00 5.71991861e-01]\n",
" [-2.10557055e+00 -6.36856437e-01]\n",
" [-5.99617624e+00 -6.30832195e+00]\n",
" [ 3.88176203e+00 -1.58213949e+00]\n",
" [ 1.82775033e+00 6.39122868e+00]\n",
" [-7.12198830e+00 1.51440859e+01]\n",
" [-9.64028388e-03 2.95027375e-01]\n",
" [-3.12942529e+00 -1.73486118e+01]\n",
" [ 1.59521770e+01 5.06241131e+00]\n",
" [ 9.14473438e+00 4.85175562e+00]\n",
" [ 4.14039969e-01 -9.83350754e+00]\n",
" [ 1.21542998e-03 1.77156246e+00]\n",
" [ 6.14672375e+00 1.09232817e+01]\n",
" [-3.20450282e+00 -1.34724445e+01]\n",
" [ 2.21267533e+00 1.71683025e+01]\n",
" [ 6.31269455e+00 1.09341211e+01]\n",
" [ 5.48121405e+00 6.88574433e-01]\n",
" [-2.13179946e-01 2.49058890e+00]\n",
" [ 7.74996638e-01 -4.80834055e+00]\n",
" [ 7.38340425e+00 -2.62660599e+00]\n",
" [-1.68176174e+00 -1.74774418e+01]\n",
" [ 6.60064697e+00 -1.57822809e+01]\n",
" [ 3.70529890e+00 2.63849926e+00]\n",
" [-1.71577606e+01 -5.00703430e+00]\n",
" [ 4.92427826e+00 -1.56783533e+01]]\n"
]
}
],
"source": [
"from sklearn.manifold import TSNE\n",
"tsne = TSNE(n_components=2, n_iter=300)\n",
"coordinates = tsne.fit_transform(docmatrix)\n",
"print(coordinates)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"_uuid": "667092285f17e31037791d778c155e6ba8a473c3"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>x</th>\n",
" <th>y</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-6.763692</td>\n",
" <td>-5.237194</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-16.700966</td>\n",
" <td>4.062767</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>9.699471</td>\n",
" <td>4.907973</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2.693384</td>\n",
" <td>9.739777</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>-14.697611</td>\n",
" <td>0.082025</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" x y\n",
"0 -6.763692 -5.237194\n",
"1 -16.700966 4.062767\n",
"2 9.699471 4.907973\n",
"3 2.693384 9.739777\n",
"4 -14.697611 0.082025"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wordmap = pd.DataFrame(coordinates, columns=['x','y'])\n",
"wordmap.head()"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"_uuid": "d7f6f16a36ca101005e916498923c58f034c7302"
},
"outputs": [],
"source": [
"from sklearn.cluster import KMeans\n",
"word_label = KMeans(n_clusters=3).fit(wordmap).labels_"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"_uuid": "18e33b44d245b5832bc92daa7b1312859ed712d3"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>x</th>\n",
" <th>y</th>\n",
" <th>word</th>\n",
" <th>cluster</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-6.763692</td>\n",
" <td>-5.237194</td>\n",
" <td>one</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-16.700966</td>\n",
" <td>4.062767</td>\n",
" <td>go</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>9.699471</td>\n",
" <td>4.907973</td>\n",
" <td>honey</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2.693384</td>\n",
" <td>9.739777</td>\n",
" <td>come</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>-14.697611</td>\n",
" <td>0.082025</td>\n",
" <td>john</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" x y word cluster\n",
"0 -6.763692 -5.237194 one 0\n",
"1 -16.700966 4.062767 go 2\n",
"2 9.699471 4.907973 honey 1\n",
"3 2.693384 9.739777 come 2\n",
"4 -14.697611 0.082025 john 2"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wordmap['word'] = docmatrix.index\n",
"wordmap['cluster'] = word_label\n",
"wordmap.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "b464280fefa9549ab36e407f21283d990925fec5"
},
"source": [
"## Draw map"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"_uuid": "4db228f5cbbc67f8dd61dabb08fe004606619412"
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1440x1440 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import seaborn as sns \n",
"from matplotlib import pyplot as plt\n",
"plt.figure(figsize=(20, 20))\n",
"sns.scatterplot('x','y',hue='cluster',palette=\"Set1\",s=150, data=wordmap)\n",
"for n in range(len(wordmap)):\n",
" plt.annotate(wordmap['word'][n],\n",
" xy=(wordmap['x'][n],wordmap['y'][n]),\n",
" xytext=(2,5), textcoords='offset points', fontsize=16)"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "64093ae3429ee46e59e2f0b02c913d6e3f71799f"
},
"source": [
"# Auto Text Mining with TextBlob\n",
"With professional textmining Python library, we don't have to create each step manually.\n",
"\n",
"The reason why I create detailed steps is to help you get better understanding of the machanism behind the Textmining Technology.\n",
"\n",
"In the production environment, it is better to use the tools to simplify you analysis\n",
"\n",
"### For your convinience, I restart from scrach.\n",
"\n",
"See more on textblob official [website](https://textblob.readthedocs.io/en/dev/classifiers.html)\n",
"\n",
"<p style='color:red'> To use textblob you need to install it first, in command line, run \"pip install textblob\", or check the ofiicial website for help <p>"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "04174f08f58a4d4ddd2a17fc51fa7684e2546aaf"
},
"source": [
"## Read data"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"_uuid": "980e0231ba3b00ae8bb2ae9fd4e68de2ab1f8654"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Target_Subject</th>\n",
" <th>TextField</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A</td>\n",
" <td>Bob has two dogs and one cat. The cat is bigg...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>S</td>\n",
" <td>Carmelo Anthony scored 42 points to lead the N...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>S</td>\n",
" <td>Come play baseball with us.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>S</td>\n",
" <td>Derek Jeter, the captain of the New York Yanke...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>S</td>\n",
" <td>Do you have a baseball or a football that we c...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Target_Subject TextField\n",
"0 A Bob has two dogs and one cat. The cat is bigg...\n",
"1 S Carmelo Anthony scored 42 points to lead the N...\n",
"2 S Come play baseball with us.\n",
"3 S Derek Jeter, the captain of the New York Yanke...\n",
"4 S Do you have a baseball or a football that we c..."
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"train_data = pd.read_csv('../input/WeatherAnimalsSports.csv')\n",
"score_data = pd.read_csv('../input/Score_WeatherAnimalSports.csv')\n",
"train_data.head()"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"_uuid": "7415c4336cf9ea7cd8f133ce238081240e5c1d78"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>TextField</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>We have a dog in our house. His name is Princ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>I spend a lot of time on the weekend watching ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>The World Cup in soccer is held every four yea...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>The 2013 World Series was won by the Boston Re...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>The winter weather has been very harsh in many...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" TextField\n",
"0 We have a dog in our house. His name is Princ...\n",
"1 I spend a lot of time on the weekend watching ...\n",
"2 The World Cup in soccer is held every four yea...\n",
"3 The 2013 World Series was won by the Boston Re...\n",
"4 The winter weather has been very harsh in many..."
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"score_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "3aed8027605284cf8c9825d83292204e5c7b9a90"
},
"source": [
"## Transform data into textblob required format"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"_uuid": "c482ec88bb7a366a6337295d7bed6c5b77ff3344"
},
"outputs": [
{
"data": {
"text/plain": [
"[('Bob has two dogs and one cat. The cat is bigger than either of the dogs.',\n",
" 'A'),\n",
" ('Carmelo Anthony scored 42 points to lead the NY Knicks basketball team to a win over the Florida Pelicans.',\n",
" 'S'),\n",
" ('Come play baseball with us.', 'S'),\n",
" ('Derek Jeter, the captain of the New York Yankees baseball team, said 2014 will be his last season playing.',\n",
" 'S'),\n",
" ('Do you have a baseball or a football that we could play with? You can be on my team.',\n",
" 'S')]"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_blob = list(zip(train_data['TextField'],train_data['Target_Subject'])) # transform format\n",
"train_blob[:5] # check format"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"_uuid": "462354c6c94055625bfd4d07f0f6adf1b90b6b6c"
},
"outputs": [],
"source": [
"from textblob.classifiers import NaiveBayesClassifier\n",
"cl = NaiveBayesClassifier(train_blob)"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "7c80e302222af71d62dcfeafac8f84e3d36cbfe5"
},
"source": [
"## Scoring"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"_uuid": "2a762df11ed7ec5c4503f807b816a589fb9d97e4"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"We have a dog in our house. His name is Princey and he is a part of our family.\n",
"A\n"
]
}
],
"source": [
"score0 = cl.classify(score_data['TextField'][0])\n",
"print(score_data['TextField'][0])\n",
"print(score0)"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"_uuid": "151c497e1484494fa45c956c5badaa833aedc203"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>TextField</th>\n",
" <th>prediction</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>We have a dog in our house. His name is Princ...</td>\n",
" <td>A</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>I spend a lot of time on the weekend watching ...</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>The World Cup in soccer is held every four yea...</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>The 2013 World Series was won by the Boston Re...</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>The winter weather has been very harsh in many...</td>\n",
" <td>W</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Yesterday, I watched a documentary about the b...</td>\n",
" <td>A</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>In our neighborhood, one man has 5 small dogs ...</td>\n",
" <td>A</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>We have a problem with feral cats.</td>\n",
" <td>A</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Professional basketball players are paid enorm...</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>We have friends who live in a rural area and t...</td>\n",
" <td>A</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>We have had 5 inches of snow since this morning.</td>\n",
" <td>W</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>Rainy weather in the summer can be very pleasa...</td>\n",
" <td>W</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>I could watch elephants all day. They are suc...</td>\n",
" <td>A</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>I would like to spend summers in Maine and win...</td>\n",
" <td>A</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>Leopards are nocturnal hunters.</td>\n",
" <td>A</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>I watched a nature movie about bears hibernati...</td>\n",
" <td>W</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" TextField prediction\n",
"0 We have a dog in our house. His name is Princ... A\n",
"1 I spend a lot of time on the weekend watching ... S\n",
"2 The World Cup in soccer is held every four yea... S\n",
"3 The 2013 World Series was won by the Boston Re... S\n",
"4 The winter weather has been very harsh in many... W\n",
"5 Yesterday, I watched a documentary about the b... A\n",
"6 In our neighborhood, one man has 5 small dogs ... A\n",
"7 We have a problem with feral cats. A\n",
"8 Professional basketball players are paid enorm... S\n",
"9 We have friends who live in a rural area and t... A\n",
"10 We have had 5 inches of snow since this morning. W\n",
"11 Rainy weather in the summer can be very pleasa... W\n",
"12 I could watch elephants all day. They are suc... A\n",
"13 I would like to spend summers in Maine and win... A\n",
"14 Leopards are nocturnal hunters. A\n",
"15 I watched a nature movie about bears hibernati... W"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"scores = [cl.classify(sentence) for sentence in score_data['TextField']]\n",
"score_data['prediction'] = scores\n",
"score_data"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "62966739641ac58668e1669fd2cdf04979361f39"
},
"source": [
"## Sentiment Analysis"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"_uuid": "63f8230baefbf841ca16e77e032ce79150207f3e"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"I spend a lot of time on the weekend watching sports shows: football, baseball, basketball and soccer are all fun for me.\n"
]
}
],
"source": [
"text0 = score_data['TextField'][1]\n",
"print(text0)"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "28fd7b5ba9c8f9cd0f4b73615170d6739d595e9a"
},
"source": [
"Sentiment analysis for the whole score data"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"_uuid": "7e94e749f32ba47a0f7d6539b82a73d37753fb6b"
},
"outputs": [
{
"data": {
"text/plain": [
"0.3"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from textblob import TextBlob\n",
"blob0 = TextBlob(text0)\n",
"blob0.sentiment.polarity"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"_uuid": "e174015fe38fb843ff76c4477ae0085be59be70e"
},
"outputs": [],
"source": [
"# Sentiment analysis for the whole score data\n",
"sentiments = []\n",
"for statement in score_data['TextField']:\n",
"    blob = TextBlob(statement)\n",
"    sentiment = blob.sentiment.polarity  # polarity ranges from -1 (negative) to 1 (positive)\n",
"    sentiments.append(sentiment)\n",
"score_data['sentiment'] = sentiments\n",
"score_data"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "23cb991da752a87aa75253c6664d2bbcc94f2a83"
},
"source": [
"Good lubck on your study!\n",
"\n",
"![](https://www.calliopegifts.co.uk/img/product/new-job-good-luck-in-your-new-job-flittered-3004553-0.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "8cca9e3bbb0b3a05b5c6799307a80da4446d6048"
},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 1
}