{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Gutenburg NLP Analysis:\n",
"\n",
"### Objective: Show case nlp capabilties of nvstrings+cudf\n",
"\n",
"### Pre-Processing :\n",
"* filter punctuation\n",
"* to_lower\n",
"* remove stop words (from nltk corpus)\n",
"* remove multiple spaces with one\n",
"* remove leading and trailing spaces \n",
" \n",
"### Word Count: \n",
"* Get Frequency count for the whole dataset\n",
"* Compare word count for two authors (Albert Einstein vs Charles Dickens )\n",
"* Get Word counts for all the authors\n",
"\n",
"### Encode the word-count for all authors into a count-vector\n",
"\n",
"We do this in two steps:\n",
"\n",
"1. Encode the string Series using `top 20k` most used `words` in the Dataset which we calculated earlier.\n",
" * We encode anything not in the series to string_id = `20_000` (`threshold`)\n",
"\n",
"\n",
"2. With the encoded count series for all authors, we create an aligned word-count vector for them, where:\n",
" * Where each column corresponds to a `word_id` from the the `top 20k words`\n",
" * Each row corresponds to the `count vector` for that author\n",
" \n",
" \n",
"### Find the nearest authors using the count-vector:\n",
"* Fit a knn\n",
"* Find the authors nearest to each other in the count vector space\n",
"* Decrease dimunitonality using UMAP\n",
"* Find the authors nearest to each other in the latent space"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data Download Links:\n",
"\n",
"Download the data from: https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import cudf\n",
"import nvcategory\n",
"import os\n",
"import numpy as np\n",
"import nvtext\n",
"import cuml\n",
"import nvstrings\n",
"import nltk\n",
"from numba import cuda"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Set Data Dir "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"data_dir = '../../gutenburg/txt'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read Text Frame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Read helper functions"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def get_non_empty_lines(lines):\n",
" \"\"\"\n",
" returns non empty lines from a list of lines\n",
" \"\"\"\n",
" clean_lines = []\n",
" for line in lines:\n",
" str_line = line.strip()\n",
" if str_line:\n",
" clean_lines.append(str_line) \n",
" return clean_lines\n",
"\n",
"def get_txt_lines(data_dir):\n",
" \"\"\"\n",
" Read text lines from gutenberg texts\n",
" returns (text_ls,fname_ls) where \n",
" text_ls= input_text_lines and fname_ls = list of fnames\n",
" \"\"\"\n",
" text_ls = []\n",
" fname_ls = []\n",
" for fn in os.listdir(data_dir):\n",
" full_fn = os.path.join(data_dir,fn)\n",
" with open(full_fn,encoding=\"utf-8\",errors=\"ignore\") as f:\n",
" content = f.readlines()\n",
" content = get_non_empty_lines(content)\n",
" text_ls += content\n",
" ### dont add .txt to the file\n",
" fname_ls += [fn[:-4]]*len(content)\n",
" \n",
" return text_ls, fname_ls"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read text lines into a cudf dataframe"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"File Read Time:\n",
"CPU times: user 8.02 s, sys: 908 ms, total: 8.93 s\n",
"Wall time: 8.93 s\n",
"\n",
"CUDF Creation Time:\n",
"CPU times: user 2.08 s, sys: 845 ms, total: 2.92 s\n",
"Wall time: 3.2 s\n",
"Number of lines in the DF = 19,259,957\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>author</th>\n",
" <th>title</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>THE STORY OF THE CHAMPIONS OF THE ROUND TABLE</td>\n",
" <td>Howard Pyle</td>\n",
" <td>The Story of the Champions of the Round Table</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Written and Illustrated by</td>\n",
" <td>Howard Pyle</td>\n",
" <td>The Story of the Champions of the Round Table</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>HOWARD PYLE.</td>\n",
" <td>Howard Pyle</td>\n",
" <td>The Story of the Champions of the Round Table</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>In 1902 the distinguished American artist Howa...</td>\n",
" <td>Howard Pyle</td>\n",
" <td>The Story of the Champions of the Round Table</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>and illustrate the legend of King Arthur and t...</td>\n",
" <td>Howard Pyle</td>\n",
" <td>The Story of the Champions of the Round Table</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" text author \\\n",
"0 THE STORY OF THE CHAMPIONS OF THE ROUND TABLE Howard Pyle \n",
"1 Written and Illustrated by Howard Pyle \n",
"2 HOWARD PYLE. Howard Pyle \n",
"3 In 1902 the distinguished American artist Howa... Howard Pyle \n",
"4 and illustrate the legend of King Arthur and t... Howard Pyle \n",
"\n",
" title \n",
"0 The Story of the Champions of the Round Table \n",
"1 The Story of the Champions of the Round Table \n",
"2 The Story of the Champions of the Round Table \n",
"3 The Story of the Champions of the Round Table \n",
"4 The Story of the Champions of the Round Table "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(\"File Read Time:\")\n",
"%time txt_ls,fname_ls = get_txt_lines(data_dir)\n",
"df = cudf.DataFrame()\n",
"\n",
"print(\"\\nCUDF Creation Time:\")\n",
"%time df['text'] = nvstrings.to_device(txt_ls)\n",
"\n",
"df['label'] = nvstrings.to_device(fname_ls)\n",
"title_label_df = df['label'].str.split('___')\n",
"df['author'] = title_label_df[0]\n",
"\n",
"df['title'] = title_label_df[1]\n",
"df = df.drop(labels=['label'])\n",
"\n",
"print(\"Number of lines in the DF = {:,}\".format(len(df)))\n",
"df.head(5).to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## NLP Preprocessing\n",
"\n",
"In almost every workflow involving textual data, we'll want to do some kind of preprocessing before running our analysis. We might want to remove punctuation, standardize to all lowercase characters, and potentially dozens of other small tasks. RAPIDS makes developing GPU accelerated preprocessing pipelines smooth.\n",
"\n",
"Let's start by removing all the punctuation, since we don't want those characters to cloud our analysis. We could replace them one by one in many calls to replace. More realistically, we might generate a large regular expression pattern that looks for `!`, `,`, `%` and all of our other patterns and replaces them. It might look something like this: `(!)|(,)...|(%)`.\n",
"\n",
"A longer regex may or may not be less efficient necessarily on the GPU. If an instruction within the regex fails to match the current character being processed for the string, the rest of the expression does not need to be evaluated and we can move on to the next character. However, regexes with many alternation as in our case, may mean evaluating the same character over many more instructions before continuing. An alternation can be explicit like in `(\\bone\\b)|(\\b1\\b)` but also can be implicit like in `[aA]`.\n",
"\n",
"\n",
"This can be tedious, and isn't well suited to the GPU. \n",
"\n",
"Overall, avoiding regex can be more efficient since the algorithm is complex due to the richness of its features. \n",
"\n",
"For cases like removing multiple `characters` or `stop words`, a `general regex` can be overkill and `nvtext` provides some alternative methods which make this computation much faster. \n",
"\n",
"In this workflow we use the following `nvtext` functions:\n",
"* `nvstrings.replace_multi`: To replace the punctuations with a blank space.\n",
"* `nvtext.replace_tokens`: To replace the tokens with a empty space.\n",
"\n",
"Please checkout our [nightly docs](https://docs.rapids.ai/api/nvstrings/nightly/), we are adding more features everyday. \n"
]
},
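{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, here is a minimal sketch (toy data, not benchmarked, assuming the same cudf/nvstrings versions used in this notebook) contrasting the regex route with `replace_multi`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy sketch: regex alternation vs. replace_multi (assumes the cudf version above)\n",
"toy = cudf.Series(['Hello, world!', 'GPUs: fast; regex: costly'])\n",
"\n",
"# Regex route: one alternation per filter character\n",
"toy_regex = toy.str.replace(r'(!)|(,)|(;)|(:)', ' ', regex=True)\n",
"\n",
"# replace_multi route: plain multi-string replacement, no regex engine involved\n",
"toy_multi = toy.str.replace_multi(['!', ',', ';', ':'], ' ', regex=False)\n",
"\n",
"print(toy_regex.to_pandas())\n",
"print(toy_multi.to_pandas())"
]
},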
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Now back to our workflow:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Removing Filters:\n",
"First, we need to define our list of filter characters."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# remove the following punctuations/characters from cudf\n",
"filters = [ '!', '\"', '#', '$', '%', '&', '(', ')', '*', '+', '-', '.', '/', '\\\\', ':', ';', '<', '=', '>',\n",
" '?', '@', '[', ']', '^', '_', '`', '{', '|', '}', '\\~', '\\t','\\\\n',\"'\",\",\",'~' , '—']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we can simply pass `filters` to the string processing methods inside cuDF and apply it to our Series. We'll eventually make a helper function to let us execute this on every column in the DataFrame. But, let's just take a quick look now on a sample of our text data."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 THE STORY OF THE CHAMPIONS OF THE ROUND TABLE\n",
"1 Written and Illustrated by\n",
"2 HOWARD PYLE.\n",
"3 In 1902 the distinguished American artist Howa...\n",
"4 and illustrate the legend of King Arthur and t...\n",
"Name: text, dtype: object"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text_col_sample = df.head(5)\n",
"text_col_sample['text'].to_pandas()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 THE STORY OF THE CHAMPIONS OF THE ROUND TABLE\n",
"1 Written and Illustrated by\n",
"2 HOWARD PYLE \n",
"3 In 1902 the distinguished American artist Howa...\n",
"4 and illustrate the legend of King Arthur and t...\n",
"Name: text_clean, dtype: object"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text_col_sample['text_clean'] = text_col_sample['text'].str.replace_multi(filters, ' ', regex=False)\n",
"text_col_sample['text_clean'].to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With one method we removed all of the symbols in our `filters` list. Next, we'll want to convert to lowercase with `str.lower()`, just like we used `replace_multi`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### To Lower"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 the story of the champions of the round table\n",
"1 written and illustrated by\n",
"2 howard pyle \n",
"3 in 1902 the distinguished american artist howa...\n",
"4 and illustrate the legend of king arthur and t...\n",
"Name: text_clean, dtype: object"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text_col_sample['text_clean'] = text_col_sample['text_clean'].str.lower()\n",
"text_col_sample['text_clean'].to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also remove stopwords with `replace_tokens`. The `nvtext` library makes this easy. We can pass the default list of English stopwords that ships with the `nltk` library. We'll replace each of our stopwords with a single space."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Remove Stop Words"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package stopwords to /root/nltk_data...\n",
"[nltk_data] Package stopwords is already up-to-date!\n"
]
}
],
"source": [
"nltk.download('stopwords')\n",
"STOPWORDS = nltk.corpus.stopwords.words('english')\n",
"STOPWORDS = nvstrings.to_device(STOPWORDS)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 story champions round table\n",
"1 written illustrated \n",
"2 howard pyle \n",
"3 1902 distinguished american artist howard ...\n",
"4 illustrate legend king arthur knight...\n",
"Name: text_clean, dtype: object"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text_col_sample['text_clean'] = nvtext.replace_tokens(text_col_sample['text_clean'].data, STOPWORDS, ' ')\n",
"text_col_sample['text_clean'].to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Replacing Multiple White Spaces\n",
"\n",
"This looks great, but we'll probably want to replace multiple spaces in a row with a single space and strip leading and trailing spaces. We can do that easily, too.\n",
"\n",
"Replacing multiple spaces with a single space is a common operation, so we're making this even faster the above regex with a new feature coming soon (keep an eye on [this Github issue](https://github.com/rapidsai/custrings/issues/374) for the latest info)."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 story champions round table\n",
"1 written illustrated\n",
"2 howard pyle\n",
"3 1902 distinguished american artist howard pyle...\n",
"4 illustrate legend king arthur knights round\n",
"Name: text_clean, dtype: object"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text_col_sample['text_clean'] = text_col_sample['text_clean'].str.replace(r\"\\s+\", ' ',regex=True)\n",
"text_col_sample['text_clean'] = text_col_sample['text_clean'].str.strip(' ')\n",
"text_col_sample['text_clean'].to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With that, we've finished our basic preprocessing steps on a tiny sample of our text column. We'll wrap this into a function for portability, and run it on the entire data. We'll rewrite our code to create our filter list and stopwords again for clarity."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Full Pre-processing Pipe-Line"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"STOPWORDS = nltk.corpus.stopwords.words('english')\n",
"\n",
"filters = [ '!', '\"', '#', '$', '%', '&', '(', ')', '*', '+', '-', '.', '/', '\\\\', ':', ';', '<', '=', '>',\n",
" '?', '@', '[', ']', '^', '_', '`', '{', '|', '}', '\\~', '\\t','\\\\n',\"'\",\",\",'~' , '—']\n",
"\n",
"def preprocess_text(input_strs , filters=None , stopwords=STOPWORDS):\n",
" \"\"\"\n",
" * filter punctuation\n",
" * to_lower\n",
" * remove stop words (from nltk corpus)\n",
" * remove multiple spaces with one\n",
" * remove leading spaces \n",
" \"\"\"\n",
" \n",
" # filter punctuation and case conversion\n",
" input_strs = input_strs.str.replace_multi(filters, ' ', regex=False)\n",
" input_strs = input_strs.str.lower()\n",
" \n",
" # remove stopwords\n",
" stopwords_gpu = nvstrings.to_device(stopwords)\n",
" input_strs = nvtext.replace_tokens(input_strs.data, stopwords_gpu, ' ')\n",
" input_strs = cudf.Series(input_strs)\n",
" \n",
" # replace multiple spaces with single one and strip leading/trailing spaces\n",
" input_strs = input_strs.str.replace(r\"\\s+\", ' ', regex=True)\n",
" input_strs = input_strs.str.strip(' ')\n",
" \n",
" return input_strs\n",
"\n",
"def preprocess_text_df(df, text_cols=['text'], **kwargs):\n",
" for col in text_cols:\n",
" df[col] = preprocess_text(df[col], **kwargs)\n",
" return df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With our function defined, we can execute it to preprocess the entire dataset."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 1.58 s, sys: 726 ms, total: 2.3 s\n",
"Wall time: 2.31 s\n"
]
}
],
"source": [
"%time df = preprocess_text_df(df, filters=filters)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 story champions round table\n",
"1 written illustrated\n",
"2 howard pyle\n",
"3 1902 distinguished american artist howard pyle...\n",
"4 illustrate legend king arthur knights round\n",
"Name: text, dtype: object"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['text'].head(5).to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Word Count\n",
"\n",
"Lets find the top words used in:\n",
"* Whole dataset\n",
"* by Albert Einstein\n",
"* by Charles dickens"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"## Getting a frequency count for Strings\n",
"\n",
"def get_word_count(str_col):\n",
" \"\"\"\n",
" returns the count of input strings\n",
" \"\"\" \n",
" ## Tokenize: convert sentences into a long list of words\n",
" ## Get counts: Groupby each token to get value counts\n",
"\n",
" df = cudf.DataFrame()\n",
" # tokenize sentences into a string using nvtext.tokenize()\n",
" # it into a single tall data-frame\n",
" df['string'] = nvtext.tokenize(str_col.data)\n",
" \n",
" # Using Group by to do a value count for string columns\n",
" # This will be natively supported soon\n",
" # See: issue https://github.com/rapidsai/cudf/issues/1951\n",
"\n",
" df['counts'] = np.dtype('int32').type(0)\n",
" \n",
" res = df.groupby('string').count()\n",
" res = res.reset_index(drop=False).sort_values(by='counts', ascending=False)\n",
" return res.rename({'index':'string'})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Top Words Across the dataset"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 2.85 s, sys: 803 ms, total: 3.65 s\n",
"Wall time: 3.66 s\n"
]
}
],
"source": [
"%%time \n",
"\n",
"count_df = get_word_count(df['text'])\n",
"count_df.head(5).to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Now lets compare Charles Dickens and Albert Einstein"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Albert Einstein"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>string</th>\n",
" <th>counts</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>3002</th>\n",
" <td>theory</td>\n",
" <td>248</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2517</th>\n",
" <td>relativity</td>\n",
" <td>223</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2790</th>\n",
" <td>space</td>\n",
" <td>190</td>\n",
" </tr>\n",
" <tr>\n",
" <th>415</th>\n",
" <td>body</td>\n",
" <td>168</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3030</th>\n",
" <td>time</td>\n",
" <td>160</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" string counts\n",
"3002 theory 248\n",
"2517 relativity 223\n",
"2790 space 190\n",
"415 body 168\n",
"3030 time 160"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"einstein_df = df[df['author'].str.contains('Einstein')]\n",
"einstein_count_df = get_word_count(einstein_df['text'])\n",
"einstein_count_df.head(5).to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Charles Dickens"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>string</th>\n",
" <th>counts</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>28485</th>\n",
" <td>mr</td>\n",
" <td>34408</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36881</th>\n",
" <td>said</td>\n",
" <td>32643</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29906</th>\n",
" <td>one</td>\n",
" <td>17684</td>\n",
" </tr>\n",
" <tr>\n",
" <th>47972</th>\n",
" <td>would</td>\n",
" <td>15266</td>\n",
" </tr>\n",
" <tr>\n",
" <th>45593</th>\n",
" <td>upon</td>\n",
" <td>14657</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" string counts\n",
"28485 mr 34408\n",
"36881 said 32643\n",
"29906 one 17684\n",
"47972 would 15266\n",
"45593 upon 14657"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"charles_dickens_df = df[df['author'].str.contains('Charles Dickens')]\n",
"charles_dickens_count_df = get_word_count(charles_dickens_df['text'])\n",
"charles_dickens_count_df.head(5).to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"So Einstein is talking about relativity, with words like `relativity`,`theory`,`body` ,\n",
"while Charles Dickens is telling stories with `once`, `upon`, `time` , `old`\n",
"\n",
"Our Word Count seems to be working :-D"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Word Counts for all the authors"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Lets get the list of authors for our dataframe"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Abraham Lincoln\n",
"1 Agatha Christie\n",
"2 Albert Einstein\n",
"3 Aldous Huxley\n",
"4 Alexander Pope\n",
"Name: author, dtype: object"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['author'].unique().to_pandas().head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Calculate the word count for all authors into a list"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 40.4 s, sys: 16.6 s, total: 57 s\n",
"Wall time: 59 s\n"
]
}
],
"source": [
"%%time\n",
"author_wc_ls = []\n",
"author_name_ls = []\n",
"for author_name in df['author'].unique():\n",
" df_auth = df[df['author']==author_name]\n",
" author_wc = get_word_count(df_auth['text'])\n",
" author_wc_ls.append(author_wc)\n",
" author_name_ls.append(author_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Encode the word-count `series` list for all authors into a count-vector\n",
"\n",
"We do this in two steps:\n",
"\n",
"1. Encode the string Series using`top 20k` most used `words` in the Dataset which we calculated earlier.\n",
" * We encode anything not in the series to string_id = `20_000` (threshold)\n",
"\n",
"\n",
"2. With the encoded count series for all authors, we create an aligned word-count vector for them, where:\n",
" * Where each column corresponds to a `word_id` from the the `top 20k words`\n",
" * Each row corresponds to the `count vector` for that author"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Categorize the `string series` from the `word count series` into a `integer series` for all the authors "
]
},
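{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before the full helpers, here is a toy sketch (hypothetical strings, using the same `nvcategory` calls as below) of key-based encoding: codes are positions in `keys`, and any string absent from `keys` maps to `-1`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy sketch of nvcategory key-based encoding (same calls as the helper below)\n",
"toy_keys = nvstrings.to_device(['apple', 'banana', 'cherry'])\n",
"toy_strs = nvstrings.to_device(['banana', 'durian', 'apple'])\n",
"toy_cat = nvcategory.from_strings(toy_strs).set_keys(toy_keys)\n",
"# values() gives the code for each input string; 'durian' is not a key -> -1\n",
"print(toy_cat.values())  # expected: [1, -1, 0]"
]
},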
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"def str_to_cat(str_s,keys): \n",
" \"\"\"\n",
" Cast string columm to category(int) using nvcategory\n",
" Codes are index of keys\n",
" any string not in keys is encoded to -1\n",
" \"\"\"\n",
" from librmm_cffi import librmm\n",
" \n",
" cat = nvcategory.from_strings(str_s.data).set_keys(keys)\n",
" device_array = librmm.device_array(str_s.data.size(), dtype=np.int32) \n",
" cat.values(devptr=device_array.device_ctypes_pointer.value)\n",
" \n",
" return cudf.Series(device_array)\n",
"\n",
"def encode_count_df(auth_wc_df,keys,out_of_dict_id):\n",
" \"\"\"\n",
" Encode the count series for all authors by using the index provided in keys\n",
" All strings not in keys are mapped to out_of_dict_id and their count is summed\n",
" \"\"\"\n",
" # any string not in keys is encoded to -1\n",
" auth_wc_df['encoded_str_id'] = str_to_cat(auth_wc_df['string'],keys)\n",
" \n",
" # sub df which contains words that are in the dictionary\n",
" in_dict_wc_df = auth_wc_df[auth_wc_df['encoded_str_id']!=-1]\n",
" \n",
" # sum of `count series` of words not in dictionary \n",
" out_of_dict_wcount = auth_wc_df[auth_wc_df['encoded_str_id']==-1]['counts'].sum()\n",
" \n",
" # mapping out the count of words to -1\n",
" out_of_dict_df = cudf.DataFrame({'encoded_str_id':out_of_dict_id,'counts': out_of_dict_wcount,'string':'other'})\n",
" \n",
" # by default cudf creates 64 bit arrays from dict\n",
" # remap them to 32 bits to line up with in_dict_wc_df\n",
" out_of_dict_df['encoded_str_id'] = out_of_dict_df['encoded_str_id'].astype(np.int32)\n",
" out_of_dict_df['counts'] = out_of_dict_df['counts'].astype(np.int32)\n",
" \n",
" return cudf.concat([in_dict_wc_df,out_of_dict_df])"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 6.41 s, sys: 144 ms, total: 6.55 s\n",
"Wall time: 6.7 s\n"
]
}
],
"source": [
"%%time\n",
"# keep only top 20k words in the dataset\n",
"th = 20_000\n",
"keys = count_df['string'][:th].data\n",
"encoded_wc_ls = []\n",
"\n",
"for auth_wc_df in author_wc_ls:\n",
" encoded_count_df = encode_count_df(auth_wc_df,keys,th)\n",
" encoded_wc_ls.append(encoded_count_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Now lets check if the encoding worked ! "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Agatha Christie Counts"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Agatha Christie\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>string</th>\n",
" <th>counts</th>\n",
" <th>encoded_str_id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>6490</th>\n",
" <td>said</td>\n",
" <td>738</td>\n",
" <td>15455</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7862</th>\n",
" <td>tuppence</td>\n",
" <td>626</td>\n",
" <td>18503</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7718</th>\n",
" <td>tommy</td>\n",
" <td>597</td>\n",
" <td>18137</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5159</th>\n",
" <td>one</td>\n",
" <td>505</td>\n",
" <td>12506</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4894</th>\n",
" <td>mr</td>\n",
" <td>455</td>\n",
" <td>11880</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" string counts encoded_str_id\n",
"6490 said 738 15455\n",
"7862 tuppence 626 18503\n",
"7718 tommy 597 18137\n",
"5159 one 505 12506\n",
"4894 mr 455 11880"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"author_id = author_name_ls.index('Agatha Christie') \n",
"print(author_name_ls[author_id])\n",
"encoded_wc_ls[author_id].head(5).to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Charles Dickens Counts"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Charles Dickens\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>string</th>\n",
" <th>counts</th>\n",
" <th>encoded_str_id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>28485</th>\n",
" <td>mr</td>\n",
" <td>34408</td>\n",
" <td>11880</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36881</th>\n",
" <td>said</td>\n",
" <td>32643</td>\n",
" <td>15455</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29906</th>\n",
" <td>one</td>\n",
" <td>17684</td>\n",
" <td>12506</td>\n",
" </tr>\n",
" <tr>\n",
" <th>47972</th>\n",
" <td>would</td>\n",
" <td>15266</td>\n",
" <td>19819</td>\n",
" </tr>\n",
" <tr>\n",
" <th>45593</th>\n",
" <td>upon</td>\n",
" <td>14657</td>\n",
" <td>18861</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" string counts encoded_str_id\n",
"28485 mr 34408 11880\n",
"36881 said 32643 15455\n",
"29906 one 17684 12506\n",
"47972 would 15266 19819\n",
"45593 upon 14657 18861"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"author_id = author_name_ls.index('Charles Dickens') \n",
"print(author_name_ls[author_id])\n",
"encoded_wc_ls[author_id].head(5).to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### We can see that the encoded_str_id for `said` is `15455` for both `Charles Dickens` and `Agatha Christie`. Yaay! the encoding worked"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create a aligned word-count vector for each author:\n",
"\n",
"We create a dataframe, where a row represents a `author` and the columnss contain the count of the `words` respresented by that `column`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Create a numba nd-array of shape (`num_authors`,`Vocablary Size+1`))"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"num_authors = len(encoded_wc_ls)\n",
"count_ary = np.zeros(shape = (num_authors,th+1), dtype=np.int32)\n",
"count_dary = cuda.to_device(count_ary)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fill the count array using a numba function:\n",
"\n",
"Apply the numba function to fill the `author_count_array` with the count of words used by the `author`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`Numba Function`: See https://numba.pydata.org/numba-doc/0.13/CUDAJit.html for more `info` on how to write `cuda-jit` functions."
]
},
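{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a warm-up, here is a self-contained `cuda.jit` sketch (toy array, hypothetical kernel) of the launch pattern used below: one thread per element, with `cuda.grid(1)` giving the global thread index and a bounds check guarding the spare threads in the last block."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Warm-up example of the cuda.jit launch pattern used in the next cell\n",
"@cuda.jit\n",
"def add_one(arr):\n",
"    pos = cuda.grid(1)  # global thread index\n",
"    if pos < arr.size:  # guard the threads past the end of the array\n",
"        arr[pos] += 1\n",
"\n",
"toy_dary = cuda.to_device(np.zeros(8, dtype=np.int32))\n",
"threadsperblock = 4\n",
"blockspergrid = (toy_dary.size + threadsperblock - 1) // threadsperblock\n",
"add_one[blockspergrid, threadsperblock](toy_dary)\n",
"print(toy_dary.copy_to_host())  # expected: [1 1 1 1 1 1 1 1]"
]
},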
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 184 ms, sys: 3.87 ms, total: 187 ms\n",
"Wall time: 185 ms\n"
]
}
],
"source": [
"%%time\n",
"\n",
"@cuda.jit('void(int32[:], int32[:], int32[:])')\n",
"def count_vec_func(author_token_id_array,author_token_count_array,author_count_array):\n",
" \n",
" pos = cuda.grid(1)\n",
" if pos < author_token_id_array.size:\n",
" token_id = author_token_id_array[pos]\n",
" token_count = author_token_count_array[pos]\n",
" author_count_array[token_id] = token_count \n",
" \n",
"for author_id,encoded_wc_df in enumerate(encoded_wc_ls): \n",
" count_sr = encoded_wc_df['counts']\n",
" token_id_sr = encoded_wc_df['encoded_str_id']\n",
" \n",
" count_ar = count_sr.data.to_gpu_array()\n",
" token_id_ar = token_id_sr.data.to_gpu_array()\n",
" author_ar = count_dary[author_id]\n",
" \n",
" # See https://numba.pydata.org/numba-doc/0.13/CUDAJit.html\n",
" threadsperblock = 36\n",
" blockspergrid = (count_ar.size + (threadsperblock - 1)) // threadsperblock\n",
" count_vec_func[blockspergrid, threadsperblock](token_id_ar,count_ar,author_ar)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Now, Lets check if creating the count vectors worked !"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Agatha Christie\n",
"15455 : 738\n",
"18503 : 626\n",
"18137 : 597\n",
"12506 : 505\n",
"11880 : 455\n"
]
}
],
"source": [
"author_id = author_name_ls.index('Agatha Christie') \n",
"\n",
"print(author_name_ls[author_id])\n",
"top_word_ids = encoded_wc_ls[author_id]['encoded_str_id'].head(5).to_pandas()\n",
"for word_id in top_word_ids:\n",
" print(\"{} : {}\".format(word_id,count_dary[author_id][word_id]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Lets find the Nearest Authors \n",
"\n",
"Now your count df is ready for ML\n",
"\n",
"Let's train a KNN on the count-df and see if we can find any interesting patterns in count_df. Though `euclidian distance` is not the best measure for these higher dimensional spaces but it still works as a small toy example. \n"
]
},
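{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside (not used in the rest of this workflow): a common workaround for the euclidean-distance caveat is to L2-normalize the count vectors, since for unit vectors ||a - b||^2 = 2 - 2*cos(a, b), so euclidean KNN then ranks neighbors exactly as cosine similarity would. A minimal sketch on the host, assuming `count_dary` from above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical alternative (not part of the original workflow): L2-normalize\n",
"# so euclidean distance on the unit sphere is monotonic in cosine similarity.\n",
"counts_host = count_dary.copy_to_host().astype(np.float64)\n",
"l2 = np.linalg.norm(counts_host, axis=1, keepdims=True)\n",
"cosine_ready = counts_host / np.where(l2 == 0, 1, l2)"
]
},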
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Normalize Counts"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"normalized_count_array = count_dary/np.sum(count_dary,axis=1)[:,None]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Train and find nearest_neighours on the non embedded space"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 1.86 s, sys: 323 ms, total: 2.19 s\n",
"Wall time: 890 ms\n"
]
}
],
"source": [
"%%time\n",
"nn_model = cuml.neighbors.NearestNeighbors(n_neighbors = 5)\n",
"nn_model.fit(normalized_count_array)\n",
"ouput_mat,output_indices_count_sp = nn_model.kneighbors(X=normalized_count_array)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Nearest authors to Albert Einstein in the count vector space"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Albert Einstein\n",
"Thomas Carlyle\n",
"Lord Byron\n",
"James Russell Lowell\n",
"Michael Faraday\n"
]
}
],
"source": [
"author_id = author_name_ls.index('Albert Einstein') \n",
"for index in output_indices_count_sp[author_id]:\n",
" print(author_name_ls[int(index)])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Nearest authors to Charles Dickens in the count vector space"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Charles Dickens\n",
"Winston Churchill\n",
"Charlotte Mary Yonge\n",
"Harriet Elizabeth Beecher Stowe\n",
"William Dean Howells\n"
]
}
],
"source": [
"author_id = author_name_ls.index('Charles Dickens') \n",
"for index in output_indices_count_sp[author_id]:\n",
" print(author_name_ls[int(index)])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Encode the count vecotrs to a lower dimention using Umap\n",
"**Currently [cuml.umap](https://rapidsai.github.io/projects/cuml/en/latest/api.html#umap) does not support the random seed so results may change on muliple runs"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"embedding_ar_gpu = cuml.UMAP(n_neighbors=100,n_components=3).fit_transform(normalized_count_array)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### KNN in the lower dimentional space"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 941 ms, sys: 102 ms, total: 1.04 s\n",
"Wall time: 105 ms\n"
]
}
],
"source": [
"%%time\n",
"nn_model = cuml.neighbors.NearestNeighbors(n_neighbors = 5)\n",
"nn_model.fit(embedding_ar_gpu)\n",
"ouput_mat,output_indices_umap = nn_model.kneighbors(X=embedding_ar_gpu)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Nearest authors to Albert Einstein in the emdedded space"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Albert Einstein\n",
"Thomas Crofton Croker\n",
"Robert Hooke\n",
"John Milton\n",
"Thomas Carlyle\n"
]
}
],
"source": [
"author_id = author_name_ls.index('Albert Einstein') \n",
"for index in output_indices_umap[author_id]:\n",
" print(author_name_ls[int(index)])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Nearest authors to Charles Dickens in the emdedded space"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Charles Dickens\n",
"Harriet Elizabeth Beecher Stowe\n",
"Lucy Maud Montgomery\n",
"Agatha Christie\n",
"Louisa May Alcott\n"
]
}
],
"source": [
"author_id = author_name_ls.index('Charles Dickens') \n",
"for index in output_indices_umap[author_id]:\n",
" print(author_name_ls[int(index)])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Want to get started with RAPIDS? Check out [`cuDF`](https://github.com/rapidsai/cudf) on Github and let us know what you think! You can download pre-built Docker containers for our 0.8 release from [NGC](https://ngc.nvidia.com/catalog/landing) or [Dockerhub](https://hub.docker.com/r/rapidsai/rapidsai/) to get started, or install it yourself via Conda. Need something even easier? You can quickly get started with RAPIDS in [Google Colab](https://colab.research.google.com/drive/1XTKHiIcvyL5nuldx0HSL_dUa8yopzy_Y#forceEdit=true&offline=true&sandboxMode=true) and try out all the new things we've added with just a single push of a button.\n",
"\n",
"Don't want to wait for the next release to use upcoming features? You can download our nightly containers from [Dockerhub](https://hub.docker.com/r/rapidsai/rapidsai-nightly) or install via [Conda](https://anaconda.org/rapidsai-nightly) to stay at the tip of our development branch."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}