{
"cells": [
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The autoreload extension is already loaded. To reload it, use:\n",
" %reload_ext autoreload\n"
]
}
],
"source": [
"import numpy as np\n",
"import random\n",
"from utils.treebank import StanfordSentiment\n",
"from word2vec import getNegativeSamples\n",
"\n",
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## what is dataset?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have special class `StanfordSentiment` for handling data. For simplicity we created toy files, including `datasetSentences_small.txt` contains first 10 lines of the original file including 9 reviews:"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"sentence_index\tsentence\r\n",
"1\tThe Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .\r\n",
"2\tThe gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer\\/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .\r\n"
]
}
],
"source": [
"!head -n 3 ~/data/cs224n/stanfordSentimentTreebank/datasetSentences_small.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`StanfordSentiment` has a lot of methods. Here we briefly describe main ones and in next chaptes we'll describe other methods. We have (for each those fields we have a method):\n",
"\n",
"- `_tokens` - dictionary that contains all unique words from our corpus and their index;\n",
"- `_sentences` - list of lists, where an inner list contains normalized and splitted reviews;"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[('the', 0), ('rock', 1), ('is', 2), ('destined', 3), ('to', 4)]"
]
},
"execution_count": 109,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset = StanfordSentiment()\n",
"dataset.tokens() # build dictionary of _tokens\n",
"list(dataset._tokens.items())[:5]"
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['the', 'rock', 'is', 'destined', 'to']"
]
},
"execution_count": 110,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset._sentences[0][:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use dataset to generate negative and positive examples. We describe this in details below. Finally we see how data are fed into our model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## how do we sample negative words?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In [slp3, 6.8.2] we may read that: \"The noise words are chosen according to their \n",
" weighted unigram frequency $P_\\alpha(w)$, where $\\alpha$ is a weight\". It's computed using formula with $\\alpha=.75$: "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$P_\\alpha(w) = \\frac{count(w)^\\alpha}{\\sum count(w^{'})^\\alpha}$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To sample negative words we use `getNegativeSamples()`. The only purpose of this function is clearly stated in its description: \"Samples K indexes which are not the outsideWordIdx\". We should remember that: \"... skipgram uses more negative examples than positive examples, the ratio set by a parameter k.\"\n",
"\n",
"Internally `getNegativeSamples()` calls `dataset.sampleTokenIdx()` which is just chooses random token index from `sampleTable`."
]
},
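{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch (the exact code in `word2vec.py` may differ, and the name `get_negative_samples_sketch` is just for this notebook), a function matching that description keeps drawing random token indices until it has `K` of them that differ from `outsideWordIdx`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_negative_samples_sketch(outsideWordIdx, dataset, K):\n",
"    \"\"\"Sample K token indices, none of which equals outsideWordIdx (sketch).\"\"\"\n",
"    indices = []\n",
"    while len(indices) < K:\n",
"        idx = dataset.sampleTokenIdx()  # random index drawn from the sample table\n",
"        if idx != outsideWordIdx:\n",
"            indices.append(idx)\n",
"    return indices"
]
},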
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### what is `sampleTable`?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have to build a table of weighted probabilities $P_\\alpha$ using formula above. We start from row frequencies.\n",
"\n",
"We may find frequencies for words in `dataset._tokenfreq` (this is just count of word appearance in our corpus, for example, `the` appeared 12 times):"
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('the', 12), ('rock', 1), ('is', 4), ('destined', 1), ('to', 8)]"
]
},
"execution_count": 111,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(dataset._tokenfreq.items())[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have to weight our frequencies (see `[slp3], 6.31`):"
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {},
"outputs": [],
"source": [
"nTokens = len(dataset.tokens())\n",
"samplingFreq = np.zeros((nTokens,))\n",
"dataset.allSentences()\n",
"i = 0\n",
"for w in range(nTokens):\n",
" w = dataset._revtokens[i]\n",
" if w in dataset._tokenfreq:\n",
" freq = 1.0 * dataset._tokenfreq[w]\n",
" # Reweight\n",
" freq = freq ** 0.75\n",
" else:\n",
" freq = 0.0\n",
" samplingFreq[i] = freq\n",
" i += 1"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([6.44741959, 1. , 2.82842712, 1. , 4.75682846])"
]
},
"execution_count": 113,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"samplingFreq[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We may see that for `the` we have `6.44741959` instead of 12. Let's check that this is correct:"
]
},
{
"cell_type": "code",
"execution_count": 114,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"6.4474195909412515"
]
},
"execution_count": 114,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"12 ** .75"
]
},
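{
"cell_type": "markdown",
"metadata": {},
"source": [
"To turn these weighted counts into the probabilities $P_\alpha(w)$ from the formula above we only need to normalize them. We compute this here just to illustrate the formula (this is not necessarily how `treebank.py` does it internally):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"probs = samplingFreq / samplingFreq.sum()  # weighted unigram distribution P_alpha\n",
"probs[:5], probs.sum()"
]
},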
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We may use these weighted frequencies for negative sampling. But for some reason we build another table with size much bigger than our vocabulary. For example for our 135 words we may use table of size 1000 (specified in `__init__` method). This table contains sequencies of the same index for all our words (we skip here technical details of how it can be build):"
]
},
{
"cell_type": "code",
"execution_count": 115,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],\n",
" [133, 133, 133, 133, 133, 134, 134, 134, 134, 134])"
]
},
"execution_count": 115,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.sampleTable()[30:50], dataset.sampleTable()[-10:]"
]
},
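{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch (not necessarily the exact code in `treebank.py`; `slots` and `sketchTable` are names used only in this notebook), such a table can be built by giving each word index a block of slots roughly proportional to its weighted frequency:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tableSize = 1000\n",
"probs = samplingFreq / samplingFreq.sum()\n",
"# each word index gets roughly probs[i] * tableSize consecutive slots;\n",
"# the total length is close to tableSize, up to rounding\n",
"slots = np.round(probs * tableSize).astype(int)\n",
"sketchTable = np.repeat(np.arange(nTokens), slots)\n",
"sketchTable[30:50], sketchTable[-10:]"
]
},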
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we may sample from this table:"
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"127"
]
},
"execution_count": 116,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.sampleTokenIdx()"
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[132, 69, 96, 36, 4]"
]
},
"execution_count": 117,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"getNegativeSamples(outsideWordIdx=0, dataset=dataset, K=5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## how do we use dataset?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use dataset to produce both positive and negative examples: \n",
"- We produce negative examples using helper function `getNegativeSamples()` which in turn calls `dataset.sampleTokenIdx()`. We produce them inside `negSamplingLossAndGradient`.\n",
"We described generation of negative samples above. \n",
"- We produce positive examples with `dataset.getRandomContext()` in `word2vec_sgd_wrapper`. `dataset.getRandomContext()` produces pair `centerword, context`:"
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"('mindset', ['film'])"
]
},
"execution_count": 118,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset.getRandomContext()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We produce this pair at random: first we choose a sentence at random from all sentences and then we choose center words also at random. In example below we choose `wordID=7`, so:\n",
"\n",
"- `centerword=sent[7]` or `'21st'`;\n",
"- context with `window_size=2` is `['be', 'the', 'century', \"'s\"]`;"
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['the', 'rock', 'is', 'destined', 'to', 'be', 'the', '21st', 'century', \"'s\"]"
]
},
"execution_count": 119,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sent = dataset._sentences[0]\n",
"sent[:10]"
]
},
{
"cell_type": "code",
"execution_count": 120,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"wordID=7\n"
]
},
{
"data": {
"text/plain": [
"('21st', ['be', 'the', 'century', \"'s\"])"
]
},
"execution_count": 120,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"random.seed(42)\n",
"wordID = random.randint(0, len(sent) - 1)\n",
"print(f'wordID={wordID}')\n",
"C=2 # window size\n",
"context = sent[max(0, wordID - C):wordID]\n",
"if wordID+1 < len(sent):\n",
" context += sent[wordID+1:min(len(sent), wordID + C + 1)]\n",
"\n",
"centerword = sent[wordID]\n",
"context = [w for w in context if w != centerword]\n",
"\n",
"centerword, context"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But there're some technical details yet again. Instead of using `_sentences` we use `_allsentences`. What is the difference between them? This is not documented. One of the reasons to do this is maybe to reduce using frequent words yet again."
]
},
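{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration of that idea only (this is an assumption, not the actual `allSentences()` code), here is the standard word2vec-style subsampling of frequent words: each word is kept with probability $\min(1, \sqrt{t / f(w)})$, where $f(w)$ is its relative frequency and $t$ is a small threshold:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"t = 1e-3  # threshold: smaller values downsample frequent words more aggressively\n",
"total = sum(dataset._tokenfreq.values())\n",
"\n",
"def keep_prob(word):\n",
"    f = dataset._tokenfreq[word] / total  # relative frequency of the word\n",
"    return min(1.0, (t / f) ** 0.5)       # frequent words are kept less often\n",
"\n",
"subsampled = [[w for w in s if random.random() < keep_prob(w)]\n",
"              for s in dataset._sentences]\n",
"subsampled[0][:10]"
]
},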
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## how data are fed into our model?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So we may see that:\n",
"- positive examples is a pair `centerword, context`;\n",
"- negative examples is a list of indicies ;\n",
"\n",
"Remember that we store dictionary of the form `{word: index}` in `_tokens`."
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('the', 0), ('rock', 1), ('is', 2), ('destined', 3), ('to', 4)]"
]
},
"execution_count": 121,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(dataset._tokens.items())[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In `pytorch` when we're working with `Embedding` layer we use indicies. Why is that? Well, you may read explanation in [nlp-pytorch, chapter 5]: *\"By definition, the weight matrix of a Linear layer that accepts as input this one-hot vector must have the same number of rows as the size of the one-hot vector. When you perform the matrix multiplication, as shown in Figure 5-1, the resulting vector is actually just selecting the row indicated by the non zero entry. Based on this observation, we can just skip the multiplication step and instead directly use an integer as an index to retrieve the selected row.\"*"
]
},
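{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick numpy illustration of that observation (just for this notebook, not part of the assignment code): multiplying a one-hot row vector by a matrix selects exactly the row that plain indexing would give us:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"W = np.random.randn(5, 3)       # a tiny weight / embedding matrix\n",
"one_hot = np.zeros(5)\n",
"one_hot[2] = 1.0                # one-hot vector for index 2\n",
"np.allclose(one_hot @ W, W[2])  # matrix multiplication == row selection -> True"
]
},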
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So in case of negative samples we already have list of indicies that we use to retrive vectors from `outsideVectors` in `negSamplingLossAndGradient`."
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[93, 25, 20, 18, 10]"
]
},
"execution_count": 122,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"negSampleWordIndices = getNegativeSamples(outsideWordIdx=0, dataset=dataset, K=5)\n",
"negSampleWordIndices"
]
},
{
"cell_type": "code",
"execution_count": 133,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((135, 3), array([-0.3853136 , 0.11351735, 0.66213067]))"
]
},
"execution_count": 133,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.random.seed(42)\n",
"outsideVectors = np.random.randn(len(dataset._tokens), 3)\n",
"outsideVectors.shape, outsideVectors[negSampleWordIndices[0], :]"
]
},
{
"cell_type": "code",
"execution_count": 134,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[-0.3853136 , 0.11351735, 0.66213067],\n",
" [ 0.8219025 , 0.08704707, -0.29900735],\n",
" [-0.47917424, -0.18565898, -1.10633497],\n",
" [ 1.03099952, 0.93128012, -0.83921752],\n",
" [-0.60170661, 1.85227818, -0.01349722]])"
]
},
"execution_count": 134,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"negSampledVect = outsideVectors[negSampleWordIndices, :]\n",
"negSampledVect"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In case of positive examples we just convert words to indicies inside `skipgram`."
]
},
{
"cell_type": "code",
"execution_count": 137,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 137,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word2Ind = dataset._tokens\n",
"currentCenterWord = 'the'\n",
"currentCenterWordIdx = word2Ind[currentCenterWord]\n",
"currentCenterWordIdx"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}