@bigsnarfdude
Last active February 6, 2020 17:09
Gensim Python3.5 Updated Yelp dataset WMD tutorial
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Finding similar documents with Word2Vec and WMD \n",
"\n",
"Word Mover's Distance is a promising new tool in machine learning that allows us to submit a query and return the most relevant documents. For example, in a blog post [OpenTable](http://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/) use WMD on restaurant reviews. Using this approach, they are able to mine different aspects of the reviews. In **part 2** of this tutorial, we show how you can use Gensim's `WmdSimilarity` to do something similar to what OpenTable did. In **part 1** shows how you can compute the WMD distance between two documents using `wmdistance`. Part 1 is optional if you want use `WmdSimilarity`, but is also useful in it's own merit.\n",
"\n",
"First, however, we go through the basics of what WMD is.\n",
"\n",
"## Word Mover's Distance basics\n",
"\n",
"WMD is a method that allows us to assess the \"distance\" between two documents in a meaningful way, even when they have no words in common. It uses [word2vec](http://rare-technologies.com/word2vec-tutorial/) [4] vector embeddings of words. It been shown to outperform many of the state-of-the-art methods in *k*-nearest neighbors classification [3].\n",
"\n",
"WMD is illustrated below for two very similar sentences (illustration taken from [Vlad Niculae's blog](http://vene.ro/blog/word-movers-distance-in-python.html)). The sentences have no words in common, but by matching the relevant words, WMD is able to accurately measure the (dis)similarity between the two sentences. The method also uses the bag-of-words representation of the documents (simply put, the word's frequencies in the documents), noted as $d$ in the figure below. The intution behind the method is that we find the minimum \"traveling distance\" between documents, in other words the most efficient way to \"move\" the distribution of document 1 to the distribution of document 2.\n",
"\n",
"<img src='https://vene.ro/images/wmd-obama.png' height='600' width='600'>\n",
"\n",
"\n",
"This method was introduced in the article \"From Word Embeddings To Document Distances\" by Matt Kusner et al. ([link to PDF](http://jmlr.org/proceedings/papers/v37/kusnerb15.pdf)). It is inspired by the \"Earth Mover's Distance\", and employs a solver of the \"transportation problem\".\n",
"\n",
"In this tutorial, we will learn how to use Gensim's WMD functionality, which consists of the `wmdistance` method for distance computation, and the `WmdSimilarity` class for corpus based similarity queries.\n",
"\n",
"> **Note**:\n",
">\n",
"> If you use this software, please consider citing [1], [2] and [3].\n",
">\n",
"\n",
"## Running this notebook\n",
"\n",
"You can download this [iPython Notebook](http://ipython.org/notebook.html), and run it on your own computer, provided you have installed Gensim, PyEMD, NLTK, and downloaded the necessary data.\n",
"\n",
"The notebook was run on an Ubuntu machine with an Intel core i7-4770 CPU 3.40GHz (8 cores) and 32 GB memory. Running the entire notebook on this machine takes about 3 minutes.\n",
"\n",
"## Part 1: Computing the Word Mover's Distance\n",
"\n",
"To use WMD, we need some word embeddings first of all. You could train a word2vec (see tutorial [here](http://rare-technologies.com/word2vec-tutorial/)) model on some corpus, but we will start by downloading some pre-trained word2vec embeddings. Download the GoogleNews-vectors-negative300.bin.gz embeddings [here](https://code.google.com/archive/p/word2vec/) (warning: 1.5 GB, file is not needed for part 2). Training your own embeddings can be beneficial, but to simplify this tutorial, we will be using pre-trained embeddings at first.\n",
"\n",
"Let's take some sentences to compute the distance between."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from time import time\n",
"start_nb = time()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Initialize logging.\n",
"import logging\n",
"logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')\n",
"\n",
"sentence_obama = 'Obama speaks to the media in Illinois'\n",
"sentence_president = 'The president greets the press in Chicago'\n",
"sentence_obama = sentence_obama.lower().split()\n",
"sentence_president = sentence_president.lower().split()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These sentences have very similar content, and as such the WMD should be low. Before we compute the WMD, we want to remove stopwords (\"the\", \"to\", etc.), as these do not contribute a lot to the information in the sentences."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package stopwords to /home/ubuntu/nltk_data...\n",
"[nltk_data] Package stopwords is already up-to-date!\n"
]
}
],
"source": [
"# Import and download stopwords from NLTK.\n",
"from nltk.corpus import stopwords\n",
"from nltk import download\n",
"download('stopwords') # Download stopwords list.\n",
"\n",
"# Remove stopwords.\n",
"stop_words = stopwords.words('english')\n",
"sentence_obama = [w for w in sentence_obama if w not in stop_words]\n",
"sentence_president = [w for w in sentence_president if w not in stop_words]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, as mentioned earlier, we will be using some downloaded pre-trained embeddings. We load these into a Gensim Word2Vec model class. Note that the embeddings we have chosen here require a lot of memory."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cell took 88.06 seconds to run.\n"
]
}
],
"source": [
"start = time()\n",
"import os\n",
"\n",
"from gensim.models import KeyedVectors\n",
"if not os.path.exists('/home/ubuntu/dev/gensim/docs/notebooks/data/w2v_googlenews/GoogleNews-vectors-negative300.bin.gz'):\n",
" raise ValueError(\"SKIP: You need to download the google news model\")\n",
" \n",
"model = KeyedVectors.load_word2vec_format('/home/ubuntu/dev/gensim/docs/notebooks/data/w2v_googlenews/GoogleNews-vectors-negative300.bin.gz', binary=True)\n",
"\n",
"print('Cell took %.2f seconds to run.' % (time() - start))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So let's compute WMD using the `wmdistance` method."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"distance = 3.3741\n"
]
}
],
"source": [
"distance = model.wmdistance(sentence_obama, sentence_president)\n",
"print('distance = %.4f' % distance)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try the same thing with two completely unrelated sentences. Notice that the distance is larger."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"distance = 4.3802\n"
]
}
],
"source": [
"sentence_orange = 'Oranges are my favorite fruit'\n",
"sentence_orange = sentence_orange.lower().split()\n",
"sentence_orange = [w for w in sentence_orange if w not in stop_words]\n",
"\n",
"distance = model.wmdistance(sentence_obama, sentence_orange)\n",
"print('distance = %.4f' % distance)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Normalizing word2vec vectors\n",
"\n",
"When using the `wmdistance` method, it is beneficial to normalize the word2vec vectors first, so they all have equal length. To do this, simply call `model.init_sims(replace=True)` and Gensim will take care of that for you.\n",
"\n",
"Usually, one measures the distance between two word2vec vectors using the cosine distance (see [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity)), which measures the angle between vectors. WMD, on the other hand, uses the Euclidean distance. The Euclidean distance between two vectors might be large because their lengths differ, but the cosine distance is small because the angle between them is small; we can mitigate some of this by normalizing the vectors.\n",
"\n",
"Note that normalizing the vectors can take some time, especially if you have a large vocabulary and/or large vectors.\n",
"\n",
"Usage is illustrated in the example below. It just so happens that the vectors we have downloaded are already normalized, so it won't do any difference in this case."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cell took 20.26 seconds to run.\n"
]
}
],
"source": [
"# Normalizing word2vec vectors.\n",
"start = time()\n",
"\n",
"model.init_sims(replace=True) # Normalizes the vectors in the word2vec class.\n",
"\n",
"distance = model.wmdistance(sentence_obama, sentence_president) # Compute WMD as normal.\n",
"\n",
"print('Cell took %.2f seconds to run.' %(time() - start))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 2: Similarity queries using `WmdSimilarity`\n",
"\n",
"You can use WMD to get the most similar documents to a query, using the `WmdSimilarity` class. Its interface is similar to what is described in the [Similarity Queries](https://radimrehurek.com/gensim/tut3.html) Gensim tutorial.\n",
"\n",
"> **Important note:**\n",
">\n",
 WMD is a measure of">
"> WMD is a measure of *distance*. The similarities in `WmdSimilarity` are simply the *negative distance*. Be careful not to confuse distances and similarities. Two similar documents will have a *high* similarity score and a small distance; two very different documents will have a *low* similarity score and a large distance.\n",
"\n",
"### Yelp data\n",
"\n",
"Let's try similarity queries using some real world data. For that we'll be using Yelp reviews, available at http://www.yelp.com/dataset_challenge. Specifically, we will be using reviews of a single restaurant, namely the [Mon Ami Gabi](http://en.yelp.be/biz/mon-ami-gabi-las-vegas-2).\n",
"\n",
"To get the Yelp data, you need to register by name and email address. The data is 775 MB.\n",
"\n",
"This time around, we are going to train the Word2Vec embeddings on the data ourselves. One restaurant is not enough to train Word2Vec properly, so we use 6 restaurants for that, but only run queries against one of them. In addition to the Mon Ami Gabi, mentioned above, we will be using:\n",
"\n",
"* [Earl of Sandwich](http://en.yelp.be/biz/earl-of-sandwich-las-vegas).\n",
"* [Wicked Spoon](http://en.yelp.be/biz/wicked-spoon-las-vegas).\n",
"* [Serendipity 3](http://en.yelp.be/biz/serendipity-3-las-vegas).\n",
"* [Bacchanal Buffet](http://en.yelp.be/biz/bacchanal-buffet-las-vegas-7).\n",
"* [The Buffet](http://en.yelp.be/biz/the-buffet-las-vegas-6).\n",
"\n",
"The restaurants we chose were those with the highest number of reviews in the Yelp dataset. Incidentally, they all are on the Las Vegas Boulevard. The corpus we trained Word2Vec on has 18957 documents (reviews), and the corpus we used for `WmdSimilarity` has 4137 documents.\n",
"\n",
"Below a JSON file with Yelp reviews is read line by line, the text is extracted, tokenized, and stopwords and punctuation are removed.\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package punkt to /home/ubuntu/nltk_data...\n",
"[nltk_data] Package punkt is already up-to-date!\n"
]
}
],
"source": [
"# Pre-processing a document.\n",
"\n",
"from nltk import word_tokenize\n",
"download('punkt') # Download data for tokenizer.\n",
"\n",
"def preprocess(doc):\n",
" doc = doc.lower() # Lower the text.\n",
" doc = word_tokenize(doc) # Split into words.\n",
" doc = [w for w in doc if not w in stop_words] # Remove stopwords.\n",
" doc = [w for w in doc if w.isalpha()] # Remove numbers and punctuation.\n",
" return doc"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cell took 60.87 seconds to run.\n"
]
}
],
"source": [
"start = time()\n",
"\n",
"import json\n",
"\n",
"# Business IDs of the restaurants.\n",
"\n",
"ids = ['4JNXUYY8wbaaDmk3BPzlWw', 'fE7x3Ui2mzdwdfJnd7r_1g', 'K7lWdNUhCbcnEvI0NhGewg',\n",
" 'eoHdUeQDNgQ6WYEnP2aiRw','RESDUcs7fIiihp38-d6_6g','dYMhfzyZyklXELmYq_wfKg']\n",
"\n",
"# old ids \n",
"# ids = ['4bEjOyTaDG24SY5TxsaUNQ', '2e2e7WgqU1BnpxmQL5jbfw', 'zt1TpTuJ6y9n551sw9TaEg',\n",
"# 'Xhg93cMdemu5pAMkDoEdtQ', 'sIyHTizqAiGu12XMLX3N3g', 'YNQgak-ZLtYJQxlDwN-qIg']\n",
"\n",
"w2v_corpus = [] # Documents to train word2vec on (all 6 restaurants).\n",
"wmd_corpus = [] # Documents to run queries against (only one restaurant).\n",
"documents = [] # wmd_corpus, with no pre-processing (so we can see the original documents).\n",
"with open('/home/ubuntu/dev/gensim/docs/notebooks/data/yelp/review.json') as data_file:\n",
" for line in data_file:\n",
" json_line = json.loads(line)\n",
" #print(json_line)\n",
" \n",
" if json_line['business_id'] not in ids:\n",
" # Not one of the 6 restaurants.\n",
" continue\n",
" \n",
" # Pre-process document.\n",
" text = json_line['text'] # Extract text from JSON object.\n",
" text = preprocess(text)\n",
" \n",
" # Add to corpus for training Word2Vec.\n",
" w2v_corpus.append(text)\n",
" \n",
" if json_line['business_id'] == ids[0]:\n",
" # Add to corpus for similarity queries.\n",
" wmd_corpus.append(text)\n",
" documents.append(json_line['text'])\n",
"\n",
"print('Cell took %.2f seconds to run.' %(time() - start))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a plot with a histogram of document lengths and includes the average document length as well. Note that these are the pre-processed documents, meaning stopwords are removed, punctuation is removed, etc. Document lengths have a high impact on the running time of WMD, so when comparing running times with this experiment, the number of documents in query corpus (about 4000) and the length of the documents (about 62 words on average) should be taken into account."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from matplotlib import pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"# Document lengths.\n",
"lens = [len(doc) for doc in wmd_corpus]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/ubuntu/anaconda3/lib/python3.6/site-packages/matplotlib/__init__.py:800: MatplotlibDeprecationWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.\n",
" mplDeprecation)\n",
"/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:9: MatplotlibDeprecationWarning: pyplot.hold is deprecated.\n",
" Future behavior will be consistent with the long-time default:\n",
" plot commands add elements without first clearing the\n",
" Axes and/or Figure.\n",
" if __name__ == '__main__':\n",
"/home/ubuntu/anaconda3/lib/python3.6/site-packages/matplotlib/__init__.py:805: MatplotlibDeprecationWarning: axes.hold is deprecated. Please remove it from your matplotlibrc and/or style files.\n",
" mplDeprecation)\n",
"/home/ubuntu/anaconda3/lib/python3.6/site-packages/matplotlib/rcsetup.py:155: MatplotlibDeprecationWarning: axes.hold is deprecated, will be removed in 3.0\n",
" mplDeprecation)\n",
"/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:13: MatplotlibDeprecationWarning: pyplot.hold is deprecated.\n",
" Future behavior will be consistent with the long-time default:\n",
" plot commands add elements without first clearing the\n",
" Axes and/or Figure.\n",
" del sys.path[0]\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAfsAAAGPCAYAAABbOHkFAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4wLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvpW3flQAAIABJREFUeJzt3Xu8HEWd///XW8hCIFyCSTZchJgV\n1AUUMSLhCxqUCIguu+Iqygq4CHhjQUAELxBYXUQRQVdXgv5AsgKKiAoqV+UiATQoAspNJKhAbsoC\ngRAw1O+P7pNMJuckc04mnNDn9Xw8+jHTVdXV1ZXJ+Ux1V/eklIIkSWquFwx2AyRJ0qplsJckqeEM\n9pIkNZzBXpKkhjPYS5LUcAZ7SZIazmCvVSbJNUmuGex2DAVJJie5JcmCJCXJuH5uP67e7sBV0sAh\n5vnUn0km1W3dd7DbolXHYK+OJDmw/oOwYx/5X0uy0g9tSLJNkin9DVZDWZINgO8AAQ4D3gPMHdRG\nNUySzerP5XaD3ZaBSvLh58OXD60aaw52A9RobxrANtsAJwDXADO72ZgGexWwIXBSKeX7g92YhtqM\n6nM5E7h1cJsyYB8GZgHnDHI7NAgc2WuVKaU8XUp5erDb0V9Jhg92G/ppTP36f4PaCkmrLYO9Vpne\nrtkneUeSXyZ5LMmjSW5PcnyddyBwfl30Z/Vlg6Wueyb5pyQ3J3kyySNJLk7ysl72vXNd7qkkDyQ5\nJsl7269nJ5mZ5LIkuya5KclTwMda9nVpkgeTLKzrOSXJWm37Oqfez6ZJvp/k8SQPJzmyzt+y3sf8\nuq5D+9GHByW5ra5/TpJzk2zS2sfAt9v67Jre6mrZZuMkF9bt/GuSbwDr91F25yQ/rdv+eJIrk7y2\nl3IbJPl8kj/UffVgkvOSbFrn91wGGte23TLXtrvRn0n+Lsmnktxdt2dWfalpw7ZyPf/+OyT5eao5\nD39KckRLmUnAjfXq2S2fyynL6+c++nPjJFOTPFS36976s5le+uTYJPsnuasue1uS3Xqpc4Wf9SQz\ngZcCr29p/8xlq8qRdZ88leTGJK9qK/D3Sb5e99HC+jguTbJtf/tCzy1P46u/Nkgyqpf0tVe0Yf2H\n6gLgp8BxwCKqP0Cvq4tcB3wF+BDwX8Cddfr0evt3Ad8CfgN8EtiA6hr19CSvLqXcX5d7JXAF8Bfg\nP4GngYOBJ/po2kuAi4CvA/8f8Mc6/d+BvwFfBh4BdgKOBl4EvLutjhcAPwFuAo4B3gl8IcljdVsv\nBn5Y1/m1JL8opfx6Bf11LHBy3S8fBTanOhW7S5JXlVL+D/gMcEdbn81eTp1rA1fXx/zfwP3A24Bz\neyn7OuBK4M/Ap+tjfD9wbZLXl1JursutC1wLbAt8E/glsBHw5no/Dy7vOPsw4P6sA+fFwBuo/k1v\nB/6Bqu9enWSnUsozLft6MXAJ1entb9X7+mKS35VSrqDq0yn1MhW4vt7utv4cUJIx9fEMq+t5GNgF\nOAXYBDiibZO3A6OAM4EFdf7FSbYopfy1rrPTz/oRwFeBR6k+MwDz28ocVbfty/XrR4HvJ3lJS399\nl+rfueezM5rq/+9LqfpZq6tSiovLChfgQKCsaGnb5hrgmpb1L1L9sVljOfvZt65rUlv6MKo/jncB\n67akv4rqS8P/tqT9AHgK2KIlbRTw17rucS3pM+u0f+qlLev0kvZJ4Flgs5a0c+o6PtWSNgJ4rC77\n7y3pm1B/gVhBf4+qj+FnwJot6XvX+/r0ivqsj3oPq8u+pyVtDaovFAU4sCV9BlUQGd2StinwOPDz\nlrQp9bb79bK/tH1+xrXlj+tlvyvVn1RfxJ4Fdm3b15t7Ofaef//dWtLWovrCdGFL2o7t7VxBP/d2\nXFPrev++reznqD7D49q2faSt77er0z80wM/6
XbT8f2xJn1SXvRdYuyX9n+v0ver1Der1ozvpA5fV\na/E0vvrrP4DJvSyXdrDtY8C6DGzi3quBscBXSymLRy2lGs1dBbw5lTXq9vywlPJAS7l5wHl91P3n\nUsoP2xNLKU8CJHlBfZp6FFVQDLB9L/V8vWXb+cBvqf6IT2tJfwj4E9VIc3l2owo6Xyyl/K1l+x8A\ndwN7rWD7vrwFmENLX5RSFlGN5hZLMpaqz79ZSpnbUvbBetudkoysk98O/K6U8q32nZU6SgzQQPvz\nHcA9wO1JRvUswC+oRrNvaNvPvaWUq1rqXEg1Ah+/Em1fSn224e1U/08WtbXrcqozGa9v2+zCtr6/\nler/0Pi6zoF81pfn7FLKUy3r19avPf3wFPAMMCnJRgOoX4PI0/jqr1+WUm5qT0zy9g62/R+qP8Q/\nTvIQVZD+HtUfqxUFhXH161295P2O6gvE+sA6wHCqUUq7e/qo+w+9JSbZGvg81cinfdLehm3rz5RS\nHm5LexSYVZY+ZdyTPpLlG1e/9na8d9ZtGogtgPvqAN/q7n7s/3dUX3g2pxp9/gPVKfVuWpn+3Irq\ntHJftx+OaVt/oJcyjwCv6LCtnRhN1cZ/r5eVaVdPoB1D/z/ry7PU/kopj9RTCTaq1xfWl5Y+B8xO\ncjPwY6qzan9sr0yrF4O9njOllFn1NcbJwB71sj/wkyR7rcQoMCsustxyC5YpWN27/jPgSeDjwH11\nuU2pTjO3nxV7to+624PqitrSiZXdtrd+7k+dvZVd0b9dX/lr9JG+Mv35AqovJIf3UXbeAOpcWT2f\nl/Op5oX0pj1or0y7BtL2Fe6vlHJakoupLidNBj4FfDzJ3qWUqwewTz1HDPZ6TpXqVrwfAT+qT22e\nTDX7fSfgBvoOCjPr15dRTUhq9TKqEc9jVKdpFwBb9lJHb2l92ZVqNDaplNJzOpMkk/tRx8qYWb++\njGVHaS9j4M8gmAlsl2SNttH9VsvZf7uXUf079Yzm7qOatLU8j9Sv7WdExq1gu4G4j+oSxE9LKX19\naeivlX1g1Fyqz+earZcMVtIc+vdZX+mHXgGUaiLs6cDpSV4E/Bo4lmrip1ZTXrPXcybJC1vX65F8\nz4z0ntOwT7St97iF6oEgH0jLffAtZwp+XCqLqGaQ/1OSLVrKjWLZGfTL0xMIW2+JegFwZD/qWBlX\nAQuBw+trsz1teCvVKeofDbDeH1Gd/l3cF3X9h7UWKqXMourz/Vvvvkh1299+wPRSSk8A/y7wj+nl\ncastt5T9vn7dta3IhwZ4HMtzAfD3VLPv29uzZstcg/7o63PZkfpz+V3gX5IsM9+jnhMybAB19uez\n/gQDbH9d7zppewZFKeVPVF86RraU2zjJy/p7PFq1HNnrufT1+g/R1VSTqjal+oP8MEsmA/2K6hTu\ncanuiV4A3FxKuT/VfdbfAm5IMo0lt949SnU6sccJwO7Az5P8D9WkooOpRqsj6WyEcwPVTPRvJvly\nXcfbqWaFr3KllHmp7uM+Gbgqyfeobvk7jOo4Th1g1WcBHwS+Ud9D/QdgH3q/z/5Iqi8dNyU5i+qL\nzweo7ow4uqXc5+s6vlWf+fgl1Qh+T+B44NpSyu+S/Bz4TD25azbwTyy5/txN36K6nfCMJK+n+mwV\nqtsA3051i9kF/azzXqqR+QeSzKe6I+GOUsod/ajjWKpJeDekerbB7VT9vg1V/72E6gttf/Tnsz4D\nOCTJCVRni+aXUi7px762An6a5EKqyZILqe5weDnVbXo9TgYOoLqlcWY/j0erymDfDuDy/FhYcuvU\njn3kf40V33q3D3AZ1R+0hVSngc8BXty23UFUf1z/xrK3L+1NNat6AdUT474PvKyX9ryuLreQauLR\nMSy57ezvW8rNBC7r45heC/ycakQ0m+o+5W17adM5wFO9bH8ZMLOX9FuBmzrs9/dRBYWFVNeazwU2\nbSvT8a13
dflNqUaZ86lu0foG1WS0ZW4to7oP/Gd1H8ynCv7LfAaoAssZVF/inqa6N/9bwCYtZbag\nunf+yfpYvgJsvSr6k2ouwBF1Xs9n5TdU97Rv2lKu13//ug0z29LeWv9bPF23ecpy+nhcH/35Qqpb\nUO+v65lTf8aOBv6ubdtje6l3JnDOAD/rY6kmUj5a582s0yfV6/v2sr/Fx1m3/ctU8yEep/ryM4OW\nWyFb+m6Z2yxdBnfpuQdWarwkZwCHACPKsrPRpcbws652XrNXI7VfW0wymurX4K7zj5+axM+6OuE1\nezXVzCT/S3X/+CZUp8NHACcNaquk7vOzrhUy2KupfkQ1SWss1bX/GVSPc71hUFsldZ+fda2Q1+wl\nSWo4r9lLktRwjTmNP2rUqDJu3LjBbsZq5Znbql/gHPaKbj7iW5K0OrjlllvmlVJGd1K2McF+3Lhx\nzJgxY7CbsVp5cNMXAbCp/SJJjZOktx9L6pWn8SVJajiDvSRJDWewlySp4Qz2kiQ1nMFekqSGM9hL\nktRwHQX7JMcl+WWSx5LMTXJJkm3ayiTJlCQPJVmQ5JokW7eVGZlkWpJH62Va/ZvlrWW2TXJtXceD\nSY5PkpU/VEmShqZOR/aTqH7LeyfgDVTPX74qyUYtZY4BjqL6HeXXUP1O85VJ1mspcx6wPbAnsEf9\nflpPZpL1gSupfjv8NcB/AB8FjuzncUmSpFpHD9Uppezeup7kPcCjwP8DLqlH3kcAny2lXFSXOYAq\n4L8bODPJy6kC/M6llOl1mUOB65O8tJRyN7AfsA5wQCllAXBHvd2RSU4rPshfkqR+G+g1+/XqbR+p\n119M9YtLV/QUqIP1dVRnAwAmAvOB6S313AA80Vbm+nrbHpdT/WzjuAG2VZKkIW2gwf4M4Fbgxnp9\nbP06u63c7Ja8scDc1tF5/X5OW5ne6mjdx2JJDkkyI8mMuXPnDuQ4JElqvH4H+ySnATsD+5RSFrVl\nt59mT1tab6fhV1QmfaRTSplaSplQSpkwenRHvwUgSdKQ069gn+SLwLuAN5RS/tCSNat+bR99j2HJ\nyHwWMKZ1Zn39fnRbmd7qgGVH/JIkqQMd/+pdkjOAfYFJpZS72rLvpwrUk4Ff1uXXBnahmk0P1Sn/\nEVTX5Xuu208E1m1ZvxE4JcnapZSn6rTJwEPAzI6PajW04wmXd73Om07cfcWFJElDXqf32X8FeC/V\nqP6RJGPrZQQsvvZ+OnBskrfV9+CfQzUh77y6zJ3AZVQz83dMMhE4E7i0nolPXfZJ4Jwk2yR5G3As\n4Ex8SZIGqNOR/Qfr16vb0k8EptTvPwcMB74CjARuBt5USnm8pfx+wJdYMmv/h8CHezJLKY8mmVzX\nMYNqtv8XgNM6bKckSWrT6X32K3yCXT3ynsKS4N9bmb8C/7aCem4HXtdJuyRJ0or5bHxJkhrOYC9J\nUsMZ7CVJajiDvSRJDWewlySp4Qz2kiQ1nMFekqSGM9hLktRwBntJkhrOYC9JUsMZ7CVJajiDvSRJ\nDWewlySp4Qz2kiQ1nMFekqSGM9hLktRwBntJkhrOYC9JUsMZ7CVJajiDvSRJDWewlySp4Qz2kiQ1\nnMFekqSGM9hLktRwBntJkhrOYC9JUsMZ7CVJajiDvSRJDddRsE/yuiQ/TPJgkpLkwLb80sfylZYy\n5/SSf1NbPWsl+XKSeUmeqPe5WVeOVJKkIarTkf0I4A7gcGBBL/kbty1vrdO/01buqrZyb27LPx3Y\nB3gXsAuwPnBpkjU6bKckSWqzZieFSik/Bn4M1Qi9l/xZretJ9gbuKaVc21Z0YXvZlm02AA4C3ltK\nubJOew/wALAbcHknbZUkSUvr+jX7JCOAfYGzesneOcmcJPckOSvJmJa8VwPDgCt6EkopfwLuBHbq\ndjslSRoqVsUEvXcDawHfbEu/DNgfeCNwFLAD8NMka9X5Y4FFwLy27WbXec
tIckiSGUlmzJ07t0vN\nlySpWTo6jd9PBwPfL6UsFX1LKRe0rN6e5BaqU/R7Ad9bTn0BSm8ZpZSpwFSACRMm9FpmoHY8wasG\nkqRm6OrIPsl2wAR6P4W/lFLKQ8CfgS3rpFnAGsCotqJjqEb3kiRpALp9Gv8QYCbVrPvlSjIK2BR4\nuE66BXgGmNxSZjPg5cD0LrdTkqQho6PT+PWku5fUqy8ANq9H8X8tpfyxLrMOsB/wuVJK6WX7KcBF\nVMF9HHAyMAe4GKCU8miSbwCfTzIH+AtwGnAbHXx5kCRJvet0ZD8B+HW9DAdOrN+f1FLmncC6wNm9\nbL8I2Bb4AXAP1eS9u4GJpZTHW8p9hOr6/beBG4D5wFtLKYs6bKckSWrT6X3211BNlFtembPpPdBT\nSlkA7N7Bfp4CDqsXSZLUBT4bX5KkhjPYS5LUcAZ7SZIazmAvSVLDGewlSWo4g70kSQ1nsJckqeEM\n9pIkNZzBXpKkhjPYS5LUcAZ7SZIazmAvSVLDGewlSWo4g70kSQ1nsJckqeEM9pIkNZzBXpKkhjPY\nS5LUcAZ7SZIazmAvSVLDGewlSWo4g70kSQ1nsJckqeEM9pIkNZzBXpKkhjPYS5LUcAZ7SZIazmAv\nSVLDdRTsk7wuyQ+TPJikJDmwLf+cOr11uamtzFpJvpxkXpIn6vo2ayuzeZJL6vx5Sb6U5O9W+igl\nSRrCOh3ZjwDuAA4HFvRR5ipg45blzW35pwP7AO8CdgHWBy5NsgZA/fojYL06/13A24EvdNhGSZLU\nizU7KVRK+THwY6hG8X0UW1hKmdVbRpINgIOA95ZSrqzT3gM8AOwGXA68Cdga2KKU8qe6zDHA15N8\nopTyWKcHJUmSlujmNfudk8xJck+Ss5KMacl7NTAMuKInoQ7odwI71UkTgTt7An3tcmCtentJkjQA\n3Qr2lwH7A28EjgJ2AH6aZK06fyywCJjXtt3sOq+nzOy2/Hn1dmPpRZJDksxIMmPu3LkrfRCSJDVR\nR6fxV6SUckHL6u1JbqE6Rb8X8L3lbBqgtFbV1y762O9UYCrAhAkT+tpWkqQhbZXceldKeQj4M7Bl\nnTQLWAMY1VZ0DEtG87NYdgQ/qt6ufcQvSZI6tEqCfZJRwKbAw3XSLcAzwOSWMpsBLwem10k3Ai9v\nux1vMrCw3l6SJA1AR6fxk4wAXlKvvgDYPMl2wF/rZQpwEVVwHwecDMwBLgYopTya5BvA55PMAf4C\nnAbcRnXLHlST934LnJvkKOCFwOeBs5yJL0nSwHU6sp8A/LpehgMn1u9PoppAty3wA+Ae4JvA3cDE\nUsrjLXV8hOr6/beBG4D5wFtLKYsA6te9gCfr/G/X5Y8e+OFJkqRO77O/hmoyXV9276COp4DD6qWv\nMn8E3tJJmyRJUmd8Nr4kSQ1nsJckqeEM9pIkNZzBXpKkhjPYS5LUcAZ7SZIazmAvSVLDGewlSWo4\ng70kSQ1nsJckqeEM9pIkNZzBXpKkhjPYS5LUcAZ7SZIazmAvSVLDGewlSWo4g70kSQ1nsJckqeEM\n9pIkNZzBXpKkhjPYS5LUcAZ7SZIazmAvSVLDGewlSWo4g70kSQ1nsJckqeEM9pIkNZzBXpKkhjPY\nS5LUcGt2UijJ64CjgVcDmwDvLaWcU+cNAz4N7An8A/AY8DPg2FLKH1vquAZ4fVvV3y6l7NtSZiTw\nJeCf6qQfAoeVUv6vvwc2FOx4wuXLzb+ow3I9bjpx95VskSRpddTpyH4EcAdwOLCgLW8dYHvgM/Xr\n3sCLgMuStH+ZOBvYuGU5tC3/vLqOPYE96vfTOmyjJEnqRUcj+1LKj4EfAyQ5py3vUWBya1qSQ4Hf\nAi8Hbm/JerKUMqu3fSR5OVWA37mUMr2lnuuTvLSUcncnbZUkSUtbVdfs169fH2lL3zfJvCS/TXJq\nkvVa8iYC84HpLWk3AE8AO/W2kySHJJ
mRZMbcuXO71XZJkhqlo5F9fyT5O+ALwCWllD+3ZJ0HPAA8\nBGwNnAy8kiVnBcYCc0sppWeDUkpJMqfOW0YpZSowFWDChAmltzKSJA11XQ329TX6/wU2ZMkkO2Bx\nYO5xe5I/ADcn2b6U8queYr1V20e6JEnqQNdO49eB/nzgFcAbSyl/WcEmM4BFwJb1+ixgTJK01Blg\nNDC7W+2UJGmo6Uqwr2+/+zZVoN+1r0l4bbYF1gAertdvpJr1P7GlzERgXZa+ji9Jkvqh0/vsRwAv\nqVdfAGyeZDvgr1TX4C8EXgO8FShJeq6xP1pKWZDkH4D9qGb0zwP+keq6/q+pJuFRSrkzyWXAmUkO\npjp9fyZwqTPxJUkauE5H9hOoAvOvgeHAifX7k4DNqO6t3wS4hWqk3rO8s97+aeCNwOXA3VQPzrkC\n2K2UsqhlP/sBv6nzLq/fv2dghyZJkqDz++yvoRpp92V5eZRS/sSyT8/rrdxfgX/rpE2SJKkzPhtf\nkqSGM9hLktRwBntJkhrOYC9JUsMZ7CVJajiDvSRJDWewlySp4Qz2kiQ1nMFekqSGM9hLktRwBntJ\nkhrOYC9JUsMZ7CVJajiDvSRJDWewlySp4Qz2kiQ1nMFekqSGM9hLktRwBntJkhrOYC9JUsMZ7CVJ\najiDvSRJDWewlySp4Qz2kiQ1nMFekqSGM9hLktRwBntJkhrOYC9JUsN1FOyTvC7JD5M8mKQkObAt\nP0mmJHkoyYIk1yTZuq3MyCTTkjxaL9OSbNhWZtsk19Z1PJjk+CRZ6aOUJGkI63RkPwK4AzgcWNBL\n/jHAUcBhwGuAOcCVSdZrKXMesD2wJ7BH/X5aT2aS9YErgdl1Hf8BfBQ4svPDkSRJ7dbspFAp5cfA\njwGSnNOaV4+8jwA+W0q5qE47gCrgvxs4M8nLqQL8zqWU6XWZQ4Hrk7y0lHI3sB+wDnBAKWUBcEe9\n3ZFJTiullJU+WkmShqBuXLN/MTAWuKInoQ7W1wE71UkTgfnA9JbtbgCeaCtzfb1tj8uBTYBxXWin\nJElDUjeC/dj6dXZb+uyWvLHA3NbRef1+TluZ3upo3cdSkhySZEaSGXPnzh1g8yVJarZuzsZvP82e\ntrTeTsOvqEz6SK8SS5laSplQSpkwevTo/rRVkqQhoxvBflb92j76HsOSkfksYEzrzPr6/ei2Mr3V\nAcuO+CVJUoe6EezvpwrUk3sSkqwN7MKSa/Q3Us3on9iy3URg3bYyu9Tb9pgMPATM7EI7JUkakjq9\nz35Eku2SbFdvs3m9vnl97f104Ngkb0uyDXAO1YS88wBKKXcCl1HNzN8xyUTgTODSeiY+ddkngXOS\nbJPkbcCxgDPxJUlaCZ2O7CcAv66X4cCJ9fuT6vzPAacBXwFmABsDbyqlPN5Sx37Ab6hm7V9ev39P\nT2Yp5VGqkfwmdR1fAb5Q1ytJkgao0/vsr2HJZLne8gswpV76KvNX4N9WsJ/bgdd10iZJktQZn40v\nSVLDGewlSWo4g70kSQ1nsJckqeEM9pIkNZzBXpKkhjPYS5LUcAZ7SZIazmAvSVLDGewlSWo4g70k\nSQ1nsJckqeEM9pIkNZzBXpKkhjPYS5LUcAZ7aRVJsszyta99baky3/nOd9huu+1YZ5112GKLLfj8\n5z/fcf2lFPbYYw+S8N3vfndx+syZMznooIMYP348w4cPZ/z48Rx33HEsWLCga8cm6fllzcFugNRk\nZ511Fm95y1sWr2+wwQaL3//kJz/h3e9+N1/60pfYY489uPPOOzn44IMZPnw4H/7wh1dY9xe+8AXW\nWGONZdLvuusuFi1axP/8z/+w5ZZbcuedd3LIIYfwl7/8halTp3bnwCQ9rziy13Ni0qRJfOADH+Co\no45io402YvTo0ZxxxhksXLiQD33oQ2y44YZsvvnmTJs2bantHnzwQfbdd19GjhzJyJEj2Wuvvbj3\n3n
sX5993333svffejB07lnXXXZftt9+eSy+9dKk6xo0bx6c//WkOPfRQ1l9/fTbbbLN+jaBXxoYb\nbsjYsWMXL8OHD1+cN23aNN761rfywQ9+kPHjx7PXXntx3HHHccopp1BKWW69M2bM4IwzzuDss89e\nJm+PPfbgnHPOYffdd19c7yc+8Qkuuuiirh+fpOcHg72eM9/61rdYb731uPnmmzn22GM54ogj+Od/\n/me22morZsyYwQEHHMD73vc+HnroIQCefPJJdt11V9Zee22uvfZabrzxRjbeeGN22203nnzySQDm\nz5/PnnvuyZVXXslvfvMb9tlnH972trdx1113LbXvL37xi2y77bb86le/4mMf+xjHHHMMN954Y59t\nvf766xkxYsRyl//6r/9a4TEffvjhjBo1ite85jV87Wtf49lnn12ct3DhQtZee+2lyg8fPpw///nP\nPPDAA33W+fjjj/Oud72LM888kzFjxqywDQCPPfYYI0eO7KispObxNL6eM1tvvTVTpkwB4Mgjj+Sz\nn/0sw4YN4/DDDwfg+OOP55RTTmH69Om8/e1v54ILLqCUwtlnn00SgMUB7tJLL+Ud73gHr3zlK3nl\nK1+5eB+f+MQnuOSSS/jud7/LJz/5ycXpb3rTmxafGj/ssMP40pe+xNVXX83EiRN7beuECRO49dZb\nl3s8G2200XLzTzrpJHbddVdGjBjB1VdfzVFHHcW8efMWt2v33Xfn8MMP54orrmC33Xbj97//PV/4\nwhcAePjhhxk3blyv9b7//e9njz324M1vfvNy99/jj3/8I6eeeiof//jHOyovqXkM9nrOvOIVr1j8\nPgljxoxh2223XZw2bNgwRo4cyZw5cwC45ZZbuP/++1lvvfWWqufJJ5/kvvvuA+CJJ57gxBNP5NJL\nL+Xhhx/mmWee4amnnlpqX+37Bthkk00W76c3w4cP5yUvecnADrT2qU99avH77bbbjkWLFvGZz3xm\ncbA/+OCDF1+GeOaZZ1h//fU5/PDDmTJlSq/X4qE69f+b3/yGGTNmdNSG2bNns/vuuzN58mQ+8pGP\nrNTxSHr+8jS+njPDhg1baj1Jr2k9p7qfffZZtttuO2699dallnvuuYdDDz0UgKOPPpoLL7yQ//zP\n/+Taa6/l1ltvZYcdduDpp59e4b5bT6m369Zp/Favfe1reeyxx5g9e/biNpxyyinMnz+fBx54gFmz\nZrHDDjsA9Dmqv/rqq/nd737HiBEjWHPNNVlzzer7+jvf+U523nnnpcrOmjWLXXfdlW222YZp06Yt\nPjsiaehxZK/V1vbbb8/555/PqFGj2HDDDXst8/Of/5z999+fffbZB4CnnnqK++67j6222mql9t2N\n0/jtbr31VtZee+1ljmWNNdZhGoCyAAASB0lEQVRg0003BeD8889n4sSJfV6L/8xnPsPRRx+9VNq2\n227Lqaeeyt5777047eGHH2bXXXdl66235vzzz1/8pUDS0ORfAK229ttvv8VB7KSTTmLzzTfnT3/6\nEz/4wQ94//vfz5ZbbslWW23FxRdfzN57782wYcM48cQTeeqpp1Z63yt7Gv+SSy5h1qxZTJw4keHD\nh/Ozn/2M448/nkMOOYS11loLgHnz5nHhhRcyadIkFi5cyNlnn82FF17Itddeu7ieX/ziF+y///6c\ne+657LDDDmy66aaLvxi0etGLXsT48eMBeOihh5g0aRKbbLIJp59+OvPmzVtcbvTo0X1eIpDUXAZ7\nLbbjCZd3tb6bTtx9pbZfZ511uO666zj22GP513/9Vx599FE22WQTdt1118Uzy0877TQOOuggdtll\nF0aOHMkRRxzRlWC/soYNG8ZXv/pVjjzySJ599lnGjx/PSSedxIc+9KGlyp177rl89KMfpZTCxIkT\nueaaaxafyodqfsLdd9+9+O6DTlxxxRXce++93HvvvWy++eZL5d1/
//19XiKQ1FxZ0f28zxcTJkwo\nnU5a6kS3A99guOjr7wNgn/d9fVD2v7LBXpLUtyS3lFImdFLWCXqSJDWcwV6SpIbrSrBPMjNJ6WX5\nUZ0/pZe8WW11pC73UJIFSa5JsnU32idJ0lDWrZH9a4CNW5btgQJ8p6XM3W1ltm2r4xjgKOCwur45\nwJVJ1kOSJA1YV2bjl1Lmtq4nOQh4DLiwJflvpZSlRvMt5QMcAXy2lHJRnXYAVcB/N3BmN9opSdJQ\n1PVr9nXgPgj431JK6/1C45M8mOT+JBckGd+S92JgLHBFT0IpZQFwHbBTt9soSdJQsiom6E2mCt6t\n93vdDBwI7AkcTBXYpyd5YZ0/tn6d3VbX7Ja8ZSQ5JMmMJDPmzp3bVzFJkoa0VRHsDwZ+WUpZ/KzR\nUspPSinfKaXcVkq5CnhLve8D2rZtv+k/vaQtKVzK1FLKhFLKhNGjR3ep+ZIkNUtXg32SMcDewFnL\nK1dKmQ/8FtiyTuq5lt8+ih/DsqN9SZLUD90e2R8ILAQuWF6hJGsDLwMerpPupwr4k9vK7AJM73Ib\nJUkaUrr2bPx6Yt77gAtKKY+35Z0KXAL8kWq0/ilgXeCbAKWUkuR04BNJ7gLuAT4JzAfO61YbJUka\nirr5QziTqE7L/1sveZsB5wOjgLnATcCOpZQHWsp8DhgOfAUYSTWp703tXxwkSVL/dC3Yl1J+RjWh\nrre8fTvYvgBT6kWSJHWJz8aXJKnhDPaSJDWcwV6SpIYz2EuS1HAGe0mSGs5gL0lSwxnsJUlqOIO9\nJEkNZ7CXJKnhDPaSJDWcwV6SpIYz2EuS1HAGe0mSGs5gL0lSwxnsJUlqOIO9JEkNZ7CXJKnhDPaS\nJDWcwV6SpIYz2EuS1HAGe0mSGs5gL0lSwxnsJUlqOIO9JEkNZ7CXJKnhDPaSJDWcwV6SpIYz2EuS\n1HBdCfZJpiQpbcuslvzUZR5KsiDJNUm2bqtjZJJpSR6tl2lJNuxG+yRJGsq6ObK/G9i4Zdm2Je8Y\n4CjgMOA1wBzgyiTrtZQ5D9ge2BPYo34/rYvtkyRpSFqzi3X9rZQyqz0xSYAjgM+WUi6q0w6gCvjv\nBs5M8nKqAL9zKWV6XeZQ4PokLy2l3N3FdkqSNKR0c2Q/PsmDSe5PckGS8XX6i4GxwBU9BUspC4Dr\ngJ3qpInAfGB6S303AE+0lJEkSQPQrWB/M3Ag1Sn4g6mC+/QkL6zfA8xu22Z2S95YYG4ppfRk1u/n\ntJRZRpJDksxIMmPu3LndOA5JkhqnK6fxSyk/aV1PchPwB+AA4KaeYm2bpS2tPb+3Mu37nQpMBZgw\nYUKf5SRJGspWya13pZT5wG+BLYGe6/jtI/QxLBntzwLG1Nf3gcXX+kez7BkBSZLUD6sk2CdZG3gZ\n8DBwP1Uwn9yWvwtLrtHfCIygunbfYyKwLktfx5ckSf3UldP4SU4FLgH+SDVi/xRVoP5mKaUkOR34\nRJK7gHuAT1JNyDsPoJRyZ5LLqGbmH0x1+v5M4FJn4kuStHK6devdZsD5wChgLtV1+h1LKQ/U+Z8D\nhgNfAUZSTeh7Uynl8ZY69gO+xJJZ+z8EPtyl9kmSNGR1a4LevivIL8CUeumrzF+Bf+tGe7R62PGE\ny7ta300n7t7V+iRpqPDZ+JIkNZzBXpKkhjPYS5LUcAZ7SZIazmAvSVLDGewlSWo4g70kSQ1nsJck\nqeEM9pIkNZzBXpKkhjPYS5LUcAZ7SZIazmAvSVLDGewlSWo4g70kSQ1nsJckqeEM9pIkNZzBXpKk\nhjPYS5LUcAZ7SZIazmAvSVLDGewlSWo4g70kSQ1nsJckqeEM9pIkNZzBXpKkhjPYS5LUcAZ7SZIa\nbs3BboDUqR1PuLzrdd504u5dr1OSVjddGdknOS7JL5M8lmRukkuSbNNW5pwkpW25qa3MWkm+nGRe\nkieS/DDJZt1ooyRJQ1W3TuNP
Ar4K7AS8AfgbcFWSjdrKXQVs3LK8uS3/dGAf4F3ALsD6wKVJ1uhS\nOyVJGnK6chq/lLLUudAk7wEeBf4fcElL1sJSyqze6kiyAXAQ8N5SypUt9TwA7AZ0/xyuJElDwKqa\noLdeXfcjbek7J5mT5J4kZyUZ05L3amAYcEVPQinlT8CdVGcMJEnSAKyqYH8GcCtwY0vaZcD+wBuB\no4AdgJ8mWavOHwssAua11TW7zltGkkOSzEgyY+7cuV1sviRJzdH12fhJTgN2BnYupSzqSS+lXNBS\n7PYkt1Cdot8L+N7yqgRKbxmllKnAVIAJEyb0WkaSpKGuqyP7JF+kmlz3hlLKH5ZXtpTyEPBnYMs6\naRawBjCqregYqtG9JEkagK4F+yRnAO+mCvR3dVB+FLAp8HCddAvwDDC5pcxmwMuB6d1qpyRJQ01X\nTuMn+QrwHuCfgUeS9Fxjn19KmZ9kBDAFuIgquI8DTgbmABcDlFIeTfIN4PNJ5gB/AU4DbqO6ZU+S\nJA1At67Zf7B+vbot/USqIL8I2JZqgt6GVAH/Z8A7SimPt5T/CNU9+t8Ghtf17d967V+SJPVPt+6z\nzwryFwArfC5pKeUp4LB6kSRJXeAP4UiS1HAGe0mSGs5gL0lSwxnsJUlqOH/PXkPajid09/eVbjpx\nhfNQJek558hekqSGM9hLktRwBntJkhrOYC9JUsMZ7CVJajiDvSRJDWewlySp4Qz2kiQ1nMFekqSG\nM9hLktRwPi5X6iIfvytpdeTIXpKkhjPYS5LUcAZ7SZIazmAvSVLDGewlSWo4g70kSQ3nrXfSaqzb\nt/KBt/NJQ5Eje0mSGs6RvTTE+OAfaehxZC9JUsMZ7CVJajiDvSRJDbdaBvskH0xyf5KnktySZJfB\nbpMkSc9Xq90EvSTvBM4APgj8vH79SZJ/LKX8cVAbJ2kZq+L2wG5yAqG0GgZ74EjgnFLKWfX6YUn2\nAD4AHDd4zZL0fOSzCqTVLNgn+Tvg1cCpbVlXADs99y2SpFXv+XA75POhjepbSimD3YbFkmwCPAi8\nvpRyXUv68cB+pZSXtpU/BDikXn0pcHcXmzMKmNfF+oYy+7K77M/usS+7y/7snk76cotSyuhOKlut\nRvYt2r+BpJc0SilTgamrogFJZpRSJqyKuoca+7K77M/usS+7y/7snm735eo2G38esAgY25Y+Bpj9\n3DdHkqTnv9Uq2JdSngZuASa3ZU0Gpj/3LZIk6flvdTyNfxowLckvgBuA9wObAF97jtuxSi4PDFH2\nZXfZn91jX3aX/dk9Xe3L1WqCXo8kHwSOATYG7gA+0jphT5IkdW61DPaSJKl7Vqtr9pIkqfsM9m18\nLn9nkrwuyQ+TPJikJDmwLT9JpiR5KMmCJNck2bqtzMgk05I8Wi/Tkmz4nB7IaiDJcUl+meSxJHOT\nXJJkm7Yy9mcHknwoyW11Xz6W5MYke7Xk248DlOTj9f/1/25Jsz87VPdTaVtmteSv0r402LdoeS7/\nfwGvoroD4CdJNh/Uhq2eRlDNpzgcWNBL/jHAUcBhwGuAOcCVSdZrKXMesD2wJ7BH/X7aKmzz6moS\n8FWqp0S+AfgbcFWSjVrK2J+d+TPwMapjnwD8FPh+klfU+fbjACTZETgYuK0ty/7sn7up5qL1LNu2\n5K3aviyluNQLcDNwVlvavcDJg9221XkB5gMHtqwHeBj4REvacOBx4NB6/eVUD0r6fy1ldq7TXjrY\nxzTI/TmC6nkTb7U/u9KffwUOtR8H3H8bAPdRfRG9BvjvOt3+7F8/TgHu6CNvlfelI/tay3P5r2jL\n8rn8/fdiqgcjLe7LUsoC4DqW9OVEqi8Jrc9PuAF4Avt7Paqzbo/U6/bnACRZI8m+VF+epmM/DtRU\n4LullJ+2pduf/Te+vvR5f5ILkoyv01d5XxrslxgFrMGyT+qbzbJP9NPy9fTX8vpyLDC31F9PAe
r3\nc7C/zwBuBW6s1+3PfkiybZL5wEKq53P8SynlduzHfktyMPAS4FO9ZNuf/XMzcCDVKfiDqY5/epIX\n8hz05er4UJ3B1tFz+dWRFfVlb/06pPs7yWlUp+Z2LqUsasu2PztzN7AdsCGwD/DNJJNa8u3HDiR5\nKdX8pV1K9XTTvtifHSil/KR1PclNwB+AA4Cbeoq1bda1vnRkv4TP5e+enhmmy+vLWcCYJOnJrN+P\nZoj2d5IvAu8C3lBK+UNLlv3ZD6WUp0spvy+lzCilHEd1luQj2I/9NZHqjOcdSf6W5G/A64EP1u//\nUpezPweglDIf+C2wJc/BZ9NgXys+l7+b7qf6YC7uyyRrA7uwpC9vpLqWOrFlu4nAugzB/k5yBvBu\nqkB/V1u2/blyXgCshf3YX9+nmi2+XcsyA7igfn8P9ueA1X31MqqJeav+sznYMxRXpwV4J/A08D6q\nmY9nUE2I2GKw27a6LfWHrucPwJPA8fX7zev8jwGPAW8DtqH6A/EQsF5LHT8Bbgd2rD+0twOXDPax\nDUJffqXuqzdQfbPvWUa0lLE/O+vLz9Z/IMdRBaqTgWeBPe3HrvTvNdSz8e3PfvfdqVRnRl4MvBa4\ntO67LZ6Lvhz0DljdFuCDwEyqyT23AK8b7DatjgvVveGll+WcOj9Ut5o8DDwFXAts01bHRsD/1h/w\nx+r3Gw72sQ1CX/bWjwWY0lLG/uysL88BHqj//84BrgJ2tx+71r/twd7+7LzveoL308CDwEXAPz5X\nfemz8SVJajiv2UuS1HAGe0mSGs5gL0lSwxnsJUlqOIO9JEkNZ7CXJKnhDPaSBk2ScUlKkgmD3Rap\nyQz20hCQ5Jwklw71NkhDlcFekqSGM9hLQ1ySDZJMTTInyeNJrm09rZ7kwCTzk7wxyR1JnkjysyQv\nbqvnuCSz67LnJjkhycw6bwrVT3nuVZ+2L20/O7tFkiuTPJnkd0naf5BK0kow2EtDWP0TmT8CNgXe\nArwKuA74aZKNW4quBRwH/DvVD3BsCHytpZ59gROATwDbA3cCR7ZsfyrwHapn1W9cL62/1PUZ4EvA\nK4FfAhckGdGt45SGOoO9NLTtSvVrhW8vpfyiVL8D/yngD8B7WsqtCXyoLnMbVfDeNUnP35DDqX4E\n6eullHtKKScDN/dsXKrf7l4ALCylzKqXp1vq/2Ip5ZJSyr3Ax6l+8GO7VXTM0pBjsJeGtlcD6wBz\n69Pv85PMp/qJzX9oKbewlHJ3y/pDwDCqET5Uv8v9i7a6b6Zzt7XVDTCmH9tLWo41B7sBkgbVC4DZ\nVL8B3+6xlvd/a8vr+bnMF/SSNhDPLK6klFJdXXAwInWLwV4a2n4F/D3wbCnlDytRz13ADsDZLWk7\ntJV5GlhjJfYhaYAM9tLQsX6S9uvgvwduAH6Q5BiqoD0W2AO4qpRyfYd1nwGcneSXwPXAvwCvBR5p\nKTMT2DPJS4G/AI8O9EAk9Y/BXho6dgF+3ZZ2EfBm4NPAWVTXyWdTfQE4t9OKSykXJBkPfJZqDsD3\nqGbr791S7CxgEjADGEE1OXBm/w9DUn+llJW5zCZJvUtyMbBmKeWtg90WaahzZC9ppSVZB/gAcBnV\nZL59qEb1+wxmuyRVHNlLWmlJhgOXUD2UZzhwL/C5Usq3BrVhkgCDvSRJjed9rJIkNZzBXpKkhjPY\nS5LUcAZ7SZIazmAvSVLDGewlSWq4/x9rjq7V+SAh4gAAAABJRU5ErkJggg==\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x7f82161dde48>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Plot.\n",
"plt.rc('figure', figsize=(8,6))\n",
"plt.rc('font', size=14)\n",
"plt.rc('lines', linewidth=2)\n",
"plt.rc('axes', color_cycle=('#377eb8','#e41a1c','#4daf4a',\n",
" '#984ea3','#ff7f00','#ffff33'))\n",
"# Histogram.\n",
"plt.hist(lens, bins=20)\n",
"plt.hold(True)\n",
"# Average length.\n",
"avg_len = sum(lens) / float(len(lens))\n",
"plt.axvline(avg_len, color='#e41a1c')\n",
"plt.hold(False)\n",
"plt.title('Histogram of document lengths.')\n",
"plt.xlabel('Length')\n",
"plt.text(100, 800, 'mean = %.2f' % avg_len)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we want to initialize the similarity class with a corpus and a word2vec model (which provides the embeddings and the `wmdistance` method itself)."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from gensim.models import Word2Vec"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Train Word2Vec on all the restaurants.\n",
"model = Word2Vec(w2v_corpus, workers=3, size=100)\n",
"\n",
"# Initialize WmdSimilarity.\n",
"from gensim.similarities import WmdSimilarity\n",
"num_best = 10\n",
"instance = WmdSimilarity(wmd_corpus, model, num_best=10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `num_best` parameter decides how many results the queries return. Now let's try making a query. The output is a list of indeces and similarities of documents in the corpus, sorted by similarity.\n",
"\n",
"Note that the output format is slightly different when `num_best` is `None` (i.e. not assigned). In this case, you get an array of similarities, corresponding to each of the documents in the corpus.\n",
"\n",
"The query below is taken directly from one of the reviews in the corpus. Let's see if there are other reviews that are similar to this one."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Cell took 30.82 seconds to run.\n"
]
}
],
"source": [
"start = time()\n",
"\n",
"sent = 'Very good, you should seat outdoor.'\n",
"query = preprocess(sent)\n",
"\n",
"sims = instance[query] # A query is simply a \"look-up\" in the similarity class.\n",
"\n",
"print('Cell took %.2f seconds to run.' %(time() - start))"
]
},
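{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the two output formats concrete, here is a small, self-contained sketch. It does not call Gensim: `dense_sims` is a made-up list standing in for what a query returns when `num_best` is `None`, and `top_n` is a hypothetical helper showing how such an array maps to the sorted `(index, similarity)` pairs returned when `num_best` is set.\n",
"\n",
"```python\n",
"def top_n(dense_sims, n):\n",
"    # Pair each similarity with its document index.\n",
"    pairs = list(enumerate(dense_sims))\n",
"    # Sort by similarity, highest first, and keep the n best.\n",
"    pairs.sort(key=lambda pair: pair[1], reverse=True)\n",
"    return pairs[:n]\n",
"\n",
"# Hypothetical similarities for a five-document corpus.\n",
"dense_sims = [0.31, 0.60, 0.12, 0.55, 0.47]\n",
"best = top_n(dense_sims, 3)\n",
"for doc_index, sim in best:\n",
"    print('document %d: sim = %.2f' % (doc_index, sim))\n",
"```\n",
"\n",
"Each pair holds the document's position in the corpus first and its similarity second, which is why the printing code in this notebook reads `sims[i][0]` for the index and `sims[i][1]` for the score."
]
},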
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The query and the most similar documents, together with the similarities, are printed below. We see that the retrieved documents are discussing the same thing as the query, although using different words. The query talks about getting a seat \"outdoor\", while the results talk about sitting \"outside\", and one of them says the restaurant has a \"nice view\"."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Query:\n",
"Very good, you should seat outdoor.\n",
"\n",
"sim = 0.5969\n",
"It's a great place if you can sit outside in good weather.\n",
"\n",
"sim = 0.5612\n",
"The steak was good. Prices reasonable for the strip and it was a great view with the outdoor seating.\n",
"\n",
"sim = 0.5497\n",
"Best seat in the house with view of water fountain, good wine, good food n good service.\n",
"\n",
"sim = 0.5478\n",
"Sat outside under heat lamps. Good service and good food. Wonderful place\n",
"\n",
"sim = 0.5466\n",
"Lovely restaurant. Food was very good. Drinks very good. View of Bellagio fountains...amazing.\n",
"\n",
"sim = 0.5428\n",
"Good value restaurant on strip! \n",
"Great view take outside seat good food!\n",
"However, be sure you make reservation!\n",
"\n",
"sim = 0.5393\n",
"Very good salmon\n",
"Nice ambience\n",
"Nice view \n",
"Good service\n",
"\n",
"sim = 0.5374\n",
"sit on the patio and people watch.. great time.\n",
"\n",
"sim = 0.5371\n",
"i always bring my visitors here. good food with a good view of the bellagio fountain.\n",
"\n",
"sim = 0.5346\n",
"Good place for dinner with the beautiful view. Outside patio best view of the Trip. Hang out dinner with good kids. My nephews joining us too. They have good foods.\n"
]
}
],
"source": [
"# Print the query and the retrieved documents, together with their similarities.\n",
"print('Query:')\n",
"print(sent)\n",
"for i in range(num_best):\n",
" print()\n",
" print('sim = %.4f' % sims[i][1])\n",
" print(documents[sims[i][0]])"
]
},
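{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `sim` values can be related back to distances with a short sketch. It assumes the similarity is computed as `1 / (1 + distance)`, which is my reading of how `WmdSimilarity` converts WMD distances and is not stated in this tutorial; both function names are invented for the example.\n",
"\n",
"```python\n",
"def similarity_from_distance(distance):\n",
"    # Assumed mapping: distance 0 gives similarity 1, larger distances approach 0.\n",
"    return 1.0 / (1.0 + distance)\n",
"\n",
"def distance_from_similarity(similarity):\n",
"    # Inverse of the assumed mapping above.\n",
"    return 1.0 / similarity - 1.0\n",
"\n",
"# Similarity of the top hit in the output above.\n",
"print('%.4f' % distance_from_similarity(0.5969))\n",
"```\n",
"\n",
"Under this assumption, the top hit's similarity of 0.5969 corresponds to a WMD of about 0.68, and the narrow spread of similarities among the ten results reflects a similarly narrow spread of distances."
]
},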
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try a different query, also taken directly from one of the reviews in the corpus."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Query:\n",
"I felt that the prices were extremely reasonable for the Strip\n",
"\n",
"sim = 0.5459\n",
"The steak was good. Prices reasonable for the strip and it was a great view with the outdoor seating.\n",
"\n",
"sim = 0.5367\n",
"Incredible restaurant on the strip! Very reasonable prices, outstanding service, an breathtaking views. Bar none, my favorite meal on the Strip.\n",
"\n",
"sim = 0.5354\n",
"don't let the tourist location throw you. terrific French food on the strip without the strip prices.\n",
"\n",
"sim = 0.5337\n",
"Great breakfast (wonderful house-made English muffins!), and fair prices for the Strip.\n",
"\n",
"sim = 0.5302\n",
"Good food, great atmosphere, reasonable prices. Right in the middle of the Strip. Nothing not to like here.\n",
"\n",
"sim = 0.5286\n",
"Very delicious breakfast. The food quality is great, location is prime and comparably reasonable prices!\n",
"\n",
"sim = 0.5218\n",
"Great value on the strip and good quality food.\n",
"\n",
"sim = 0.5207\n",
"Prompt and polite service, outstanding Bananas Foster Waffles, and reasonable prices. Definitely a favorite go-to for breakfast on the Strip!\n",
"\n",
"sim = 0.5190\n",
"Really good food at decent prices (for being on the strip). Not a traditional steakhouse but just as good as many of them. Sitting out on the strip is very nice at nighttime.\n",
"\n",
"sim = 0.5149\n",
"Really good food. Good service our waiter was friendly and had a good sense of humour. Service was really fast considering they were packed. Prices are reasonable especially for the quality of food and for being on the Strip.\n",
"\n",
"Cell took 39.25 seconds to run.\n"
]
}
],
"source": [
"start = time()\n",
"\n",
"sent = 'I felt that the prices were extremely reasonable for the Strip'\n",
"query = preprocess(sent)\n",
"\n",
"sims = instance[query] # A query is simply a \"look-up\" in the similarity class.\n",
"\n",
"print('Query:')\n",
"print(sent)\n",
"for i in range(num_best):\n",
" print()\n",
" print('sim = %.4f' % sims[i][1])\n",
" print(documents[sims[i][0]])\n",
"\n",
"print('\\nCell took %.2f seconds to run.' %(time() - start))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This time around, the results are more straight forward; the retrieved documents basically contain the same words as the query.\n",
"\n",
"`WmdSimilarity` normalizes the word embeddings by default (using `init_sims()`, as explained before), but you can overwrite this behaviour by calling `WmdSimilarity` with `normalize_w2v_and_replace=False`."
]
},
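{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of what normalizing the embeddings means, the sketch below rescales vectors to unit Euclidean length in plain Python. The vectors and the `unit_length` helper are invented for this example; it only mirrors the idea of the normalization step and is not Gensim's implementation.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def unit_length(vector):\n",
"    # Divide every component by the vector's Euclidean (L2) norm.\n",
"    norm = math.sqrt(sum(x * x for x in vector))\n",
"    return [x / norm for x in vector]\n",
"\n",
"# Two made-up embeddings with the same direction but different magnitudes.\n",
"v1 = unit_length([3.0, 4.0])\n",
"v2 = unit_length([30.0, 40.0])\n",
"print(v1, v2)\n",
"```\n",
"\n",
"After normalization the two vectors coincide, so only the direction of an embedding influences the distances between words. Whether that helps depends on the data, which is presumably why the behaviour can be switched off."
]
},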
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Notebook took 245.48 seconds to run.\n"
]
}
],
"source": [
"print('Notebook took %.2f seconds to run.' %(time() - start_nb))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"\n",
"1. Ofir Pele and Michael Werman, *A linear time histogram metric for improved SIFT matching*, 2008.\n",
"* Ofir Pele and Michael Werman, *Fast and robust earth mover's distances*, 2009.\n",
"* Matt Kusner et al. *From Embeddings To Document Distances*, 2015.\n",
"* Thomas Mikolov et al. *Efficient Estimation of Word Representations in Vector Space*, 2013."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}