benvandyke/yelptextmining

## yelptextmining
{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Text Mining Yelp Reviews to Find Similar Businesses"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Using the scikit-learn TfidfVectorizer class and the Yelp Academic Dataset.\n",
      "\n",
      "Ben Van Dyke, December 2013\n",
      "\n",
      "[btvandyke@gmail.com](mailto:btvandyke@gmail.com)"
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Implementation"
     ]
    },
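    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# __future__ imports must come before any other statement\n",
      "from __future__ import print_function\n",
      "import json\n",
      "import numpy as np  # used below for np.shape and np.argsort\n",
      "from sklearn.feature_extraction.text import TfidfVectorizer"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 26
    },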
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# file paths to data\n",
      "win_bus_path = '../../../Documents/yelp_phoenix_academic_dataset/yelp_academic_dataset_business.json'\n",
      "win_rev_path = '../../../Documents/yelp_phoenix_academic_dataset/yelp_academic_dataset_review.json'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 3
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# load the business data\n",
      "businesses = []\n",
      "with open(win_bus_path) as f:\n",
      "    for line in f:\n",
      "        businesses.append(json.loads(line))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 4
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# example business\n",
      "businesses[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 19,
       "text": [
        "{u'business_id': u'rncjoVoEFUJGCUoC1JgnUA',\n",
        " u'categories': [u'Accountants',\n",
        "  u'Professional Services',\n",
        "  u'Tax Services',\n",
        "  u'Financial Services'],\n",
        " u'city': u'Peoria',\n",
        " u'full_address': u'8466 W Peoria Ave\\nSte 6\\nPeoria, AZ 85345',\n",
        " u'latitude': 33.581867,\n",
        " u'longitude': -112.241596,\n",
        " u'name': u'Peoria Income Tax Service',\n",
        " u'neighborhoods': [],\n",
        " u'open': True,\n",
        " u'review_count': 3,\n",
        " u'stars': 5.0,\n",
        " u'state': u'AZ',\n",
        " u'type': u'business'}"
       ]
      }
     ],
     "prompt_number": 19
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# read in reviews\n",
      "reviews = []\n",
      "with open(win_rev_path) as f:\n",
      "    for line in f:\n",
      "        reviews.append(json.loads(line))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 5
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# data size (run after both files are loaded)\n",
      "print('number of businesses: ', len(businesses))\n",
      "print('number of reviews: ', len(reviews))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "number of businesses:  11537\n",
        "number of reviews:  229907\n"
       ]
      }
     ],
     "prompt_number": 37
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# example review\n",
      "reviews[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 20,
       "text": [
        "{u'business_id': u'9yKzy9PApeiPPOUJEtnvkg',\n",
        " u'date': u'2011-01-26',\n",
        " u'review_id': u'fWKvX83p0-ka4JS3dc6E5A',\n",
        " u'stars': 5,\n",
        " u'text': u'My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.\\n\\nDo yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I\\'ve ever had.  I\\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.\\n\\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best \"toast\" I\\'ve ever had.\\n\\nAnyway, I can\\'t wait to go back!',\n",
        " u'type': u'review',\n",
        " u'user_id': u'rLtl8ZkDX5vH5nAx9C3q5Q',\n",
        " u'votes': {u'cool': 2, u'funny': 0, u'useful': 5}}"
       ]
      }
     ],
     "prompt_number": 20
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# combine all the reviews into one string\n",
      "# store in a dict with business id as the key\n",
      "business_ids = [business['business_id'] for business in businesses]\n",
      "business_dict = {business_id : ' ' for business_id in business_ids}\n",
      "\n",
      "for review in reviews:\n",
      "    business_dict[review['business_id']] = ' '.join(\n",
      "        [business_dict[review['business_id']], review['text']])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# labels for use in evaluation\n",
      "# list() keeps these indexable on Python 3, where keys()/values() return views\n",
      "bus_id = list(business_dict.keys())\n",
      "rev = list(business_dict.values())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# construct the tfidf matrix\n",
      "vectorizer = TfidfVectorizer(min_df=1)\n",
      "# %timeit discards assignments made inside the timed statement,\n",
      "# so time the call first and then fit once more to keep the result\n",
      "%timeit vectorizer.fit_transform(rev)\n",
      "tfidf = vectorizer.fit_transform(rev)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "1 loops, best of 3: 38.7 s per loop\n"
       ]
      }
     ],
     "prompt_number": 22
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# examine the resulting size\n",
      "print(np.shape(tfidf))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "(11537, 117024)\n"
       ]
      }
     ],
     "prompt_number": 31
    },
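    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The 11,537 businesses (one document of concatenated reviews each) yielded 117,024 terms after tokenizing. No stop words were removed, because `TfidfVectorizer` leaves `stop_words` unset by default."
     ]
    },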
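    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# calculate the cosine similarities; TfidfVectorizer L2-normalizes each row\n",
      "# by default, so the dot product of two rows is their cosine similarity\n",
      "%timeit tfidf * tfidf.T\n",
      "sim = tfidf * tfidf.T  # %timeit does not keep the result, so compute it again"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "1 loops, best of 3: 2min 15s per loop\n"
       ]
      }
     ],
     "prompt_number": 23
    },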
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For reference, I'm testing this on an older ThinkPad T61p with a Core 2 Duo running Ubuntu 12.04. The sklearn vectorizer runs much faster than manually creating a dense TFxIDF matrix in NumPy."
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Results"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Helper function that retrieves three similar businesses\n",
      "\n",
      "# dictionary of businesses for fast lookup\n",
      "business_dict2 = {business['business_id'] : business for business in businesses}\n",
      "\n",
      "def fetch_business(ind):\n",
      "    b = bus_id[ind]\n",
      "    # indices sorted by ascending similarity; the last entry is normally the business itself\n",
      "    neighbors = np.argsort(sim[ind].toarray()[0])\n",
      "    print(business_dict2[b]['name'])\n",
      "    print(business_dict2[b]['categories'])\n",
      "    print()\n",
      "    print('Similar businesses: ')\n",
      "\n",
      "    # walk backwards from the second-to-last index to skip the business itself\n",
      "    for n_ind in neighbors[-2:-5:-1]:\n",
      "        print(business_dict2[bus_id[n_ind]]['name'])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 27
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "fetch_business(8502)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Khai Hoan Restaurant\n",
        "[u'Vietnamese', u'Restaurants']\n",
        "\n",
        "Similar businesses: \n",
        "unPhogettable\n",
        "Pho Van\n",
        "BlueMoon Vietnamese Kitchen\n"
       ]
      }
     ],
     "prompt_number": 28
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "fetch_business(1000)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Accurate Auto Diagnostics\n",
        "[u'Auto Repair', u'Automotive']\n",
        "\n",
        "Similar businesses: \n",
        "Firestone Complete Auto Care Center\n",
        "Greg's Japanese Auto Parts & Service\n",
        "Mike Vinson Automotive\n"
       ]
      }
     ],
     "prompt_number": 29
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "fetch_business(2514)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Water and Ice Discount Superstores\n",
        "[u'Food', u'Shaved Ice']\n",
        "\n",
        "Similar businesses: \n",
        "Pink Spot\n",
        "Mary Coyle's Ol' Fashioned Ice Cream and Yogurt Parlor\n",
        "Sweet Republic\n"
       ]
      }
     ],
     "prompt_number": 30
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A quick look at the results of the fetch_business function seems to show that the technique is finding similar businesses. It might be interesting to combine the results with other data attributes such as review score and geographic location."
     ]
    }
   ],
   "metadata": {}
  }
 ]
}
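
The notebook above gets cosine similarities from a plain matrix product because each TF-IDF row is scaled to unit length. The following is a minimal pure-Python sketch of that idea, not the notebook's implementation: the toy documents and the helper names `tfidf_vectors`, `cosine`, and `most_similar` are hypothetical, the whitespace tokenizer is a simplification, and the weighting uses the smoothed IDF formula that scikit-learn documents for its default settings.

```python
import math
from collections import Counter

# Hypothetical stand-ins for the per-business review strings.
docs = [
    "pho noodles broth vietnamese",
    "vietnamese pho spring rolls",
    "oil change brakes tires",
    "brakes tires alignment mechanic",
]

def tfidf_vectors(docs):
    """Return one unit-length {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

def cosine(u, v):
    """Dot product of two unit-length vectors, i.e. their cosine similarity."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def most_similar(vectors, ind, k=3):
    """Indices of the k documents most similar to document `ind`, excluding itself."""
    scores = [(cosine(vectors[ind], v), j)
              for j, v in enumerate(vectors) if j != ind]
    return [j for _, j in sorted(scores, reverse=True)[:k]]

vectors = tfidf_vectors(docs)
print(most_similar(vectors, 0, k=1))  # prints [1]
```

The two Vietnamese-food documents share terms and the two auto-repair documents share terms, so each is the other's nearest neighbor, while documents with no terms in common score exactly zero. The sparse product in the notebook computes the same dot products for every pair of businesses at once.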