zhouzhuojie/Yelp Dataset

## Yelp Dataset


### Installation (Ubuntu OS, others similar)

```bash
sudo apt-get install mongodb
sudo pip install pymongo networkx # or sudo easy_install pymongo networkx
```

### Restore the data into MongoDB

```bash
mongorestore --db yelp ./dump/yelp
```
The data is restored into a database named `yelp`, which contains 5 collections:
```
businesses
chekins
reviews
tips
users
```
These collections hold all the metadata that can be used for estimation.
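
To check that the restore worked, one option is to count the documents in each collection. The helper below is only a sketch: `collection_counts` is a made-up name, and it assumes PyMongo 3.7 or later, where `list_collection_names()` and `count_documents()` are available.

```python
def collection_counts(db):
    """Map each collection name in `db` to its document count."""
    return {name: db[name].count_documents({})
            for name in sorted(db.list_collection_names())}
```

With a local server running, `collection_counts(pymongo.MongoClient().yelp)` should return one entry for each of the five collections listed above.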


### Restore the pickle graph
For the Yelp user graph, I have preprocessed all the users in the `users` collection and stored the whole connected, undirected graph as `yelp_users.pickle`. You can load the user graph like this:
```python
import pickle
# Open in binary mode; pickles are byte streams.
with open('./yelp_users.pickle', 'rb') as f:
    graph = pickle.load(f)
# Node labels are unicode strings, e.g. u'MtX0WZ4bqMfFuYvtupgRqg'
print(list(graph.nodes())[0])

# Graph info
# Number of nodes: 119839
# Number of edges: 954116
# Average degree:  15.9233
```
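
The README does not spell out how the pickle was built, so the following is only a guess at the general idea: treat every entry in a user's `friends` list (see the example document further down) as an undirected edge. The function names and the toy IDs are made up for illustration. The same block also shows where the average degree comes from: each undirected edge adds to two degrees, so the average is 2E / N.

```python
def friendship_edges(user_docs):
    """Collect undirected edges (as sorted ID pairs) from each user's friends list."""
    known = {doc['user_id'] for doc in user_docs}
    edges = set()
    for doc in user_docs:
        u = doc['user_id']
        for v in doc.get('friends', []):
            # Skip self-loops and friends that are not in the user set.
            if v != u and v in known:
                edges.add(tuple(sorted((u, v))))
    return edges


def average_degree(num_nodes, num_edges):
    """Each undirected edge contributes to two degrees, hence 2E / N."""
    return 2.0 * num_edges / num_nodes


# Toy documents with made-up IDs, only to show the shape of the input.
toy_users = [
    {'user_id': 'a', 'friends': ['b', 'c']},
    {'user_id': 'b', 'friends': ['a']},
    {'user_id': 'c', 'friends': ['a', 'x']},  # 'x' is not a known user
]
toy_edges = friendship_edges(toy_users)
print(sorted(toy_edges))
print(average_degree(len(toy_users), len(toy_edges)))

# The node and edge counts quoted above, plugged into the same formula.
print(round(average_degree(119839, 954116), 4))
```

The last line reproduces the 15.9233 average degree reported in the graph info.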


At this point you can run your program as before; the graph is a plain networkx graph. If you also need richer per-user features, read on.

### Rich info

Each user document has the following structure:
```json
{
  'type': 'user',
  'user_id': (unique user identifier),
  'name': (first name, last initial, like 'Matt J.'),
  'review_count': (review count),
  'average_stars': (floating point average, like 4.31),
  'votes': {
    'useful': (count of useful votes across all reviews),
    'funny': (count of funny votes across all reviews),
    'cool': (count of cool votes across all reviews)
  }
}
```

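For example, a stored document looks like this. Note that it also carries fields beyond the structure above, such as `yelping_since`, `friends`, `fans`, `compliments`, and `elite`: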
```json
{
   "_id":ObjectId("53e282f680183c24b7fb0dda"),
   "yelping_since":"2013-11",
   "votes":{
      "funny":0,
      "useful":2,
      "cool":0
   },
   "review_count":9,
   "name":"Clarinda",
   "user_id":"as22TLsZn_SwVv4oCjxdMg",
   "friends":[
      "ND6DMIKxM8Q1ShEMZuA5rA",
      "M-O0tasOl0SGiUsxdO5cZw",
      "1cDIfb6TSh71n99Ark5n3A",
      "NYhXvxMqbVsB7lcFrgNqow",
      "bFe-sUMDaDm2Q9u5Cve1tQ",
      "7P9okhRRYG0hz02Fk7tRMw",
      "enR0fiE0u_jfmW2x3aqxUA",
      "W4YsRCa1Xq4wrRRf53E5ZQ",
      "b89mmlWnUfrIzpuftP3cgQ",
      "nutDqAZ0fyOmz8yAqbCvFw",
      "BPbrH2VQASzR9Le6oC4DpQ"
   ],
   "fans":0,
   "average_stars":4,
   "type":"user",
   "compliments":{
      "plain":1
   },
   "elite":[
   ]
}
```
With these documents, you can sample users and estimate the averages of `average_stars`, `review_count`, `votes.funny`, `votes.useful`, `votes.cool`, and so on.
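
As a rough sketch of that kind of estimate, the helper below averages those fields over a list of user documents. The names `user_features` and `sample_means` are made up for illustration. The first sample document reuses the values from the example above, and the second one is invented. It computes a plain arithmetic mean, which assumes the users were drawn uniformly at random.

```python
def user_features(doc):
    """Flatten one user document into the numeric fields of interest."""
    votes = doc.get('votes', {})
    return {
        'average_stars': float(doc['average_stars']),
        'review_count': float(doc['review_count']),
        'votes.funny': float(votes.get('funny', 0)),
        'votes.useful': float(votes.get('useful', 0)),
        'votes.cool': float(votes.get('cool', 0)),
    }


def sample_means(docs):
    """Return the arithmetic mean of each feature over the sampled documents."""
    rows = [user_features(d) for d in docs]
    if not rows:
        raise ValueError('need at least one user document')
    return {key: sum(r[key] for r in rows) / len(rows) for key in rows[0]}


# Hand-written documents in the same shape as the stored ones.
sample = [
    {'average_stars': 4, 'review_count': 9,
     'votes': {'funny': 0, 'useful': 2, 'cool': 0}},
    {'average_stars': 3.5, 'review_count': 21,       # invented values
     'votes': {'funny': 4, 'useful': 6, 'cool': 2}},
]
means = sample_means(sample)
print(means['average_stars'], means['review_count'])
```

In practice `sample` would be replaced by documents fetched from the `users` collection for the sampled nodes.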

Given a `user_id`, you can query MongoDB to get the full record for that user.

```python
import pymongo
import networkx as nx
import pickle
import random

# Prepare db and graph
db = pymongo.MongoClient().yelp
with open('./yelp_users.pickle', 'rb') as f:  # pickles are byte streams
    graph = pickle.load(f)
print(nx.info(graph))  # nx.info is only available in networkx < 3.0
print(nx.is_connected(graph))

# Given a user u, find its info
u = random.choice(list(graph.nodes()))  # here, u is a user_id string,
                                        # e.g. u"ND6DMIKxM8Q1ShEMZuA5rA"
u_info = db.users.find_one({'user_id': u})
print(u_info)
print(u_info['average_stars'])
print(u_info['review_count'])
```
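
When you need the info for many users at once (for example, all neighbors of a node), one query with MongoDB's `$in` operator is usually cheaper than calling `find_one` in a loop. The helper below is a sketch under that assumption; `fetch_users` and its `fields` parameter are made-up names, and `users` is expected to be a PyMongo collection such as `db.users`.

```python
def fetch_users(users, user_ids, fields=('average_stars', 'review_count')):
    """Fetch several users in one query; returns {user_id: document}."""
    # Return only the requested fields plus user_id, and drop Mongo's _id.
    projection = {'_id': 0, 'user_id': 1}
    projection.update({f: 1 for f in fields})
    cursor = users.find({'user_id': {'$in': list(user_ids)}}, projection)
    return {doc['user_id']: doc for doc in cursor}
```

For instance, `fetch_users(db.users, list(graph.neighbors(u)))` would return the star average and review count for every friend of `u`.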

### Dataset reference
[Yelp Dataset Challenge](http://www.yelp.com/dataset_challenge)

