Skip to content

Instantly share code, notes, and snippets.

@andrewxhill
Created September 3, 2014 00:27
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save andrewxhill/8555086907bd7b06f9f3 to your computer and use it in GitHub Desktop.
Save andrewxhill/8555086907bd7b06f9f3 to your computer and use it in GitHub Desktop.
javier's notes on data munging yelp

How to process YELP data for Map visualization

Download the data

Take it from here:

http://www.yelp.com/dataset_challenge

Transforming the data from JSON to CSV string to be able to import it

Change quotes so that each JSON line inside the file gets imported as a row as a CSV

sed -i.bak "s/\"/\"\"/g" yelp_academic_dataset_business.json
sed -i.bak "s/\(.*\)/\"\1\"/" yelp_academic_dataset_business.json
awk 'BEGIN{print "data_json"}1' yelp_academic_dataset_business.json > yelp_bussiness.csv

Lets do the same with the checkins

sed -i.bak "s/\"/\"\"/g" yelp_academic_dataset_checkin.json
sed -i.bak "s/\(.*\)/\"\1\"/" yelp_academic_dataset_checkin.json

##Upload the yelp_bussiness.csv file to CartoDB You can do this manually just drag and drop

##Create a new table with the data parsed We are going to do a SQL query that reads the JSON and transforms it into multiple columns so CartoDB can use them easily. We will run this inside CartoDB already.

select 
data_json::json->>'type' as type, 
data_json::json->>'business_id' as business_id, 
data_json::json->>'name' as name, 
data_json::json->>'full_address' as full_address, 
data_json::json->>'city' as city, 
data_json::json->>'state' as state, 
(data_json::json->>'longitude')::float as longitude_f, 
(data_json::json->>'latitude')::float as latitude_f,
(data_json::json->>'stars')::float as stars, 
(data_json::json->>'review_count')::integer as review_count, 
(data_json::json->>'open')::boolean as open,
(data_json::json->'attributes')::json as attributes
from yelp_bussiness

If all this works fine, now we create a table from this view. Click on Option and then "Table from Query". Give it a name such as yelp_businesses.

After that we can georeference the table clicking on Options and selecting the latitude and longitude columns.

Now we have a table with all the business already imported.

create table checkins AS SELECT '2014-07-06 00:00'::timestamp + ( ( (((string_to_array(json_data.key,'-'))[2]::integer)*24) + (string_to_array(json_data.key,'-'))[1]::integer) ||' hour')::interval as timecheckin, json_data.value::integer AS checkins, y.data_json::json->>'business_id' as business_id FROM yelp_checkin as y, json_each_text(y.data_json::json->'checkin_info') AS json_data

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment