Created July 13, 2016 07:33
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Instantiate the BDD Shell context"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"execfile('ipython/00-bdd-shell-init.py')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Import JSON using SparkSQL\n",
"\n",
"(Data source: http://www.yelp.co.uk/dataset_challenge)\n",
"\n",
"Note that SparkSQL expects the JSON source to hold one complete record per line. If you have a single JSON document spread over multiple lines, you'll need to modify this code accordingly."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df = sqlContext.read.json('file:///u02/custom/yelp/yelp_academic_dataset_tip.small.json')"
]
},
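{
"cell_type": "markdown",
"metadata": {},
"source": [
"If your file does hold a single multi-line JSON document rather than one record per line, one workaround (a sketch; `yelp_single_object.json` is a hypothetical path) is to read each file whole with `sc.wholeTextFiles` and hand the resulting strings to `sqlContext.read.json` as an RDD:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# wholeTextFiles yields one (path, content) pair per file; keep just the\n",
"# content so each RDD element is a complete JSON document, then parse it\n",
"rdd = sc.wholeTextFiles('file:///u02/custom/yelp/yelp_single_object.json').values()\n",
"df = sqlContext.read.json(rdd)"
]
},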
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Preview schema"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"root\n",
" |-- business_id: string (nullable = true)\n",
" |-- date: string (nullable = true)\n",
" |-- likes: long (nullable = true)\n",
" |-- text: string (nullable = true)\n",
" |-- type: string (nullable = true)\n",
" |-- user_id: string (nullable = true)\n",
"\n"
]
}
],
"source": [
"df.printSchema()"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+----------+-----+--------------------+----+--------------------+\n",
"| business_id| date|likes| text|type| user_id|\n",
"+--------------------+----------+-----+--------------------+----+--------------------+\n",
"|cE27W9VPgO88Qxe4o...|2013-04-18| 0|Don't waste your ...| tip|-6rEfobYjMxpUWLNx...|\n",
"|mVHrayjG3uZ_RLHkL...|2013-01-06| 1|Your GPS will not...| tip|EZ0r9dKKtEGVx2Cdn...|\n",
"|KayYbHCt-RkbGcPdG...|2013-12-03| 0|Great drink speci...| tip|xb6zEQCw9I-Gl0g06...|\n",
"|KayYbHCt-RkbGcPdG...|2015-07-08| 0|Friendly staff, g...| tip|QawZN4PSW7ng_9SP7...|\n",
"|1_lU0-eSWJCRvNGk7...|2015-10-25| 0|Beautiful restora...| tip|MLQre1nvUtW-RqMTc...|\n",
"+--------------------+----------+-----+--------------------+----+--------------------+\n",
"only showing top 5 rows\n",
"\n"
]
}
],
"source": [
"df.show(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Write the Spark dataframe as a Hive table"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"tablename = 'tmp01'\n",
"qualified_tablename='default.' + tablename\n",
"df.write.mode('Overwrite').saveAsTable(qualified_tablename)"
]
},
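{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check (a sketch, and optional): read the table back through the Hive metastore and count its rows to confirm the write succeeded."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Resolve the table via the metastore; a count here proves the data landed\n",
"sqlContext.table(qualified_tablename).count()"
]
},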
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# Invoke DP CLI to add the new Hive table to the BDD Catalog\n",
"\n",
"*For this to work, you need to:*\n",
"\n",
"* *amend `spark-defaults.conf` to **comment out or delete** the following line:*\n",
"\n",
"    spark.serializer=org.apache.spark.serializer.KryoSerializer\n",
"\n",
"* *update `/etc/hadoop/conf/yarn-site.xml` to add a configuration for `yarn.nodemanager.resource.memory-mb` (I set it to `16000`). Without this, the `data_processing_CLI` job could be seen in YARN stuck at ACCEPTED.*"
]
},
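{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `yarn-site.xml` addition might look like this (a sketch; choose a value that suits your cluster's available memory):\n",
"\n",
"    <property>\n",
"      <name>yarn.nodemanager.resource.memory-mb</name>\n",
"      <value>16000</value>\n",
"    </property>"
]
},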
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from subprocess import call\n",
"call([\"/u01/bdd/v1.2.0/BDD-1.2.0.31.813/dataprocessing/edp_cli/data_processing_CLI\",\"--table\",tablename])\n",
"\n",
"# If you don't want to run the import automatically, comment out the call above\n",
"# and uncomment the following to print the statement to run manually\n",
"# print '/u01/bdd/v1.2.0/BDD-1.2.0.31.813/dataprocessing/edp_cli/data_processing_CLI --table %s' % tablename"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
} |