@rmoff
Created July 13, 2016 07:33
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Instantiate the BDD Shell context"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"execfile('ipython/00-bdd-shell-init.py')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Import JSON using SparkSQL\n",
"\n",
"(Data source: http://www.yelp.co.uk/dataset_challenge)\n",
"\n",
"Note that SparkSQL expects the JSON to be a single record per line. If you've got a single JSON object then you'll need to modify this code accordingly. "
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df = sqlContext.read.json('file:///u02/custom/yelp/yelp_academic_dataset_tip.small.json')"
]
},
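{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the file is instead a single multi-line JSON document, one option (a sketch, assuming each file holds exactly one JSON object; the path below is a placeholder) is to read each whole file as one string with `sc.wholeTextFiles` and pass the resulting RDD of strings to `sqlContext.read.json`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch: handle a single multi-line JSON object per file.\n",
"# wholeTextFiles yields (path, contents) pairs; keep just the contents,\n",
"# giving an RDD with one JSON string per file. Placeholder path below.\n",
"rdd = sc.wholeTextFiles('file:///u02/custom/yelp/multiline.json').map(lambda kv: kv[1])\n",
"df2 = sqlContext.read.json(rdd)"
]
},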
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Preview schema"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"root\n",
" |-- business_id: string (nullable = true)\n",
" |-- date: string (nullable = true)\n",
" |-- likes: long (nullable = true)\n",
" |-- text: string (nullable = true)\n",
" |-- type: string (nullable = true)\n",
" |-- user_id: string (nullable = true)\n",
"\n"
]
}
],
"source": [
"df.printSchema()"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+----------+-----+--------------------+----+--------------------+\n",
"| business_id| date|likes| text|type| user_id|\n",
"+--------------------+----------+-----+--------------------+----+--------------------+\n",
"|cE27W9VPgO88Qxe4o...|2013-04-18| 0|Don't waste your ...| tip|-6rEfobYjMxpUWLNx...|\n",
"|mVHrayjG3uZ_RLHkL...|2013-01-06| 1|Your GPS will not...| tip|EZ0r9dKKtEGVx2Cdn...|\n",
"|KayYbHCt-RkbGcPdG...|2013-12-03| 0|Great drink speci...| tip|xb6zEQCw9I-Gl0g06...|\n",
"|KayYbHCt-RkbGcPdG...|2015-07-08| 0|Friendly staff, g...| tip|QawZN4PSW7ng_9SP7...|\n",
"|1_lU0-eSWJCRvNGk7...|2015-10-25| 0|Beautiful restora...| tip|MLQre1nvUtW-RqMTc...|\n",
"+--------------------+----------+-----+--------------------+----+--------------------+\n",
"only showing top 5 rows\n",
"\n"
]
}
],
"source": [
"df.show(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Write the Spark dataframe as a Hive table"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"tablename = 'tmp01'\n",
"qualified_tablename='default.' + tablename\n",
"df.write.mode('Overwrite').saveAsTable(qualified_tablename)"
]
},
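{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, confirm the new table is queryable through the Hive metastore (a quick sanity check, not part of the original flow):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sanity check: query the newly-created Hive table back through SparkSQL\n",
"sqlContext.sql('SELECT COUNT(*) FROM ' + qualified_tablename).show()"
]
},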
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# Invoke DP CLI to add the new Hive table to the BDD Catalog\n",
"\n",
"*For this to work, you need :* \n",
"\n",
"* *amend `spark-defaults.conf` to **comment out/delete** the following line:*\n",
"\n",
" spark.serializer=org.apache.spark.serializer.KryoSerializer\n",
" \n",
"* *update `/etc/hadoop/conf/yarn-site.xml` and add a configuration for `yarn.nodemanager.resource.memory-mb` which I set to `16000`. Without this, the `data_processing_CLI` job could be seen in YARN as stuck at ACCEPTED.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from subprocess import call\n",
"call([\"/u01/bdd/v1.2.0/BDD-1.2.0.31.813/dataprocessing/edp_cli/data_processing_CLI\",\"--table\",tablename])\n",
"\n",
"# If you don't want to run the import automatically, comment out the call above\n",
"# and uncomment the following line to print the command to run manually:\n",
"# print '/u01/bdd/v1.2.0/BDD-1.2.0.31.813/dataprocessing/edp_cli/data_processing_CLI --table %s' % tablename"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}