@rmoff
Created July 13, 2016 07:33
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Instantiate the BDD Shell context"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"execfile('ipython/00-bdd-shell-init.py')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Import JSON using SparkSQL\n",
"\n",
"(Data source: http://www.yelp.co.uk/dataset_challenge)\n",
"\n",
"Note that SparkSQL expects the JSON to be a single record per line. If you've got a single JSON object then you'll need to modify this code accordingly. "
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df = sqlContext.read.json('file:///u02/custom/yelp/yelp_academic_dataset_tip.small.json')"
]
},
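{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the file is instead a single multi-line JSON document, one option (a sketch, assuming each file holds exactly one JSON object; the path below is a placeholder) is to read each whole file as one string with `sc.wholeTextFiles` and pass the resulting RDD of strings to `sqlContext.read.json`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch: handle a single multi-line JSON object per file.\n",
"# wholeTextFiles yields (path, contents) pairs; keep just the contents,\n",
"# giving an RDD with one JSON string per file. Placeholder path below.\n",
"rdd = sc.wholeTextFiles('file:///u02/custom/yelp/multiline.json').map(lambda kv: kv[1])\n",
"df2 = sqlContext.read.json(rdd)"
]
},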
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Preview schema"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"root\n",
" |-- business_id: string (nullable = true)\n",
" |-- date: string (nullable = true)\n",
" |-- likes: long (nullable = true)\n",
" |-- text: string (nullable = true)\n",
" |-- type: string (nullable = true)\n",
" |-- user_id: string (nullable = true)\n",
"\n"
]
}
],
"source": [
"df.printSchema()"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+----------+-----+--------------------+----+--------------------+\n",
"| business_id| date|likes| text|type| user_id|\n",
"+--------------------+----------+-----+--------------------+----+--------------------+\n",
"|cE27W9VPgO88Qxe4o...|2013-04-18| 0|Don't waste your ...| tip|-6rEfobYjMxpUWLNx...|\n",
"|mVHrayjG3uZ_RLHkL...|2013-01-06| 1|Your GPS will not...| tip|EZ0r9dKKtEGVx2Cdn...|\n",
"|KayYbHCt-RkbGcPdG...|2013-12-03| 0|Great drink speci...| tip|xb6zEQCw9I-Gl0g06...|\n",
"|KayYbHCt-RkbGcPdG...|2015-07-08| 0|Friendly staff, g...| tip|QawZN4PSW7ng_9SP7...|\n",
"|1_lU0-eSWJCRvNGk7...|2015-10-25| 0|Beautiful restora...| tip|MLQre1nvUtW-RqMTc...|\n",
"+--------------------+----------+-----+--------------------+----+--------------------+\n",
"only showing top 5 rows\n",
"\n"
]
}
],
"source": [
"df.show(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Write the Spark dataframe as a Hive table"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"tablename = 'tmp01'\n",
"qualified_tablename='default.' + tablename\n",
"df.write.mode('Overwrite').saveAsTable(qualified_tablename)"
]
},
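{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, confirm the new table is queryable through the Hive metastore (a quick sanity check, not part of the original flow):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sanity check: query the newly-created Hive table back through SparkSQL\n",
"sqlContext.sql('SELECT COUNT(*) FROM ' + qualified_tablename).show()"
]
},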
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# Invoke DP CLI to add the new Hive table to the BDD Catalog\n",
"\n",
"*For this to work, you need :* \n",
"\n",
"* *amend `spark-defaults.conf` to **comment out/delete** the following line:*\n",
"\n",
" spark.serializer=org.apache.spark.serializer.KryoSerializer\n",
" \n",
"* *update `/etc/hadoop/conf/yarn-site.xml` and add a configuration for `yarn.nodemanager.resource.memory-mb` which I set to `16000`. Without this, the `data_processing_CLI` job could be seen in YARN as stuck at ACCEPTED.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from subprocess import call\n",
"call([\"/u01/bdd/v1.2.0/BDD-1.2.0.31.813/dataprocessing/edp_cli/data_processing_CLI\",\"--table\",tablename])\n",
"\n",
"# If you don't want to run the import automatically, comment out the call above\n",
"# and uncomment the following line to print the command to run manually:\n",
"# print '/u01/bdd/v1.2.0/BDD-1.2.0.31.813/dataprocessing/edp_cli/data_processing_CLI --table %s' % tablename"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}