{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Intro to Notebooks with PixieDust \n",
"\n",
"<center>\n",
"<img style=\"max-width:200px; display:inline-block; padding-right:25px;\" src=\"https://libraries.mit.edu/news/files/2016/02/jupyter.png\"/>\n",
"<img style=\"max-width:200px; display:inline-block; padding-left:25px;\" src=\"https://github.com/ibm-watson-data-lab/pixiedust/raw/master/docs/_static/PixieDust%202C%20%28512x512%29.png\"/>\n",
" \n",
"<br/> \n",
"</center> \n",
"\n",
"### PART I \n",
"\n",
"* `display()` API \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Jupyter Notebooks\n",
"\n",
"[Jupyter Notebooks](https://jupyter.org/) are a powerful tool for fast and flexible data analysis and can contain live code, equations, visualizations and explanatory text."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## PixieDust\n",
"\n",
"[PixieDust](https://github.com/ibm-cds-labs/pixiedust) is an open source Python helper library that works as an add-on to Jupyter notebooks to extends the usability of notebooks."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br/> \n",
"\n",
"#### Install/Update PixieDust \n",
"\n",
"Make sure to have the latest version of PixieDust"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# !pip install --upgrade pixiedust"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br/> \n",
"\n",
"#### Import PixieDust\n",
"\n",
"Before, you can use the PixieDust library it must be imported into the notebook"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pixiedust"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br/> \n",
"\n",
"## One way to create a scatterplot \n",
"\n",
"\n",
"#### Load CSV data into a dataframe \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Load the csv\n",
"\n",
"path=\"cars.csv\"\n",
"df3 = sqlContext.read.format('com.databricks.spark.csv')\\\n",
" .options(header='true', mode=\"DROPMALFORMED\", inferschema='true').load(path)\n",
"df3.count()"
]
},
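{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### Check the inferred schema (optional)\n",
"\n",
"A minimal sanity check added for illustration: the cell below prints the schema Spark inferred for `df3`, which is useful because the plotting code that follows treats `DecimalType` columns specially when converting to pandas."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Inspect the column names and inferred types of the Spark DataFrame\n",
"df3.printSchema()"
]
},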
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### Plot the data\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from pyspark.sql.types import DecimalType\n",
"import matplotlib.pyplot as plt\n",
"from matplotlib import cm\n",
"import math\n",
"\n",
"maxRows = 500\n",
"def toPandas(workingDF): \n",
" decimals = []\n",
" for f in workingDF.schema.fields:\n",
" if f.dataType.__class__ == DecimalType:\n",
" decimals.append(f.name)\n",
"\n",
" pdf = workingDF.toPandas()\n",
" for y in pdf.columns:\n",
" if pdf[y].dtype.name == \"object\" and y in decimals:\n",
" #spark converts Decimal type to object during toPandas, cast it as float\n",
" pdf[y] = pdf[y].astype(float)\n",
"\n",
" return pdf\n",
"\n",
"xFields = [\"horsepower\"]\n",
"yFields = [\"mpg\"]\n",
"workingDF = df3.select(xFields + yFields)\n",
"workingDF = workingDF.dropna()\n",
"count = workingDF.count()\n",
"if count > maxRows:\n",
" workingDF = workingDF.sample(False, (float(maxRows) / float(count)))\n",
"pdf = toPandas(workingDF)\n",
"#sort by xFields\n",
"pdf.sort_values(xFields, inplace=True)\n",
"\n",
"fig, ax = plt.subplots(figsize=( int(1000/ 96), int(750 / 96) ))\n",
"\n",
"for i,keyField in enumerate(xFields):\n",
" pdf.plot(kind='scatter', x=keyField, y=yFields[0], label=keyField, ax=ax, color=cm.jet(1.*i/len(xFields)))\n",
"\n",
"#Conf the legend\n",
"if ax.get_legend() is not None and ax.title is None or not ax.title.get_visible() or ax.title.get_text() == '':\n",
" numLabels = len(ax.get_legend_handles_labels()[1])\n",
" nCol = int(min(max(math.sqrt( numLabels ), 3), 6))\n",
" nRows = int(numLabels/nCol)\n",
" bboxPos = max(1.15, 1.0 + ((float(nRows)/2)/10.0))\n",
" ax.legend(loc='upper center', bbox_to_anchor=(0.5, bboxPos),ncol=nCol, fancybox=True, shadow=True)\n",
"\n",
"#conf the xticks\n",
"labels = [s.get_text() for s in ax.get_xticklabels()]\n",
"totalWidth = sum(len(s) for s in labels) * 5\n",
"if totalWidth > 1000:\n",
" #filter down the list to max 20 \n",
" xl = [(i,a) for i,a in enumerate(labels) if i % int(len(labels)/20) == 0]\n",
" ax.set_xticks([x[0] for x in xl])\n",
" ax.set_xticklabels([x[1] for x in xl])\n",
" plt.xticks(rotation=30)\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"<br/> \n",
"\n",
"## PixieDust way to create a scatterplot \n",
"\n",
"\n",
"#### Load remote CSV data into a dataframe \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"cars = pixiedust.sampleData(\"https://github.com/ibm-cds-labs/open-data/raw/master/cars/cars.csv\") \n",
"cars.count()"
]
},
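{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### Browse built-in sample datasets (optional)\n",
"\n",
"As an aside, `pixiedust.sampleData()` called without an argument is meant to list the sample datasets that ship with PixieDust, and passing a dataset id loads one; the exact output may vary by PixieDust version, so treat this cell as a sketch."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# List the sample datasets bundled with PixieDust (behavior may vary by version)\n",
"pixiedust.sampleData()"
]
},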
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### Call display()\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pixiedust": {
"displayParams": {
"color": "year",
"handlerId": "scatterPlot",
"keyFields": "horsepower",
"kind": "resid",
"rendererId": "matplotlib",
"rowCount": "500",
"valueFields": "mpg"
}
}
},
"outputs": [],
"source": [
"display(cars)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"<br/> \n",
"\n",
"## display() controls \n",
"\n",
"#### Renderers \n",
"\n",
"* [Bokeh](http://bokeh.pydata.org/en/0.10.0/index.html)\n",
"* [Matplotlib](http://matplotlib.org/)\n",
"* [Seaborn](http://seaborn.pydata.org/index.html)\n",
"* [Mapbox](https://www.mapbox.com/)\n",
"* [Google GeoChart](https://developers.google.com/chart/interactive/docs/gallery/geochart)\n",
"\n",
"#### Chart options\n",
"\n",
"* **Chart types**\n",
"* **Options**\n",
"\n",
"To learn more : https://ibm-cds-labs.github.io/pixiedust/displayapi.html"
]
},
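{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### Setting options programmatically (optional)\n",
"\n",
"Options are normally chosen through the chart UI and saved in the cell metadata (as in the `pixiedust.displayParams` blocks above), but `display()` can also be seeded with keyword arguments that mirror those options. The exact set of supported keywords is an assumption here and may vary by PixieDust version; the cell below is a sketch, not the canonical API."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Assumed usage: keyword arguments mirror the displayParams stored in the cell metadata\n",
"display(cars, handlerId=\"scatterPlot\", keyFields=\"horsepower\", valueFields=\"mpg\", rowCount=\"500\")"
]
},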
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pixiedust": {
"displayParams": {
"aggregation": "SUM",
"chartsize": "85",
"handlerId": "mapView",
"keyFields": "state",
"rendererId": "google",
"rowCount": "500",
"valueFields": "unique_customers"
}
}
},
"outputs": [],
"source": [
"# create another dataframe, in a new variable\n",
"df2 = sqlContext.createDataFrame(\n",
"[(2010, 'Camping Equipment', 3, 'Texas'),\n",
" (2010, 'Golf Equipment', 1, 'Florida'),\n",
" (2010, 'Mountaineering Equipment', 1, 'Colorado'),\n",
" (2010, 'Outdoor Protection', 2, 'Colorado'),\n",
" (2010, 'Personal Accessories', 2, 'Massachusetts'),\n",
" (2011, 'Camping Equipment', 4, 'Colorado'),\n",
" (2011, 'Golf Equipment', 5, 'California'),\n",
" (2011, 'Mountaineering Equipment', 2, 'California'),\n",
" (2011, 'Outdoor Protection', 4, 'California'),\n",
" (2011, 'Personal Accessories', 2, 'California'),\n",
" (2012, 'Camping Equipment', 5, 'Texas'),\n",
" (2012, 'Golf Equipment', 5, 'Massachusetts'),\n",
" (2012, 'Mountaineering Equipment', 3, 'Washington'),\n",
" (2012, 'Outdoor Protection', 5, 'Maine'),\n",
" (2012, 'Personal Accessories', 3, 'New York'),\n",
" (2013, 'Camping Equipment', 8, 'Maine'),\n",
" (2013, 'Golf Equipment', 5, 'Florida'),\n",
" (2013, 'Mountaineering Equipment', 3, 'Vermont'),\n",
" (2013, 'Outdoor Protection', 8, 'Vermont'),\n",
" (2013, 'Personal Accessories', 4, 'Massachusetts')],\n",
"[\"year\",\"category\",\"unique_customers\", \"state\"])\n",
"\n",
"# This time, we've combined the dataframe and display() call in the same cell\n",
"# Run this cell \n",
"display(df2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "pySpark (Spark 1.6.0) Python 2",
"language": "python",
"name": "pyspark1.6"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}