{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Intro to Notebooks with PixieDust \n",
"\n",
"<center>\n",
"<img style=\"max-width:200px; display:inline-block; padding-right:25px;\" src=\"https://libraries.mit.edu/news/files/2016/02/jupyter.png\"/>\n",
"<img style=\"max-width:200px; display:inline-block; padding-left:25px;\" src=\"https://github.com/ibm-watson-data-lab/pixiedust/raw/master/docs/_static/PixieDust%202C%20%28512x512%29.png\"/>\n",
" \n",
"<br/> \n",
"</center> \n",
"\n",
"### PART I \n",
"\n",
"* `display()` API \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Jupyter Notebooks\n",
"\n",
"[Jupyter Notebooks](https://jupyter.org/) are a powerful tool for fast and flexible data analysis and can contain live code, equations, visualizations and explanatory text."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## PixieDust\n",
"\n",
"[PixieDust](https://github.com/ibm-cds-labs/pixiedust) is an open source Python helper library that works as an add-on to Jupyter notebooks to extends the usability of notebooks."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br/> \n",
"\n",
"#### Install/Update PixieDust \n",
"\n",
"Make sure to have the latest version of PixieDust"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# !pip install --upgrade pixiedust"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br/> \n",
"\n",
"#### Import PixieDust\n",
"\n",
"Before, you can use the PixieDust library it must be imported into the notebook"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pixiedust"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br/> \n",
"\n",
"## One way to create a scatterplot \n",
"\n",
"\n",
"#### Load CSV data into a dataframe \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Load the csv\n",
"\n",
"path=\"cars.csv\"\n",
"df3 = sqlContext.read.format('com.databricks.spark.csv')\\\n",
" .options(header='true', mode=\"DROPMALFORMED\", inferschema='true').load(path)\n",
"df3.count()"
]
},
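{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### Check the inferred schema (optional)\n",
"\n",
"A minimal sanity check added for illustration: the cell below prints the schema Spark inferred for `df3`, which is useful because the plotting code that follows treats `DecimalType` columns specially when converting to pandas."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Inspect the column names and inferred types of the Spark DataFrame\n",
"df3.printSchema()"
]
},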
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### Plot the data\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from pyspark.sql.types import DecimalType\n",
"import matplotlib.pyplot as plt\n",
"from matplotlib import cm\n",
"import math\n",
"\n",
"maxRows = 500\n",
"def toPandas(workingDF): \n",
" decimals = []\n",
" for f in workingDF.schema.fields:\n",
" if f.dataType.__class__ == DecimalType:\n",
" decimals.append(f.name)\n",
"\n",
" pdf = workingDF.toPandas()\n",
" for y in pdf.columns:\n",
" if pdf[y].dtype.name == \"object\" and y in decimals:\n",
" #spark converts Decimal type to object during toPandas, cast it as float\n",
" pdf[y] = pdf[y].astype(float)\n",
"\n",
" return pdf\n",
"\n",
"xFields = [\"horsepower\"]\n",
"yFields = [\"mpg\"]\n",
"workingDF = df3.select(xFields + yFields)\n",
"workingDF = workingDF.dropna()\n",
"count = workingDF.count()\n",
"if count > maxRows:\n",
" workingDF = workingDF.sample(False, (float(maxRows) / float(count)))\n",
"pdf = toPandas(workingDF)\n",
"#sort by xFields\n",
"pdf.sort_values(xFields, inplace=True)\n",
"\n",
"fig, ax = plt.subplots(figsize=( int(1000/ 96), int(750 / 96) ))\n",
"\n",
"for i,keyField in enumerate(xFields):\n",
" pdf.plot(kind='scatter', x=keyField, y=yFields[0], label=keyField, ax=ax, color=cm.jet(1.*i/len(xFields)))\n",
"\n",
"#Conf the legend\n",
"if ax.get_legend() is not None and ax.title is None or not ax.title.get_visible() or ax.title.get_text() == '':\n",
" numLabels = len(ax.get_legend_handles_labels()[1])\n",
" nCol = int(min(max(math.sqrt( numLabels ), 3), 6))\n",
" nRows = int(numLabels/nCol)\n",
" bboxPos = max(1.15, 1.0 + ((float(nRows)/2)/10.0))\n",
" ax.legend(loc='upper center', bbox_to_anchor=(0.5, bboxPos),ncol=nCol, fancybox=True, shadow=True)\n",
"\n",
"#conf the xticks\n",
"labels = [s.get_text() for s in ax.get_xticklabels()]\n",
"totalWidth = sum(len(s) for s in labels) * 5\n",
"if totalWidth > 1000:\n",
" #filter down the list to max 20 \n",
" xl = [(i,a) for i,a in enumerate(labels) if i % int(len(labels)/20) == 0]\n",
" ax.set_xticks([x[0] for x in xl])\n",
" ax.set_xticklabels([x[1] for x in xl])\n",
" plt.xticks(rotation=30)\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"<br/> \n",
"\n",
"## PixieDust way to create a scatterplot \n",
"\n",
"\n",
"#### Load remote CSV data into a dataframe \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"cars = pixiedust.sampleData(\"https://github.com/ibm-cds-labs/open-data/raw/master/cars/cars.csv\") \n",
"cars.count()"
]
},
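{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### Browse built-in sample datasets (optional)\n",
"\n",
"As an aside, `pixiedust.sampleData()` called without an argument is meant to list the sample datasets that ship with PixieDust, and passing a dataset id loads one; the exact output may vary by PixieDust version, so treat this cell as a sketch."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# List the sample datasets bundled with PixieDust (behavior may vary by version)\n",
"pixiedust.sampleData()"
]
},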
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### Call display()\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pixiedust": {
"displayParams": {
"color": "year",
"handlerId": "scatterPlot",
"keyFields": "horsepower",
"kind": "resid",
"rendererId": "matplotlib",
"rowCount": "500",
"valueFields": "mpg"
}
}
},
"outputs": [],
"source": [
"display(cars)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"<br/> \n",
"\n",
"## display() controls \n",
"\n",
"#### Renderers \n",
"\n",
"* [Bokeh](http://bokeh.pydata.org/en/0.10.0/index.html)\n",
"* [Matplotlib](http://matplotlib.org/)\n",
"* [Seaborn](http://seaborn.pydata.org/index.html)\n",
"* [Mapbox](https://www.mapbox.com/)\n",
"* [Google GeoChart](https://developers.google.com/chart/interactive/docs/gallery/geochart)\n",
"\n",
"#### Chart options\n",
"\n",
"* **Chart types**\n",
"* **Options**\n",
"\n",
"To learn more : https://ibm-cds-labs.github.io/pixiedust/displayapi.html"
]
},
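{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### Setting options programmatically (optional)\n",
"\n",
"Options are normally chosen through the chart UI and saved in the cell metadata (as in the `pixiedust.displayParams` blocks above), but `display()` can also be seeded with keyword arguments that mirror those options. The exact set of supported keywords is an assumption here and may vary by PixieDust version; the cell below is a sketch, not the canonical API."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Assumed usage: keyword arguments mirror the displayParams stored in the cell metadata\n",
"display(cars, handlerId=\"scatterPlot\", keyFields=\"horsepower\", valueFields=\"mpg\", rowCount=\"500\")"
]
},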
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pixiedust": {
"displayParams": {
"aggregation": "SUM",
"chartsize": "85",
"handlerId": "mapView",
"keyFields": "state",
"rendererId": "google",
"rowCount": "500",
"valueFields": "unique_customers"
}
}
},
"outputs": [],
"source": [
"# create another dataframe, in a new variable\n",
"df2 = sqlContext.createDataFrame(\n",
"[(2010, 'Camping Equipment', 3, 'Texas'),\n",
" (2010, 'Golf Equipment', 1, 'Florida'),\n",
" (2010, 'Mountaineering Equipment', 1, 'Colorado'),\n",
" (2010, 'Outdoor Protection', 2, 'Colorado'),\n",
" (2010, 'Personal Accessories', 2, 'Massachusetts'),\n",
" (2011, 'Camping Equipment', 4, 'Colorado'),\n",
" (2011, 'Golf Equipment', 5, 'California'),\n",
" (2011, 'Mountaineering Equipment', 2, 'California'),\n",
" (2011, 'Outdoor Protection', 4, 'California'),\n",
" (2011, 'Personal Accessories', 2, 'California'),\n",
" (2012, 'Camping Equipment', 5, 'Texas'),\n",
" (2012, 'Golf Equipment', 5, 'Massachusetts'),\n",
" (2012, 'Mountaineering Equipment', 3, 'Washington'),\n",
" (2012, 'Outdoor Protection', 5, 'Maine'),\n",
" (2012, 'Personal Accessories', 3, 'New York'),\n",
" (2013, 'Camping Equipment', 8, 'Maine'),\n",
" (2013, 'Golf Equipment', 5, 'Florida'),\n",
" (2013, 'Mountaineering Equipment', 3, 'Vermont'),\n",
" (2013, 'Outdoor Protection', 8, 'Vermont'),\n",
" (2013, 'Personal Accessories', 4, 'Massachusetts')],\n",
"[\"year\",\"category\",\"unique_customers\", \"state\"])\n",
"\n",
"# This time, we've combined the dataframe and display() call in the same cell\n",
"# Run this cell \n",
"display(df2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "pySpark (Spark 1.6.0) Python 2",
"language": "python",
"name": "pyspark1.6"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}