@bcyoungV
Created April 18, 2021 09:21
{
"cells": [
{
"metadata": {
"collapsed": true
},
"cell_type": "markdown",
"source": "# Welcome to PixieDust\n\nThis notebook is an introduction to [PixieDust](https://ibm-watson-data-lab.github.io/pixiedust/index.html), a Python library that makes data visualization easy.\n\n## <a id=\"toc\"></a>Table of Contents\n\n * [Get started](#part_one)\n * [Load text data from remote sources](#part_two)\n * [Contribute](#contribute)\n\n\n<hr>\n\n# <a id=\"part_one\"></a>Get started\n\nThis introduction is pretty straightforward, but it wouldn't hurt to keep the [PixieDust documentation](https://ibm-watson-data-lab.github.io/pixiedust/) open so it's handy.\n\nNew to notebooks? Don't worry. Here's all you need to know to run this introduction:\n\n1. Make sure this notebook is in Edit mode.\n1. To run a code cell, put your cursor in the cell and press **Shift + Enter**.\n1. While a cell is executing, its cell number changes to **[\\*]**. (When starting with notebooks, it's best to run cells in order, one at a time.)"
},
{
"metadata": {
"scrolled": true
},
"cell_type": "code",
"source": "# To confirm you have the latest version of PixieDust on your system, run this cell\n!pip install -U --no-deps pixiedust",
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"text": "Waiting for a Spark session to start...\nSpark Initialization Done! ApplicationId = app-20201115125805-0000\nKERNEL_ID = 496fe3cb-d26d-4a4a-9ea9-d1d216cd8b92\nCollecting pixiedust\n Downloading pixiedust-1.1.18.tar.gz (197 kB)\nBuilding wheels for collected packages: pixiedust\n Building wheel for pixiedust (setup.py) ... done\n Created wheel for pixiedust: filename=pixiedust-1.1.18-py3-none-any.whl size=321728 sha256=3679b0ea56aa16c78f76c24b3a8300a4c5eed6191c3ddf6bb6705f255e60c0d9\n Stored in directory: /home/spark/shared/.cache/pip/wheels/41/4c/20/08a843440aaeffc976c1848c9eb44be6ec68dcd964421ec6f7\nSuccessfully built pixiedust\nInstalling collected packages: pixiedust\nSuccessfully installed pixiedust-1.1.18\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Now that you have PixieDust installed and up-to-date on your system, you need to import it into this notebook. This is the last dependency before you can play with PixieDust."
},
{
"metadata": {},
"cell_type": "code",
"source": "import pixiedust",
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"text": "Waiting for a Spark session to start...\nSpark Initialization Done! ApplicationId = app-20201115125904-0001\nKERNEL_ID = 496fe3cb-d26d-4a4a-9ea9-d1d216cd8b92\nPixiedust database opened successfully\n",
"name": "stdout"
},
{
"output_type": "display_data",
"data": {
"text/html": "\n <div style=\"margin:10px\">\n <a href=\"https://github.com/ibm-watson-data-lab/pixiedust\" target=\"_new\">\n <img src=\"https://github.com/ibm-watson-data-lab/pixiedust/raw/master/docs/_static/pd_icon32.png\" style=\"float:left;margin-right:10px\"/>\n </a>\n <span>Pixiedust version 1.1.18</span>\n </div>\n ",
"text/plain": "<IPython.core.display.HTML object>"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "If you get a message telling you that you're not running the latest version of PixieDust, restart the kernel from the **Kernel** menu and rerun the `import pixiedust` command. (Any time you restart the kernel, rerun the `import pixiedust` command.)"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Behold, display()\n\nIn the next cell, build a simple dataset and store it in a variable. "
},
{
"metadata": {},
"cell_type": "code",
"source": "# Create a Spark DataFrame from in-line data and assign it to a variable.\n# The kernel's built-in SparkSession (`spark`) is used directly, so no separate SQLContext is needed.\ndf = spark.createDataFrame(\n    [(\"Cats\", 75),\n     (\"Dogs\", 25)],\n    [\"Pets\", \"%\"])",
"execution_count": 2,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "The data in the variable `df` is ready to be visualized, without any further code other than the call to `display()`."
},
{
"metadata": {
"pixiedust": {
"displayParams": {
"aggregation": "SUM",
"chartsize": "55",
"handlerId": "barChart",
"keyFields": "Pets",
"rowCount": "100",
"title": "Pets in this bar chart,by %",
"valueFields": "%",
"table_noschema": "false",
"rendererId": "matplotlib",
"tableFields": "Colors",
"table_nosearch": "false",
"table_nocount": "false",
"table_showrows": "Missing values"
}
},
"scrolled": false
},
"cell_type": "code",
"source": "# display the dataframe above as a bar chart\ndisplay(df)",
"execution_count": 3,
"outputs": [
{
"data": {
"text/html": "<style type=\"text/css\">.pd_warning{display:none;}</style><div class=\"pd_warning\"><em>Hey, there's something awesome here! To see it, open this notebook outside GitHub, in a viewer like Jupyter</em></div>\n <div class=\"pd_save is-viewer-good\" style=\"padding-right:10px;text-align: center;line-height:initial !important;font-size: xx-large;font-weight: 500;color: coral;\">\n Pets in this bar chart,by %\n </div>\n <div id=\"chartFigure35f3bc1b\" class=\"pd_save is-viewer-good\" style=\"overflow-x:auto\">\n \n \n <center><img style=\"max-width:initial !important\" src=\"\" class=\"pd_save\"></center>\n \n \n \n </div>",
"text/plain": "<IPython.core.display.HTML object>"
},
"metadata": {},
"output_type": "display_data"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "After running the cell above, you should see a Spark DataFrame displayed as a **bar chart**, along with some controls to tweak the display. All that came from passing the DataFrame variable to `display()`.\n\nIn the next cell, you'll pass more interesting data to `display()`, which will also offer more advanced controls."
},
{
"metadata": {
"pixiedust": {
"displayParams": {
"aggregation": "SUM",
"chartsize": "78",
"clusterby": "year",
"handlerId": "barChart",
"keyFields": "category",
"mpld3": "false",
"rendererId": "matplotlib",
"rowCount": "100",
"title": "Customers by Category clustered by Year",
"valueFields": "unique_customers"
}
}
},
"cell_type": "code",
"source": "# create another DataFrame, in a new variable\ndf2 = spark.createDataFrame(\n[(2010, 'Camping Equipment', 3),\n (2010, 'Golf Equipment', 1),\n (2010, 'Mountaineering Equipment', 1),\n (2010, 'Outdoor Protection', 2),\n (2010, 'Personal Accessories', 2),\n (2011, 'Camping Equipment', 4),\n (2011, 'Golf Equipment', 5),\n (2011, 'Mountaineering Equipment',2),\n (2011, 'Outdoor Protection', 4),\n (2011, 'Personal Accessories', 2),\n (2012, 'Camping Equipment', 5),\n (2012, 'Golf Equipment', 5),\n (2012, 'Mountaineering Equipment', 3),\n (2012, 'Outdoor Protection', 5),\n (2012, 'Personal Accessories', 3),\n (2013, 'Camping Equipment', 8),\n (2013, 'Golf Equipment', 5),\n (2013, 'Mountaineering Equipment', 3),\n (2013, 'Outdoor Protection', 8),\n (2013, 'Personal Accessories', 4)],\n[\"year\",\"category\",\"unique_customers\"])\n\n# This time, we've combined the dataframe and display() call in the same cell\n# Run this cell \ndisplay(df2)",
"execution_count": 4,
"outputs": [
{
"data": {
"text/html": "<style type=\"text/css\">.pd_warning{display:none;}</style><div class=\"pd_warning\"><em>Hey, there's something awesome here! To see it, open this notebook outside GitHub, in a viewer like Jupyter</em></div>\n <div class=\"pd_save is-viewer-good\" style=\"padding-right:10px;text-align: center;line-height:initial !important;font-size: xx-large;font-weight: 500;color: coral;\">\n Customers by Category clustered by Year\n </div>\n <div id=\"chartFiguree5656a27\" class=\"pd_save is-viewer-good\" style=\"overflow-x:auto\">\n \n \n <center><img style=\"max-width:initial !important\" src=\"\" class=\"pd_save\"></center>\n \n \n \n </div>",
"text/plain": "<IPython.core.display.HTML object>"
},
"metadata": {},
"output_type": "display_data"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Load External Data\nSo far, you've worked with data hard-coded into the notebook. Now, load external data (a CSV file) from a URL."
},
{
"metadata": {
"pixiedust": {
"displayParams": {
"chartsize": "50",
"color": "origin",
"handlerId": "scatterPlot",
"keyFields": "horsepower",
"kind": "hex",
"rendererId": "matplotlib",
"rowCount": "1000",
"title": "Distribution of MPG per Horsepower",
"valueFields": "mpg"
}
}
},
"cell_type": "code",
"source": "\n# load a CSV with pixiedust.sampleData()\ndf3 = pixiedust.sampleData(\"https://github.com/ibm-watson-data-lab/open-data/raw/master/cars/cars.csv\")\ndisplay(df3)",
"execution_count": 5,
"outputs": [
{
"data": {
"text/html": "<style type=\"text/css\">.pd_warning{display:none;}</style><div class=\"pd_warning\"><em>Hey, there's something awesome here! To see it, open this notebook outside GitHub, in a viewer like Jupyter</em></div>\n <div class=\"pd_save is-viewer-good\" style=\"padding-right:10px;text-align: center;line-height:initial !important;font-size: xx-large;font-weight: 500;color: coral;\">\n Distribution of MPG per Horsepower\n </div>\n <div id=\"chartFiguref92922ad\" class=\"pd_save is-viewer-good\" style=\"overflow-x:auto\">\n \n \n <center><img style=\"max-width:initial !important\" src=\"\" class=\"pd_save\"></center>\n \n \n \n </div>",
"text/plain": "<IPython.core.display.HTML object>"
},
"metadata": {},
"output_type": "display_data"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "You should see a scatterplot above, rendered again by matplotlib. Find the `Renderer` menu at top-right. You should see options for **Bokeh** and **Seaborn**. If you don't see Seaborn, it's not installed on your system. No problem, just install it by running the next cell."
},
{
"metadata": {
"scrolled": true
},
"cell_type": "code",
"source": "# To install Seaborn, uncomment the next line, and then run this cell\n#!pip install --user seaborn",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "*If you installed Seaborn, you'll also need to restart your notebook kernel and run the `import pixiedust` cell again. Find **Restart** in the **Kernel** menu above.*"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "End of chapter. [Return to table of contents](#toc)\n<hr>"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "\n# <a id=\"part_two\"></a>Load text data from remote sources\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Data files commonly reside in remote sources, such as public or private marketplaces or GitHub repositories. You can load comma-separated value (CSV) data files using PixieDust's `sampleData` method."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Prerequisites"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "If you haven't already, import PixieDust. Follow the instructions in [Get started](#part_one)."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Load data\n\nTo load a data set, run `pixiedust.sampleData` and specify the data set URL:"
},
{
"metadata": {},
"cell_type": "code",
"source": "homes = pixiedust.sampleData(\"https://raw.githubusercontent.com/ibm-watson-data-lab/open-data/master/homesales/milliondollarhomes.csv\")",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "The `pixiedust.sampleData` method loads the data into an [Apache Spark DataFrame](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes), which you can inspect and visualize using `display()`."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Inspect and preview the loaded data\n\nTo inspect the automatically inferred schema and preview a small subset of the data, you can use the _DataFrame Table_ view, as shown in this preconfigured example: "
},
{
"metadata": {
"pixiedust": {
"displayParams": {
"handlerId": "tableView"
}
}
},
"cell_type": "code",
"source": "display(homes)",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Simple visualization using bar charts\n\nWith PixieDust `display()`, you can visually explore the loaded data using built-in charts, such as bar charts, line charts, scatter plots, or maps.\n\nTo explore a data set:\n* choose the desired chart type from the drop-down menu\n* configure chart options\n* configure display options\n\nYou can analyze the average home price for each city by choosing: \n* chart type: bar chart\n* chart options\n * _Options > Keys_: `CITY`\n * _Options > Values_: `PRICE` \n * _Options > Aggregation_: `AVG`\n \nRun the next cell to review the results. "
},
{
"metadata": {
"pixiedust": {
"displayParams": {
"aggregation": "AVG",
"chartsize": "51",
"handlerId": "barChart",
"keyFields": "CITY",
"legend": "true",
"mpld3": "false",
"rendererId": "matplotlib",
"rowCount": "100",
"stretch": "true",
"title": "Average home price by city",
"valueFields": "PRICE"
}
},
"scrolled": false
},
"cell_type": "code",
"source": "display(homes)",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Explore the data\n\nYou can change the display **Options** to keep exploring the loaded data set without having to pre-process the data. \n\nFor example, change: \n* _Options > Keys_ to `YEAR_BUILT` and \n* _Options > Aggregation_ to `COUNT` \n\nNow you can find out how old the listed properties are:"
},
{
"metadata": {
"pixiedust": {
"displayParams": {
"aggregation": "COUNT",
"chartsize": "50",
"handlerId": "barChart",
"keyFields": "YEAR BUILT",
"legend": "false",
"rowCount": "100",
"stretch": "true",
"title": "Property age",
"valueFields": "PRICE"
}
}
},
"cell_type": "code",
"source": "display(homes)",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Use sample data sets\n\nPixieDust comes with a set of curated data sets that you can use to get familiar with the different chart types and options. \n\nRun `pixiedust.sampleData()` with no arguments to list those data sets."
},
{
"metadata": {},
"cell_type": "code",
"source": "pixiedust.sampleData()",
"execution_count": null,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "The home sales data set you loaded earlier is one of the samples, so you could also have loaded it by passing its displayed data set ID as a parameter: `homes = pixiedust.sampleData(6)`"
},
{
"metadata": {
"collapsed": true
},
"cell_type": "markdown",
"source": "If your data isn't stored in CSV files, you can load it into a DataFrame from any supported Spark [data source](https://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources). See [these Python code snippets](https://apsportal.ibm.com/docs/content/analyze-data/python_load.html) for more information."
},
{
"metadata": {
"collapsed": true
},
"cell_type": "markdown",
"source": "End of chapter. [Return to table of contents](#toc)\n<hr>"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "# <a id=\"contribute\"></a>Contribute\n\nBy now, you've walked through PixieDust's intro notebooks and seen PixieDust in action. If you like what you saw, join [the project](https://github.com/ibm-watson-data-lab/pixiedust)! \n\nAnyone can get involved. Here are some ways you can [contribute](https://ibm-watson-data-lab.github.io/pixiedust/contribute.html):\n\n - [Write a visualization](#Write-a-visualization)\n - [Build a renderer](#Build-a-renderer)\n - [Enter an issue](#Enter-an-issue)\n - [Share PixieDust](#Share-PixieDust)\n - [Learn more](#Learn-more)\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "End of chapter. [Return to table of contents](#toc)\n\n## Authors\n* Jose Barbosa\n* Mike Broberg\n* Inge Halilovic\n* Jess Mantaro\n* Brad Noble\n* David Taieb\n* Patrick Titzler\n\n<hr>\nCopyright &copy; IBM Corp. 2017, 2018. This notebook and its source code are released under the terms of the MIT License."
}
],
"metadata": {
"celltoolbar": "Edit Metadata",
"kernelspec": {
"name": "python37",
"display_name": "Python 3.7 with Spark",
"language": "python3"
},
"language_info": {
"mimetype": "text/x-python",
"nbconvert_exporter": "python",
"name": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10",
"file_extension": ".py",
"codemirror_mode": {
"version": 3,
"name": "ipython"
}
}
},
"nbformat": 4,
"nbformat_minor": 1
}
@bcyoungV
Author

In [1]: installation of the pixiedust library