ruby232/intro_to_pandas.ipynb

## intro_to_pandas.ipynb
{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"Copia de intro_to_pandas.ipynb","version":"0.3.2","provenance":[{"file_id":"/v2/external/notebooks/mlcc/intro_to_pandas.ipynb","timestamp":1541773518150}],"collapsed_sections":["JndnmDMp66FL","YHIWvc9Ms-Ll","TJffr5_Jwqvd"]}},"cells":[{"metadata":{"colab_type":"text","id":"JndnmDMp66FL"},"cell_type":"markdown","source":["#### Copyright 2017 Google LLC."]},{"metadata":{"colab_type":"code","id":"hMqWDc_m6rUC","cellView":"both","colab":{}},"cell_type":"code","source":["# Licensed under the Apache License, Version 2.0 (the \"License\");\n","# you may not use this file except in compliance with the License.\n","# You may obtain a copy of the License at\n","#\n","# https://www.apache.org/licenses/LICENSE-2.0\n","#\n","# Unless required by applicable law or agreed to in writing, software\n","# distributed under the License is distributed on an \"AS IS\" BASIS,\n","# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n","# See the License for the specific language governing permissions and\n","# limitations under the License."],"execution_count":0,"outputs":[]},{"metadata":{"id":"hkTqyziS3NJE","colab_type":"code","colab":{}},"cell_type":"code","source":[""],"execution_count":0,"outputs":[]},{"metadata":{"colab_type":"text","id":"rHLcriKWLRe4"},"cell_type":"markdown","source":["# Quick Introduction to pandas"]},{"metadata":{"colab_type":"text","id":"QvJBqX8_Bctk"},"cell_type":"markdown","source":["**Learning Objectives:**\n","  * Gain an introduction to the `DataFrame` and `Series` data structures of the *pandas* library\n","  * Access and manipulate data within a `DataFrame` and `Series`\n","  * Import CSV data into a *pandas* `DataFrame`\n","  * Reindex a `DataFrame` to shuffle data"]},{"metadata":{"colab_type":"text","id":"TIFJ83ZTBctl"},"cell_type":"markdown","source":["[*pandas*](http://pandas.pydata.org/) is a column-oriented data analysis API. 
It's a great tool for handling and analyzing input data, and many ML frameworks support *pandas* data structures as inputs.\n","Although a comprehensive introduction to the *pandas* API would span many pages, the core concepts are fairly straightforward, and we'll present them below. For a more complete reference, the [*pandas* docs site](http://pandas.pydata.org/pandas-docs/stable/index.html) contains extensive documentation and many tutorials."]},{"metadata":{"colab_type":"text","id":"s_JOISVgmn9v"},"cell_type":"markdown","source":["## Basic Concepts\n","\n","The following line imports the *pandas* API and prints the API version:"]},{"metadata":{"colab_type":"code","id":"aSRYu62xUi3g","colab":{}},"cell_type":"code","source":["from __future__ import print_function\n","\n","import pandas as pd\n","pd.__version__"],"execution_count":0,"outputs":[]},{"metadata":{"colab_type":"text","id":"daQreKXIUslr"},"cell_type":"markdown","source":["The primary data structures in *pandas* are implemented as two classes:\n","\n","  * **`DataFrame`**, which you can imagine as a relational data table, with rows and named columns.\n","  * **`Series`**, which is a single column. A `DataFrame` contains one or more `Series` and a name for each `Series`.\n","\n","The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in [Spark](https://spark.apache.org/) and [R](https://www.r-project.org/about.html)."]},{"metadata":{"colab_type":"text","id":"fjnAk1xcU0yc"},"cell_type":"markdown","source":["One way to create a `Series` is to construct a `Series` object. 
For example:"]},{"metadata":{"colab_type":"code","id":"DFZ42Uq7UFDj","colab":{}},"cell_type":"code","source":["pd.Series(['San Francisco', 'San Jose', 'Sacramento'])"],"execution_count":0,"outputs":[]},{"metadata":{"colab_type":"text","id":"U5ouUp1cU6pC"},"cell_type":"markdown","source":["`DataFrame` objects can be created by passing a `dict` mapping `string` column names to their respective `Series`. If the `Series` don't match in length, missing values are filled with special [NA/NaN](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) values. Example:"]},{"metadata":{"colab_type":"code","id":"avgr6GfiUh8t","colab":{}},"cell_type":"code","source":["city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])\n","population = pd.Series([852469, 1015785, 485199])\n","\n","pd.DataFrame({ 'City name': city_names, 'Population': population })"],"execution_count":0,"outputs":[]},{"metadata":{"colab_type":"text","id":"oa5wfZT7VHJl"},"cell_type":"markdown","source":["But most of the time, you load an entire file into a `DataFrame`. The following example loads a file with California housing data. Run the following cell to load the data and display summary statistics for each column:"]},{"metadata":{"colab_type":"code","id":"av6RYOraVG1V","colab":{}},"cell_type":"code","source":["california_housing_dataframe = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv\", sep=\",\")\n","california_housing_dataframe.describe()"],"execution_count":0,"outputs":[]},{"metadata":{"colab_type":"text","id":"WrkBjfz5kEQu"},"cell_type":"markdown","source":["The example above used `DataFrame.describe` to show summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) about a `DataFrame`. 
Another useful function is `DataFrame.head`, which displays the first few records of a `DataFrame`:"]},{"metadata":{"colab_type":"code","id":"s3ND3bgOkB5k","colab":{}},"cell_type":"code","source":["california_housing_dataframe.head()"],"execution_count":0,"outputs":[]},{"metadata":{"colab_type":"text","id":"w9-Es5Y6laGd"},"cell_type":"markdown","source":["Another powerful feature of *pandas* is graphing. For example, `DataFrame.hist` lets you quickly study the distribution of values in a column:"]},{"metadata":{"colab_type":"code","id":"nqndFVXVlbPN","colab":{}},"cell_type":"code","source":["california_housing_dataframe.hist('housing_median_age')"],"execution_count":0,"outputs":[]},{"metadata":{"colab_type":"text","id":"XtYZ7114n3b-"},"cell_type":"markdown","source":["## Accessing Data\n","\n","You can access `DataFrame` data using familiar Python dict/list operations:"]},{"metadata":{"colab_type":"code","id":"_TFm7-looBFF","colab":{}},"cell_type":"code","source":["cities = pd.DataFrame({ 'City name': city_names, 'Population': population })\n","print(type(cities['City name']))\n","cities['City name']"],"execution_count":0,"outputs":[]},{"metadata":{"colab_type":"code","id":"V5L6xacLoxyv","colab":{}},"cell_type":"code","source":["print(type(cities['City name'][1]))\n","cities['City name'][1]"],"execution_count":0,"outputs":[]},{"metadata":{"colab_type":"code","id":"gcYX1tBPugZl","colab":{}},"cell_type":"code","source":["print(type(cities[0:2]))\n","cities[0:2]"],"execution_count":0,"outputs":[]},{"metadata":{"colab_type":"text","id":"65g1ZdGVjXsQ"},"cell_type":"markdown","source":["In addition, *pandas* provides an extremely rich API for advanced [indexing and selection](http://pandas.pydata.org/pandas-docs/stable/indexing.html) that is too extensive to be covered here."]},{"metadata":{"colab_type":"text","id":"RM1iaD-ka3Y1"},"cell_type":"markdown","source":["## Manipulating Data\n","\n","You may apply Python's basic arithmetic operations to `Series`. 
For example:"]},{"metadata":{"colab_type":"code","id":"XWmyCFJ5bOv-","colab":{}},"cell_type":"code","source":["population / 1000."],"execution_count":0,"outputs":[]},{"metadata":{"colab_type":"text","id":"TQzIVnbnmWGM"},"cell_type":"markdown","source":["[NumPy](http://www.numpy.org/) is a popular toolkit for scientific computing. *pandas* `Series` can be used as arguments to most NumPy functions:"]},{"metadata":{"colab_type":"code","id":"ko6pLK6JmkYP","colab":{}},"cell_type":"code","source":["import numpy as np\n","\n","np.log(population)"],"execution_count":0,"outputs":[]},{"metadata":{"colab_type":"text","id":"xmxFuQmurr6d"},"cell_type":"markdown","source":["For more complex single-column transformations, you can use `Series.apply`. Like the Python [map function](https://docs.python.org/2/library/functions.html#map), \n","`Series.apply` accepts as an argument a [lambda function](https://docs.python.org/2/tutorial/controlflow.html#lambda-expressions), which is applied to each value.\n","\n","The example below creates a new `Series` that indicates whether `population` is over one million:"]},{"metadata":{"colab_type":"code","id":"Fc1DvPAbstjI","colab":{}},"cell_type":"code","source":["population.apply(lambda val: val > 1000000)"],"execution_count":0,"outputs":[]},{"metadata":{"colab_type":"text","id":"ZeYYLoV9b9fB"},"cell_type":"markdown","source":["\n","Modifying `DataFrames` is also straightforward. 
For example, the following code adds two `Series` to an existing `DataFrame`:"]},{"metadata":{"colab_type":"code","id":"0gCEX99Hb8LR","colab":{}},"cell_type":"code","source":["cities['Area square miles'] = pd.Series([46.87, 176.53, 97.92])\n","cities['Population density'] = cities['Population'] / cities['Area square miles']\n","cities"],"execution_count":0,"outputs":[]},{"metadata":{"colab_type":"text","id":"6qh63m-ayb-c"},"cell_type":"markdown","source":["## Exercise #1\n","\n","Modify the `cities` table by adding a new boolean column that is True if and only if *both* of the following are True:\n","\n","  * The city is named after a saint.\n","  * The city has an area greater than 50 square miles.\n","\n","**Note:** Boolean `Series` are combined using the bitwise, rather than the traditional boolean, operators. For example, when performing *logical and*, use `&` instead of `and`.\n","\n","**Hint:** \"San\" in Spanish means \"saint.\""]},{"metadata":{"colab_type":"code","id":"zCOn8ftSyddH","colab":{}},"cell_type":"code","source":["# Your code here"],"execution_count":0,"outputs":[]},{"metadata":{"colab_type":"text","id":"YHIWvc9Ms-Ll"},"cell_type":"markdown","source":["### Solution\n","\n","Click below for a solution."]},{"metadata":{"colab_type":"code","id":"T5OlrqtdtCIb","colab":{}},"cell_type":"code","source":["cities['Is wide and has saint name'] = (cities['Area square miles'] > 50) & cities['City name'].apply(lambda name: name.startswith('San'))\n","cities"],"execution_count":0,"outputs":[]},{"metadata":{"colab_type":"text","id":"f-xAOJeMiXFB"},"cell_type":"markdown","source":["## Indexes\n","Both `Series` and `DataFrame` objects also define an `index` property that assigns an identifier value to each `Series` item or `DataFrame` row. \n","\n","By default, at construction, *pandas* assigns index values that reflect the ordering of the source data. 
Once created, the index values are stable; that is, they do not change when data is reordered."]},{"metadata":{"colab_type":"code","id":"2684gsWNinq9","colab":{}},"cell_type":"code","source":["city_names.index"],"execution_count":0,"outputs":[]},{"metadata":{"colab_type":"code","id":"F_qPe2TBjfWd","colab":{}},"cell_type":"code","source":["cities.index"],"execution_count":0,"outputs":[]},{"metadata":{"colab_type":"text","id":"hp2oWY9Slo_h"},"cell_type":"markdown","source":["Call `DataFrame.reindex` to manually reorder the rows. For example, the following has the same effect as sorting by city name:"]},{"metadata":{"colab_type":"code","id":"sN0zUzSAj-U1","colab":{}},"cell_type":"code","source":["cities.reindex([2, 0, 1])"],"execution_count":0,"outputs":[]},{"metadata":{"colab_type":"text","id":"-GQFz8NZuS06"},"cell_type":"markdown","source":["Reindexing is a great way to shuffle (randomize) a `DataFrame`. In the example below, we take the index, which is array-like, and pass it to NumPy's `random.permutation` function, which returns a randomly shuffled copy of its values and leaves the original index unchanged. Calling `reindex` with this shuffled array causes the `DataFrame` rows to be shuffled in the same way.\n","Try running the following cell multiple times!"]},{"metadata":{"colab_type":"code","id":"mF8GC0k8uYhz","colab":{}},"cell_type":"code","source":["cities.reindex(np.random.permutation(cities.index))"],"execution_count":0,"outputs":[]},{"metadata":{"colab_type":"text","id":"fSso35fQmGKb"},"cell_type":"markdown","source":["For more information, see the [Index documentation](http://pandas.pydata.org/pandas-docs/stable/indexing.html#index-objects)."]},{"metadata":{"colab_type":"text","id":"8UngIdVhz8C0"},"cell_type":"markdown","source":["## Exercise #2\n","\n","The `reindex` method allows index values that are not in the original `DataFrame`'s index values. Try it and see what happens if you use such values! 
Why do you think this is allowed?"]},{"metadata":{"colab_type":"code","id":"PN55GrDX0jzO","colab":{}},"cell_type":"code","source":["# Your code here"],"execution_count":0,"outputs":[]},{"metadata":{"colab_type":"text","id":"TJffr5_Jwqvd"},"cell_type":"markdown","source":["### Solution\n","\n","Click below for the solution."]},{"metadata":{"colab_type":"text","id":"8oSvi2QWwuDH"},"cell_type":"markdown","source":["If your `reindex` input array includes values not in the original `DataFrame` index values, `reindex` will add new rows for these \"missing\" indices and populate all corresponding columns with `NaN` values:"]},{"metadata":{"colab_type":"code","id":"yBdkucKCwy4x","colab":{}},"cell_type":"code","source":["cities.reindex([0, 4, 5, 2])"],"execution_count":0,"outputs":[]},{"metadata":{"colab_type":"text","id":"2l82PhPbwz7g"},"cell_type":"markdown","source":["This behavior is desirable because indexes are often strings pulled from the actual data (see the [*pandas* reindex\n","documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html) for an example\n","in which the index values are browser names).\n","\n","In this case, allowing \"missing\" indices makes it easy to reindex using an external list, as you don't have to worry about\n","sanitizing the input."]}]}