Skip to content

Instantly share code, notes, and snippets.

@FavioVazquez
Created February 7, 2019 17:20
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save FavioVazquez/07b688e1285cc852f6fb45ecb67034d8 to your computer and use it in GitHub Desktop.
Save FavioVazquez/07b688e1285cc852f6fb45ecb67034d8 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to Optimus"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From command line:\n",
"\n",
"```\n",
"pip install optimuspyspark\n",
"```\n",
"\n",
"from a notebook you can use:\n",
"\n",
"```\n",
"!pip install optimuspyspark\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A good thing to do here is to restart the kernel. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Checking the installation\n",
"from optimus import Optimus\n",
"op= Optimus(master=\"local\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## DataFrame creation\n",
"\n",
"Create a dataframe to passing a list of values for columns and rows. Unlike pandas you need to specify the column names."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"df = op.create.df(\n",
" [\n",
" \"names\",\n",
" \"height(ft)\",\n",
" \"function\",\n",
" \"rank\",\n",
" \"weight(t)\",\n",
" \"japanese name\",\n",
" \"last position\",\n",
" \"attributes\"\n",
" ],\n",
" [\n",
" \n",
" (\"Optim'us\", 28.0, \"Leader\", 10, 4.3, [\"Inochi\", \"Convoy\"], \"19.442735,-99.201111\",[8.5344, 4300.0]),\n",
" (\"bumbl#ebéé \", 17.5, \"Espionage\", 7, 2.0, [\"Bumble\",\"Goldback\"], \"10.642707,-71.612534\",[5.334, 2000.0]),\n",
" (\"ironhide&\", 26.0, \"Security\", 7, 4.0, [\"Roadbuster\"], \"37.789563,-122.400356\",[7.9248, 4000.0]),\n",
" (\"Jazz\",13.0, \"First Lieutenant\", 8, 1.8, [\"Meister\"], \"33.670666,-117.841553\",[3.9624, 1800.0]),\n",
" (\"Megatron\",None, \"None\", None, 5.7, [\"Megatron\"], None,[None,5700.0]),\n",
" (\"Metroplex_)^$\",300.0 , \"Battle Station\", 8, None, [\"Metroflex\"],None,[91.44, None]),\n",
" \n",
" ])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------------+----------+----------------+----+---------+------------------+--------------------+----------------+\n",
"| names|height(ft)| function|rank|weight(t)| japanese name| last position| attributes|\n",
"+-------------+----------+----------------+----+---------+------------------+--------------------+----------------+\n",
"| Optim'us| 28.0| Leader| 10| 4.3| [Inochi, Convoy]|19.442735,-99.201111|[8.5344, 4300.0]|\n",
"| bumbl#ebéé | 17.5| Espionage| 7| 2.0|[Bumble, Goldback]|10.642707,-71.612534| [5.334, 2000.0]|\n",
"| ironhide&| 26.0| Security| 7| 4.0| [Roadbuster]|37.789563,-122.40...|[7.9248, 4000.0]|\n",
"| Jazz| 13.0|First Lieutenant| 8| 1.8| [Meister]|33.670666,-117.84...|[3.9624, 1800.0]|\n",
"| Megatron| null| None|null| 5.7| [Megatron]| null| [, 5700.0]|\n",
"|Metroplex_)^$| 300.0| Battle Station| 8| null| [Metroflex]| null| [91.44,]|\n",
"+-------------+----------+----------------+----+---------+------------------+--------------------+----------------+\n",
"\n"
]
}
],
"source": [
"# View the data with plain spark\n",
"df.show() # kinda ugly"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Creating a dataframe by passing a list of tuples specifyng the column data type. You can specify as data type an string or a Spark Datatypes. https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/package-summary.html\n",
"\n",
"Also you can use some Optimus predefined types:\n",
"\n",
"- \"str\" = StringType()\n",
"- \"int\" = IntegerType()\n",
"- \"float\" = FloatType()\n",
"- \"bool\" = BoleanType()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------------+----------+----------------+----+---------+------------------+--------------------+----------------+\n",
"| names|height(ft)| function|rank|weight(t)| japanese name| last position| attributes|\n",
"+-------------+----------+----------------+----+---------+------------------+--------------------+----------------+\n",
"| Optim'us| 28.0| Leader| 10| 4.3| [Inochi, Convoy]|19.442735,-99.201111|[8.5344, 4300.0]|\n",
"| bumbl#ebéé | 17.5| Espionage| 7| 2.0|[Bumble, Goldback]|10.642707,-71.612534| [5.334, 2000.0]|\n",
"| ironhide&| 26.0| Security| 7| 4.0| [Roadbuster]|37.789563,-122.40...|[7.9248, 4000.0]|\n",
"| Jazz| 13.0|First Lieutenant| 8| 1.8| [Meister]|33.670666,-117.84...|[3.9624, 1800.0]|\n",
"| Megatron| null| None|null| 5.7| [Megatron]| null| [, 5700.0]|\n",
"|Metroplex_)^$| 300.0| Battle Station| 8| null| [Metroflex]| null| [91.44,]|\n",
"+-------------+----------+----------------+----+---------+------------------+--------------------+----------------+\n",
"\n"
]
}
],
"source": [
"f = op.create.df(\n",
" [\n",
" (\"names\", \"str\"),\n",
" (\"height\", \"float\"),\n",
" (\"function\", \"str\"),\n",
" (\"rank\", \"int\"),\n",
" ],\n",
" [\n",
" (\"bumbl#ebéé \", 17.5, \"Espionage\", 7),\n",
" (\"Optim'us\", 28.0, \"Leader\", 10),\n",
" (\"ironhide&\", 26.0, \"Security\", 7),\n",
" (\"Jazz\",13.0, \"First Lieutenant\", 8),\n",
" (\"Megatron\",None, \"None\", None),\n",
" \n",
" ])\n",
"df.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating a Daframe using a pandas dataframe"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>\n",
" .data_type {\n",
" font-size: 0.8em;\n",
" font-weight: normal;\n",
" }\n",
"\n",
" .column_name {\n",
" font-size: 1.2em;\n",
" }\n",
"\n",
" .info_items {\n",
" margin: 10px 0;\n",
" font-size: 0.8em;\n",
" }\n",
"\n",
" .optimus_table tr:nth-child(even) {\n",
" background-color: #f2f2f2 !important;\n",
" }\n",
"\n",
" .optimus_table tr:nth-child(odd) {\n",
" background-color: #ffffff !important;\n",
" }\n",
"\n",
" .optimus_table thead {\n",
" border-bottom: 1px solid black;\n",
" }\n",
" .optimus_table{\n",
" font-size: 12px;\n",
" }\n",
"\n",
"</style>\n",
"\n",
"\n",
"\n",
"\n",
"<div class=\"info_items\">Viewing 3 of 3 rows / 4 columns</div>\n",
"<div class=\"info_items\">1 partition(s)</div>\n",
"\n",
"<table class=\"optimus_table\">\n",
" <thead>\n",
" <tr>\n",
" \n",
" <th>\n",
" <div class=\"column_name\">names</div>\n",
" <div class=\"data_type\">1 (string)</div>\n",
" <div class=\"data_type\">\n",
" \n",
" nullable\n",
" \n",
" </div>\n",
" </th>\n",
" \n",
" <th>\n",
" <div class=\"column_name\">height</div>\n",
" <div class=\"data_type\">2 (double)</div>\n",
" <div class=\"data_type\">\n",
" \n",
" nullable\n",
" \n",
" </div>\n",
" </th>\n",
" \n",
" <th>\n",
" <div class=\"column_name\">function</div>\n",
" <div class=\"data_type\">3 (string)</div>\n",
" <div class=\"data_type\">\n",
" \n",
" nullable\n",
" \n",
" </div>\n",
" </th>\n",
" \n",
" <th>\n",
" <div class=\"column_name\">rank</div>\n",
" <div class=\"data_type\">4 (bigint)</div>\n",
" <div class=\"data_type\">\n",
" \n",
" nullable\n",
" \n",
" </div>\n",
" </th>\n",
" \n",
" </tr>\n",
"\n",
" </thead>\n",
" <tbody>\n",
" \n",
" <tr>\n",
" \n",
" <td>\n",
" bumbl#ebéé⸱⸱\n",
" </td>\n",
" \n",
" <td>\n",
" Espionage\n",
" </td>\n",
" \n",
" <td>\n",
" 17.5\n",
" </td>\n",
" \n",
" <td>\n",
" 7\n",
" </td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" \n",
" <td>\n",
" Optim&#39;us\n",
" </td>\n",
" \n",
" <td>\n",
" Leader\n",
" </td>\n",
" \n",
" <td>\n",
" 28.0\n",
" </td>\n",
" \n",
" <td>\n",
" 10\n",
" </td>\n",
" \n",
" </tr>\n",
" \n",
" <tr>\n",
" \n",
" <td>\n",
" ironhide&amp;\n",
" </td>\n",
" \n",
" <td>\n",
" Security\n",
" </td>\n",
" \n",
" <td>\n",
" 26.0\n",
" </td>\n",
" \n",
" <td>\n",
" 7\n",
" </td>\n",
" \n",
" </tr>\n",
" \n",
" </tbody>\n",
"</table>\n",
"\n",
"<div class=\"info_items\">Viewing 3 of 3 rows / 4 columns</div>\n",
"<div class=\"info_items\">1 partition(s)</div>\n"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"data = [(\"bumbl#ebéé \", 17.5, \"Espionage\", 7),\n",
" (\"Optim'us\", 28.0, \"Leader\", 10),\n",
" (\"ironhide&\", 26.0, \"Security\", 7)]\n",
"labels = [\"names\", \"height\", \"function\", \"rank\"]\n",
"\n",
"# Create pandas dataframe\n",
"pdf = pd.DataFrame.from_records(data, columns=labels)\n",
"\n",
"df = op.create.df(pdf = pdf)\n",
"df.table()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Columns and Rows\n",
"\n",
"Optimus organized operations in columns and rows. This is a little different of how pandas works in which all operations are around the pandas class. We think this approach can better help you to access and transform data."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------+------+------------+----+\n",
"| function|height| names|rank|\n",
"+---------+------+------------+----+\n",
"|Espionage| 17.5|bumbl#ebéé | 7|\n",
"| Leader| 28.0| Optim'us| 10|\n",
"| Security| 26.0| ironhide&| 7|\n",
"+---------+------+------------+----+\n",
"\n"
]
}
],
"source": [
"df.cols.sort().show()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------+------+---------+----+\n",
"| names|height| function|rank|\n",
"+------------+------+---------+----+\n",
"| Optim'us| 28.0| Leader| 10|\n",
"|bumbl#ebéé | 17.5|Espionage| 7|\n",
"| ironhide&| 26.0| Security| 7|\n",
"+------------+------+---------+----+\n",
"\n"
]
}
],
"source": [
"df.rows.sort(\"rank\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Selection\n",
"\n",
"Unlike Pandas, Spark DataFrames don't support random row access. So methods like loc in pandas are not available.\n",
"\n",
"Also Pandas don't handle indexes. So methods like iloc are not available."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------+\n",
"| names|\n",
"+------------+\n",
"|bumbl#ebéé |\n",
"| Optim'us|\n",
"| ironhide&|\n",
"+------------+\n",
"\n"
]
}
],
"source": [
"# Select an show an specific column\n",
"df.cols.select(\"names\").show()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------+\n",
"| names|\n",
"+------------+\n",
"|bumbl#ebéé |\n",
"| Optim'us|\n",
"| ironhide&|\n",
"+------------+\n",
"\n"
]
}
],
"source": [
"# This works with plain Spark too\n",
"df.select(\"names\").show()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+------+--------+----+\n",
"| names|height|function|rank|\n",
"+--------+------+--------+----+\n",
"|Optim'us| 28.0| Leader| 10|\n",
"+--------+------+--------+----+\n",
"\n"
]
}
],
"source": [
"# Select rows from a DataFrame where a the condition is meet\n",
"df.rows.select(df[\"rank\"]>7).show()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+------+--------+----+\n",
"| names|height|function|rank|\n",
"+--------+------+--------+----+\n",
"|Optim'us| 28.0| Leader| 10|\n",
"+--------+------+--------+----+\n",
"\n"
]
}
],
"source": [
"# This is something like a filter in Spark\n",
"df.filter(df[\"rank\"]>7).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This may seem weird, why we are overriding some functions, but it was needed to add more functionalities that Spark won't handle."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------+------+---------+----+---+\n",
"| names|height| function|rank| id|\n",
"+------------+------+---------+----+---+\n",
"|bumbl#ebéé | 17.5|Espionage| 7| 0|\n",
"| Optim'us| 28.0| Leader| 10| 1|\n",
"| ironhide&| 26.0| Security| 7| 2|\n",
"+------------+------+---------+----+---+\n",
"\n"
]
}
],
"source": [
"# Creating an id for a DF\n",
"df.create_id().show()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------+------+---------+----+-------+\n",
"| names|height| function|rank|new_col|\n",
"+------------+------+---------+----+-------+\n",
"|bumbl#ebéé | 17.5|Espionage| 7| 1|\n",
"| Optim'us| 28.0| Leader| 10| 1|\n",
"| ironhide&| 26.0| Security| 7| 1|\n",
"+------------+------+---------+----+-------+\n",
"\n"
]
}
],
"source": [
"# Creating new columns\n",
"\n",
"df.cols.append(\"new_col\",1).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In plain Spark to do the same you will need to do:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------+------+---------+----+-------+\n",
"| names|height| function|rank|new_col|\n",
"+------------+------+---------+----+-------+\n",
"|bumbl#ebéé | 17.5|Espionage| 7| 1|\n",
"| Optim'us| 28.0| Leader| 10| 1|\n",
"| ironhide&| 26.0| Security| 7| 1|\n",
"+------------+------+---------+----+-------+\n",
"\n"
]
}
],
"source": [
"from pyspark.sql.functions import lit\n",
"df.withColumn(\"new_col\", lit(1)).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Which is not straightforward at all. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating a simple plot"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA1QAAAEHCAYAAACp5ActAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4xLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvDW2N/gAAFTVJREFUeJzt3X+w5WddH/D3h6zR2gARd0FINmzEUE2ZKrimoFZiodMEJctMqU0K1bSMGaEwiiiNYEOK1eGHA4Ux/oiV4YdICFTpti4NtcJQLMFswACbkLINCdkQSAiEkIkQVz7945wNJzd37948Obvn3OzrNXPnnu/zfc55PufMM+ee9/0+3++p7g4AAAD330MWXQAAAMBGJVABAAAMEqgAAAAGCVQAAACDBCoAAIBBAhUAAMAggQqAe1TVnqo6fdF1HClVta2quqo2Ddz3pKq6s6qOOZzjALDcBCqAo0RVXV9VT1/Rdm5VfejAdnf//e7+wCEeZ0OHg+lzfvMDfZzu/mx3H9fdfzuHmi6sqj9c0faBoyncAmxUAhUAS2WjBjUAjk4CFQD3mD2KVVWnVdXuqrqjqr5QVa+bdvvg9Pft0yVvT6mqh1TVr1bVDVV1S1W9taoePvO4Pz3dd1tV/fsV41xYVe+uqj+sqjuSnDsd+8NVdXtV3VxVv1VVx848XlfVC6rq01X11ar6tap6XFX9n2m9l872X4fnVNVnq+qLVfXymXEeUlXnV9X/m9Z+aVU9YrrvXkfqqurkqvrgtJ4/q6qLVh51Wm2cqjojycuS/Ivp63nV/agbgAUTqAA4mDckeUN3PyzJ45JcOm3/senv46dL3j6c5Nzpz48n+e4kxyX5rSSpqlOT/HaS5yR5dJKHJzlhxVg7krw7yfFJ3p7kb5O8OMnmJE9J8rQkL1hxn3+a5AeTPDnJS5NcnOS5SbYmeUKSc1Z7Ut395u4+d0Xzjyb5e9NxLqiq75u2vyjJs5I8Ncljknw5yUWrPW6SP0ryl0m+M8mFSf7VKn3uM053/48kv5HkndPX8/undZ5+qOWXACyeQAVwdHnP9KjP7VV1eyZB52D+Jsn3VNXm7r6zuy9fo+9zkryuu6/r7juT/EqSs6dHb56d5L9194e6++4kFyTpFff/cHe/p7u/0d1/3d1Xdvfl3b2/u69P8nuZhJpZr+nuO7p7T5JPJnnfdPyvJHlvkieu7yVJkvyH6bhXJbkqyfdP238uycu7e193fz2ToPTslcsSq+qkJD+U5ILuvru7P5Rk5/0YB4ANSqACOLo8q7uPP/CT+x71mfW8JI9P8qmquqKqfnKNvo9JcsPM9g1JNiV51HTfjQd2dPddSW5bcf8bZzeq6vFV9d+r6vPTZYC/kcnRqllfmLn916tsH7dGvSt9fub2XTP3fWySP5kJoNdkcvTsUSvu/5gkX5o+t1Wf0yHGAWCDEqgAWFV3f7q7z0nyyCSvTvLuqvq7ue/RpST5XCbh44CTkuzPJOTcnOTEAzuq6u9ksizuXsOt2P6dJJ9Kcsp0yeHLktT4sxl2Y5IzZ0Nod39bd9+0ot/NSR5RVd8+07b1foyz2msKwAYgUAGwqqp6blVt6e5vJLl92vyNJLdOf3/3TPd3JHnx9MIMx+Wb5wTtz+TcqGdW1Q9PLxRxYQ4djh6a5I4kd1bV9yZ5/rye1/30u0l+vaoemyRVtaWqdqzs1N03JNmd5MKqOraqnpLkmfdjnC8k2VZV/i4DbDDeuAE4mDOS7KmqOzO5QMXZ0/N/7kry60n+YroU7slJ3pTkbZlcAfAzSb6WyQUdMj3H6UVJLsnkSM6dSW5J8vU1xv6lJP8yyVeT/H6Sd87/6a3LGzI5F+p9VfXVJJcn+YcH6fucTC6gcVuS/5hJzWs9x1nvmv6+rao+Ol4uAEdadVtlAMCRMz2CdXsmy/k+s+h6DpeqemeST3X3KxZdCwCHjyNUABx2VfXMqvr26TlYv5nkE0muX2xV81VVPzT9LqyHTL9bakeS9yy6LgAOL4EKgCNhRyYXrvhcklMyWT74YFsi8V1JPpDJksY3Jnl+d39soRUBcNhZ8gcAADDIESoAAIBBAhUAAMCgTYsaePPmzb1t27ZFDQ8AAHBQV1555Re7e8uh+i0sUG3bti27d+9e1PAAAAAHVVU3rKefJX8AAACDBCoAAIBBAhUAAMAggQoAAGDQIQNVVb2pqm6pqk8eZH9V1Ruram9VfbyqnjT/MgEAAJbPeo5QvTnJGWvsPzPJKdOf85L8zgMvCwAAYPkdMlB19weTfGmNLjuSvLUnLk9yfFU9el4FAgAALKt5nEN1QpIbZ7b3TdsAAAAe1I7oF/tW1XmZLAvMSSeddCSHXpdt5//poktYGte/6icWXQIAsEH5TPVNa32m8jp900b+7DmPI1Q3Jdk6s33itO0+uvvi7t7e3du3bNkyh6EBAAAWZx6BameSn55e7e/JSb7S3TfP4XEBAACW2iGX/FXVO5KcnmRzVe1L8ook35Ik3f27SXYleUaSvUnuSvKvD1exAAAAy+SQgaq7zznE/k7yb+dWEQAAwAYxjyV/AAAARyWBCgAAYJBABQAAMEigAgAAGCRQAQAADBKoAAAABglUAAAAgwQqAACAQQIVAADAIIEKAABgkEAFAAAwSKACAAAYJFABAAAMEqgAAAAGCVQAAACDBCoAAIBBAhUAAMAggQoAAGCQQAUAADBIoAIAABgkUAEAAAwSqAAAAAYJVAAAAIMEKgAAgEECFQAAwCCBCgAAYJBABQAAMEigAgAAGCRQAQAADBKoAAAABglUAAAAgwQqAACAQQIVAADAoHUFqqo6o6quraq9VXX+KvtPqqr3V9XHqurjVfWM+ZcKAACwXA4ZqKrqmCQXJTkzyalJzqmqU1d0+9Ukl3b3E5OcneS3510oAADAslnPEarTkuzt7uu6++4klyTZsaJPJ3nY9PbDk3xufiUCAAAsp/UEqhOS3DizvW/aNuvCJM+tqn1JdiV50WoPVFXnVdXuqtp96623DpQLAACwPOZ1UYpzkry5u09M8owkb6uq+zx2d1/c3du7e/uWLVvmNDQAAMBirCdQ3ZRk68z2idO2Wc9LcmmSdPeHk3xbks3zKBAAAGBZrSdQXZHklKo6uaqOzeSiEztX9PlskqclSVV9XyaBypo+AADgQe2Qgaq79yd5YZLLklyTydX89lTVK6vqrGm3lyT52aq6Ksk7kpzb3X24igYAAFgGm9bTqbt3ZXKxidm2C2ZuX53kR+ZbGgAAwHKb10UpAAAAjjoCFQAAwCCBCgAAYJBABQAAMEigAgAAGCRQAQAADBKoAAAABglUAAAAgwQqAACAQQIVAADAIIEKAABgkEAFAAAwSKACAAAYJFABAAAMEqgAAAAGCVQAAACDBCoAAIBBAhUAAMAggQoAAGCQQAUAADBIoAIAABgkUAEAAAwSqAAAAAYJVAAAAIMEKgAAgEECFQAAwCCBCgAAYJBABQAAMEigAgAAGCRQAQAADBKoAAAABglUAAAAg9YVqKrqjKq6tqr2VtX5B+nzU1V1dVXtqao/mm+ZAAAAy2fToTpU1TFJLkryT5LsS3JFVe3s7qtn+pyS5FeS/Eh3f7mqHnm4CgYAAFgW6zlCdVqSvd19XXffneSSJDtW9PnZJBd195eTpLtvmW+ZAAAAy2c9geqEJDfObO+bts16fJLHV9VfVNXlVXXGag9UVedV1e6q2n3rrbeOVQwAALAk5nVRik1JTklyepJzkvx+VR2/slN3X9zd27t7+5YtW+Y0NAAAwGKsJ1DdlGTrzPaJ07ZZ+5Ls7O6/6e7PJPm/mQQsAACAB631BKorkpxSVSdX1bFJzk6yc0Wf92RydCpVtTmTJYDXzbFOAACApXPIQNXd+5O8MMllSa5Jcml376mqV1bVWdNulyW5raquTvL+JL/c3bcdrqIBAACWwSEvm54k3b0rya4VbRfM3O4kvzj9AQAAOCrM66IUAAAARx2BCgAAYJBABQAAMEigAgAAGCRQAQAADBKoAAAABglUAAAAgwQqAACAQQIVAADAIIEKAABgkEAFAAAwSKACAAAYJFABAAAMEqgAAAAGCVQAAACDBCoAAIBBAhUAAMAggQoAAGCQQAUAADBIoAIAABgkUAEAAAwSqAAAAAYJVAAAAIMEKgAAgEECFQAAwCCBCgAAYJBABQAAMEigAgAAGCRQAQAADBKoAAAABglUAAAAgwQqAACAQesKVFV1RlVdW1V7q+r8Nfr9s6rqqto+vxIBAACW0yEDVVUdk+SiJGcmOTXJOVV16ir9Hprk55N8ZN5FAgAALKP1HKE6Lcne7r6uu+9OckmSHav0+7Ukr07ytTnWBwAAsLTWE6hOSHLjzPa+ads9qupJSbZ295/OsTYAAICl9oAvSlFVD0nyuiQvWUff86pqd1XtvvXWWx/o0AAAAAu1nkB1U5KtM9snTtsOeGiSJyT5QFVdn+TJSXaudmGK7r64u7d39/YtW7aMVw0AALAE1hOorkhySlWdXFXHJjk7yc4DO7v7K929ubu3dfe2JJcnOau7dx+WigEAAJbEIQNVd+9P8sIklyW5Jsml3b2nql5ZVWcd7gIBAACW1ab1dOruXUl2rWi74CB9T3/gZQEAACy/B3xRCgAAgKOVQAUAADBIoAIAABgkUAEAAAwSqAAAAAYJVAAAAIMEKgAAgEECFQAAwCCBCgAAYJBABQAAMEigAgAAGCRQAQAADBKoAAAABglUAAAAgwQqAACAQQIVAADAIIEKAABgkEAFAAAwSKACAAAYJFABAAAMEqgAAAAGCVQAAACDBCoAAIBBAhUAAMAggQoAAGCQQAUAADBIoAIAABgkUAEAAAwSqAAAAAYJVAAAAIMEKgAAgEECFQAAwKB1BaqqOqOqrq2qvVV1/ir7f7Gqrq6qj1fV/6qqx86/VAAAgOVyyEBVVcckuSjJmUlOTXJOVZ26otvHkmzv7n+Q5N1JXjPvQgEAAJbNeo5QnZZkb3df1913J7kkyY7ZDt39/u6+a7p5eZIT51smAADA8llPoDohyY0z2/umbQfzvCTvfSBFAQAAbASb5vlgVfXcJNuTPPUg+89Lcl6SnHTSSfMcGgAA4IhbzxGqm5Jsndk+cdp2L1X19CQvT3JWd399tQfq7ou7e3t3b9+yZctIvQAAAEtjPYHqiiSnVNXJVXVskrOT7JztUFVPTPJ7mYSpW+ZfJgAAwPI5ZKDq7v1JXpjksiTXJLm0u/dU1Sur6qxpt9cmOS7Ju6rqr6pq50EeDgAA4EFjXedQdfeuJLtWtF0wc/vpc64LAABg6a3ri30BAAC4L4EKAABgkEAFAAAwSKACAAAYJFABAAAMEqgAAAAGCVQAAACDBCoAAIBBAhUAAMAggQoAAGCQQAUAADBIoAIAABgkUAEAAAwSqAAAAAYJVAAAAIMEKgAAgEECFQAAwCCBCgAAYJBABQAAMEigAgAAGCRQAQAADBKoAAAABglUAAAAgwQqAACAQQIVAADAIIEKAABgkEAFAAAwSKACAAAYJFABAAAMEqgAAAAGCVQAAACDBCoAAIBB6wpUVXVGVV1bVXur6vxV9n9rVb1zuv8jVbVt3oUCAAAsm0MGqqo6JslFSc5McmqSc6rq1BXdnpfky939PUlen+TV8y4UAABg2aznCNVpSfZ293XdfXeSS5LsWNFnR5K3TG+/O8nTqqrmVyYAAMDyWU+gOiHJjTPb+6Ztq/bp7v1JvpLkO+dRIAAAwLLadCQHq6rzkpw33byzqq49kuNvAJuTfHHRRSRJWbS5kSzNvGFDMW8YYd4w4qieNz5Trc8qr9MyzJvHrqfTegLVTUm2zmyfOG1brc++qtqU5OFJblv5QN19cZKL11PY0aiqdnf39kXXwcZi3jDCvGGEecMI84YRG2nerGfJ3xVJTqmqk6vq2CRnJ9m5os/OJD8zvf3sJH/e3T2/MgEAAJbPIY9Qdff+qnphksuSHJPkTd29p6pemWR3d+9M8gdJ3lZVe5N8KZPQBQAA8KC2rnOountXkl0r2i6Yuf21JP98vqUdlSyHZIR5wwjzhhHmDSPMG0ZsmHlTVuYBAACMWc85VAAAAKxCoFqQqnpTVd1SVZ+cafuBqrq8qv6qqnZX1WmLrJHlUlVbq+r9VXV1Ve2pqp+ftj+iqv5nVX16+vs7Fl0ry2ONefPaqvpUVX28qv6kqo5fdK0sj4PNm5n9L6mqrqrNi6qR5bPWvKmqF03fc/ZU1WsWWSfLZY2/Uxvmc7ElfwtSVT+W5M4kb+3uJ0zb3pfk9d393qp6RpKXdvfpCyyTJVJVj07y6O7+aFU9NMmVSZ6V5NwkX+ruV1XV+Um+o7v/3QJLZYmsMW9OzOSKrPurJt/+Yd5wwMHmTXdfXVVbk/znJN+b5Ae7e9HfE8OSWOP95lFJXp7kJ7r761X1yO6+ZZG1sjzWmDf/KRvkc7EjVAvS3R/M5IqI92pO8rDp7Ycn+dwRLYql1t03d/dHp7e/muSaJCck2ZHkLdNub8nkTQiSHHzedPf7unv/tNvlmQQsSLLm+02SvD7JSzP5mwX3WGPePD/Jq7r769N9whT3WGPebJjPxeu6yh9HzC8kuayqfjOTsPvDC66HJVVV25I8MclHkjyqu2+e7vp8Jv8JhPtYMW9m/Zsk7zzS9bAxzM6bqtqR5KbuvqqqFloXy23F+81rk/yjqvr1JF9L8kvdfcXiqmNZrZg3G+ZzsSNUy+X5SV7c3VuTvDiT7/eCe6mq45L8lyS/0N13zO6bfqG2/xpzHwebN1X18iT7k7x9UbWxvGbnTSbz5GVJLljzThz1Vnm/2ZTkEUmenOSXk1xaEjkrrDJvNsznYoFqufxMkj+e3n5XkqU9+Y7FqKpvyeTN5u3dfWCufGG6/vjAOmRLKbiXg8ybVNW5SX4yyXPaCbWssMq8eVySk5NcVVXXZ7JM9KNV9V2Lq5Jlc5D3m31J/rgn/jLJN5K4oAn3OMi82TCfiwWq5fK5JE+d3v7HST69wFpYMtP/5v1Bkmu6+3Uzu3Zm8qaT6e//eqRrY3kdbN5U1RmZnAdzVnfftaj6WE6rzZvu/kR3P7K7t3X3tkw+JD+puz+/wFJZImv8nXpPkh+f9nl8kmOTuJgJSdacNxvmc7Gr/C1IVb0jyemZ/IfmC0lekeTaJG/I5ND415K8oLuvXFSNLJeq+tEk/zvJJzL5714yWX7zkSSXJjkpyQ1Jfqq7V17whKPUGvPmjUm+Nclt07bLu/vnjnyFLKODzZvu3jXT5/ok213ljwPWeL/5syRvSvIDSe7O5ByqP19IkSydNebNHdkgn4sFKgAAgEGW/AEAAAwSqAAAAAYJVAAAAIMEKgAAgEECFQAAwCCBCgAAYJBABQAAMEigAgAAGPT/AcRQsAOVXnEbAAAAAElFTkSuQmCC\n",
"text/plain": [
"<Figure size 864x360 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"df.plot.hist(\"height\",10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Yes now you can plot spark DF that easy. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading external data"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---+--------------------+--------------------+---------+----------+-----+----------+--------+\n",
"| id| firstName| lastName|billingId| product|price| birth|dummyCol|\n",
"+---+--------------------+--------------------+---------+----------+-----+----------+--------+\n",
"| 1| Luis| Alvarez$$%!| 123| Cake| 10|1980/07/07| never|\n",
"| 2| André| Ampère| 423| piza| 8|1950/07/08| gonna|\n",
"| 3| NiELS| Böhr//((%%| 551| pizza| 8|1990/07/09| give|\n",
"| 4| PAUL| dirac$| 521| pizza| 8|1954/07/10| you|\n",
"| 5| Albert| Einstein| 634| pizza| 8|1990/07/11| up|\n",
"| 6| Galileo| GALiLEI| 672| arepa| 5|1930/08/12| never|\n",
"| 7| CaRL| Ga%%%uss| 323| taco| 3|1970/07/13| gonna|\n",
"| 8| David| H$$$ilbert| 624| taaaccoo| 3|1950/07/14| let|\n",
"| 9| Johannes| KEPLER| 735| taco| 3|1920/04/22| you|\n",
"| 10| JaMES| M$$ax%%well| 875| taco| 3|1923/03/12| down|\n",
"| 11| Isaac| Newton| 992| pasta| 9|1999/02/15| never |\n",
"| 12| Emmy%%| Nöether$| 234| pasta| 9|1993/12/08| gonna|\n",
"| 13| Max!!!| Planck!!!| 111|hamburguer| 4|1994/01/04| run |\n",
"| 14| Fred| Hoy&&&le| 553| pizzza| 8|1997/06/27| around|\n",
"| 15|((( Heinrich )))))| Hertz| 116| pizza| 8|1956/11/30| and|\n",
"| 16| William| Gilbert###| 886| BEER| 2|1958/03/26| desert|\n",
"| 17| Marie| CURIE| 912| Rice| 1|2000/03/22| you|\n",
"| 18| Arthur| COM%%%pton| 812| 110790| 5|1899/01/01| #|\n",
"| 19| JAMES| Chadwick| 467| null| 10|1921/05/03| #|\n",
"+---+--------------------+--------------------+---------+----------+-----+----------+--------+\n",
"\n"
]
}
],
"source": [
"df_csv = op.load.csv(\"foo.csv\", header=True)\n",
"df_csv.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading data from the Web"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---+--------------------+--------------------+---------+----------+-----+----------+--------+\n",
"| id| firstName| lastName|billingId| product|price| birth|dummyCol|\n",
"+---+--------------------+--------------------+---------+----------+-----+----------+--------+\n",
"| 1| Luis| Alvarez$$%!| 123| Cake| 10|1980/07/07| never|\n",
"| 2| André| Ampère| 423| piza| 8|1950/07/08| gonna|\n",
"| 3| NiELS| Böhr//((%%| 551| pizza| 8|1990/07/09| give|\n",
"| 4| PAUL| dirac$| 521| pizza| 8|1954/07/10| you|\n",
"| 5| Albert| Einstein| 634| pizza| 8|1990/07/11| up|\n",
"| 6| Galileo| GALiLEI| 672| arepa| 5|1930/08/12| never|\n",
"| 7| CaRL| Ga%%%uss| 323| taco| 3|1970/07/13| gonna|\n",
"| 8| David| H$$$ilbert| 624| taaaccoo| 3|1950/07/14| let|\n",
"| 9| Johannes| KEPLER| 735| taco| 3|1920/04/22| you|\n",
"| 10| JaMES| M$$ax%%well| 875| taco| 3|1923/03/12| down|\n",
"| 11| Isaac| Newton| 992| pasta| 9|1999/02/15| never |\n",
"| 12| Emmy%%| Nöether$| 234| pasta| 9|1993/12/08| gonna|\n",
"| 13| Max!!!| Planck!!!| 111|hamburguer| 4|1994/01/04| run |\n",
"| 14| Fred| Hoy&&&le| 553| pizzza| 8|1997/06/27| around|\n",
"| 15|((( Heinrich )))))| Hertz| 116| pizza| 8|1956/11/30| and|\n",
"| 16| William| Gilbert###| 886| BEER| 2|1958/03/26| desert|\n",
"| 17| Marie| CURIE| 912| Rice| 1|2000/03/22| you|\n",
"| 18| Arthur| COM%%%pton| 812| 110790| 5|1899/01/01| #|\n",
"| 19| JAMES| Chadwick| 467| null| 10|1921/05/03| #|\n",
"+---+--------------------+--------------------+---------+----------+-----+----------+--------+\n",
"\n"
]
}
],
"source": [
"df_url =op.load.url(\"https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/data/foo.csv\")\n",
"df_url.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Joins, merge, concat, etc."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optimus provides and intuitive way to concat Dataframes by columns or rows. Operations like `join` and `group` are handle using Spark directly"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------+------+---------+----+\n",
"| names|height| function|rank|\n",
"+------------+------+---------+----+\n",
"|bumbl#ebéé | 17.5|Espionage| 7|\n",
"| Optim'us| 28.0| Leader| 10|\n",
"| ironhide&| 26.0| Security| 7|\n",
"+------------+------+---------+----+\n",
"\n"
]
}
],
"source": [
"# We have this dataframe\n",
"df.show()"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"# Let's add a new column\n",
"df_new = op.create.df(\n",
" [\n",
" \"try\"\n",
" ],\n",
" [\n",
" (\"one\"),\n",
" (\"two\"),\n",
" (\"three\") \n",
" ])\n",
"\n",
"df_new = op.append([df,df_new], \"columns\")"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------+------+---------+----+-----+\n",
"| names|height| function|rank| try|\n",
"+------------+------+---------+----+-----+\n",
"|bumbl#ebéé | 17.5|Espionage| 7| one|\n",
"| Optim'us| 28.0| Leader| 10| two|\n",
"| ironhide&| 26.0| Security| 7|three|\n",
"+------------+------+---------+----+-----+\n",
"\n"
]
}
],
"source": [
"df_new.show()"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"root\n",
" |-- names: string (nullable = true)\n",
" |-- height: double (nullable = true)\n",
" |-- function: string (nullable = true)\n",
" |-- rank: long (nullable = true)\n",
" |-- try: string (nullable = true)\n",
"\n"
]
}
],
"source": [
"df_new.printSchema()"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------------+------+---------+----+-------+\n",
"| names|height| function|rank| try|\n",
"+------------+------+---------+----+-------+\n",
"|bumbl#ebéé | 17.5|Espionage| 7| one|\n",
"| Optim'us| 28.0| Leader| 10| two|\n",
"| ironhide&| 26.0| Security| 7| three|\n",
"| favio| 10.0| Leader| 1|new_row|\n",
"+------------+------+---------+----+-------+\n",
"\n"
]
}
],
"source": [
"# Let's add a new row\n",
"\n",
"new_row = [(\"favio\", 10, \"Leader\", 1, \"new_row\")]\n",
"df_new = df_new.rows.append(new_row).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Doing fun stuff with columns and rows"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"# Let's create a dataframe :)\n",
"\n",
"\n",
"from pyspark.sql.types import StructType, StructField, StringType, BooleanType, IntegerType, ArrayType\n",
"\n",
"df = op.create.df(\n",
" [\n",
" (\"words\", \"str\", True),\n",
" (\"num\", \"int\", True),\n",
" (\"animals\", \"str\", True),\n",
" (\"thing\", StringType(), True),\n",
" (\"two strings\", StringType(), True),\n",
" (\"filter\", StringType(), True),\n",
" (\"num 2\", \"string\", True),\n",
" (\"col_array\", ArrayType(StringType()), True),\n",
" (\"col_int\", ArrayType(IntegerType()), True)\n",
"\n",
" ]\n",
",\n",
"[\n",
" (\" I like fish \", 1, \"dog\", \"housé\", \"cat-car\", \"a\",\"1\",[\"baby\", \"sorry\"],[1,2,3]),\n",
" (\" zombies\", 2, \"cat\", \"tv\", \"dog-tv\", \"b\",\"2\",[\"baby 1\", \"sorry 1\"],[3,4]),\n",
" (\"simpsons cat lady\", 2, \"frog\", \"table\",\"eagle-tv-plus\",\"1\",\"3\", [\"baby 2\", \"sorry 2\"], [5,6,7]),\n",
" (None, 3, \"eagle\", \"glass\", \"lion-pc\", \"c\",\"4\", [\"baby 3\", \"sorry 3\"] ,[7,8])\n",
" ])"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
"| words|num|animals|thing| two strings|filter|num 2| col_array| col_int|\n",
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
"| I like fish | 1| dog|housé| cat-car| a| 1| [baby, sorry]|[1, 2, 3]|\n",
"| zombies| 2| cat| tv| dog-tv| b| 2|[baby 1, sorry 1]| [3, 4]|\n",
"|simpsons cat lady| 2| frog|table|eagle-tv-plus| 1| 3|[baby 2, sorry 2]|[5, 6, 7]|\n",
"| null| 3| eagle|glass| lion-pc| c| 4|[baby 3, sorry 3]| [7, 8]|\n",
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
"\n"
]
}
],
"source": [
"df.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With Optimus you can select columns with index or name, and even combined"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------------------+---+-------+-----+\n",
"| words|num|animals|thing|\n",
"+-------------------+---+-------+-----+\n",
"| I like fish | 1| dog|housé|\n",
"| zombies| 2| cat| tv|\n",
"|simpsons cat lady| 2| frog|table|\n",
"| null| 3| eagle|glass|\n",
"+-------------------+---+-------+-----+\n",
"\n"
]
}
],
"source": [
"columns = [\"words\", 1, \"animals\", 3]\n",
"df.cols.select(columns).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Select columns with a Regex"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---+-----+\n",
"|num|num 2|\n",
"+---+-----+\n",
"| 1| 1|\n",
"| 2| 2|\n",
"| 2| 3|\n",
"| 3| 4|\n",
"+---+-----+\n",
"\n"
]
}
],
"source": [
"df.cols.select(\"n.*\", regex = True).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Select all the columns of type string"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+-------------------+-------------+------+-----+-------+\n",
"|num 2| words| two strings|filter|thing|animals|\n",
"+-----+-------------------+-------------+------+-----+-------+\n",
"| 1| I like fish | cat-car| a|housé| dog|\n",
"| 2| zombies| dog-tv| b| tv| cat|\n",
"| 3|simpsons cat lady|eagle-tv-plus| 1|table| frog|\n",
"| 4| null| lion-pc| c|glass| eagle|\n",
"+-----+-------------------+-------------+------+-----+-------+\n",
"\n"
]
}
],
"source": [
"df.cols.select(\"*\", data_type = \"str\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rename a columun"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------------------+------+-------+-----+-------------+------+-----+-----------------+---------+\n",
"| words|number|animals|thing| two strings|filter|num 2| col_array| col_int|\n",
"+-------------------+------+-------+-----+-------------+------+-----+-----------------+---------+\n",
"| I like fish | 1| dog|housé| cat-car| a| 1| [baby, sorry]|[1, 2, 3]|\n",
"| zombies| 2| cat| tv| dog-tv| b| 2|[baby 1, sorry 1]| [3, 4]|\n",
"|simpsons cat lady| 2| frog|table|eagle-tv-plus| 1| 3|[baby 2, sorry 2]|[5, 6, 7]|\n",
"| null| 3| eagle|glass| lion-pc| c| 4|[baby 3, sorry 3]| [7, 8]|\n",
"+-------------------+------+-------+-----+-------------+------+-----+-----------------+---------+\n",
"\n"
]
}
],
"source": [
"df.cols.rename('num','number').show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rename multiple columns and uppercase all the columns¶"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
"| WORDS|NUM|ANIMALS|THING| TWO STRINGS|FILTER|NUM 2| COL_ARRAY| COL_INT|\n",
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
"| I like fish | 1| dog|housé| cat-car| a| 1| [baby, sorry]|[1, 2, 3]|\n",
"| zombies| 2| cat| tv| dog-tv| b| 2|[baby 1, sorry 1]| [3, 4]|\n",
"|simpsons cat lady| 2| frog|table|eagle-tv-plus| 1| 3|[baby 2, sorry 2]|[5, 6, 7]|\n",
"| null| 3| eagle|glass| lion-pc| c| 4|[baby 3, sorry 3]| [7, 8]|\n",
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
"\n"
]
}
],
"source": [
"df.cols.rename([('num','number'),(\"animals\",\"gods\")], str.upper).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Convert to uppercase"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
"| WORDS|NUM|ANIMALS|THING| TWO STRINGS|FILTER|NUM 2| COL_ARRAY| COL_INT|\n",
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
"| I like fish | 1| dog|housé| cat-car| a| 1| [baby, sorry]|[1, 2, 3]|\n",
"| zombies| 2| cat| tv| dog-tv| b| 2|[baby 1, sorry 1]| [3, 4]|\n",
"|simpsons cat lady| 2| frog|table|eagle-tv-plus| 1| 3|[baby 2, sorry 2]|[5, 6, 7]|\n",
"| null| 3| eagle|glass| lion-pc| c| 4|[baby 3, sorry 3]| [7, 8]|\n",
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
"\n"
]
}
],
"source": [
"df.cols.rename(str.upper).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sort columns in alphabetical order"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------+-----------------+---------+------+---+-----+-----+-------------+-------------------+\n",
"|animals| col_array| col_int|filter|num|num 2|thing| two strings| words|\n",
"+-------+-----------------+---------+------+---+-----+-----+-------------+-------------------+\n",
"| dog| [baby, sorry]|[1, 2, 3]| a| 1| 1|housé| cat-car| I like fish |\n",
"| cat|[baby 1, sorry 1]| [3, 4]| b| 2| 2| tv| dog-tv| zombies|\n",
"| frog|[baby 2, sorry 2]|[5, 6, 7]| 1| 2| 3|table|eagle-tv-plus|simpsons cat lady|\n",
"| eagle|[baby 3, sorry 3]| [7, 8]| c| 3| 4|glass| lion-pc| null|\n",
"+-------+-----------------+---------+------+---+-----+-----+-------------+-------------------+\n",
"\n"
]
}
],
"source": [
"df.cols.sort(order = \"asc\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Drop multiple columns"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------+-----+-------------+------+-----+-----------------+---------+\n",
"|animals|thing| two strings|filter|num 2| col_array| col_int|\n",
"+-------+-----+-------------+------+-----+-----------------+---------+\n",
"| dog|housé| cat-car| a| 1| [baby, sorry]|[1, 2, 3]|\n",
"| cat| tv| dog-tv| b| 2|[baby 1, sorry 1]| [3, 4]|\n",
"| frog|table|eagle-tv-plus| 1| 3|[baby 2, sorry 2]|[5, 6, 7]|\n",
"| eagle|glass| lion-pc| c| 4|[baby 3, sorry 3]| [7, 8]|\n",
"+-------+-----+-------------+------+-----+-----------------+---------+\n",
"\n"
]
}
],
"source": [
"df.cols.drop([\"num\",\"words\"]).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sort by row values"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
"| words|num|animals|thing| two strings|filter|num 2| col_array| col_int|\n",
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
"|simpsons cat lady| 2| frog|table|eagle-tv-plus| 1| 3|[baby 2, sorry 2]|[5, 6, 7]|\n",
"| null| 3| eagle|glass| lion-pc| c| 4|[baby 3, sorry 3]| [7, 8]|\n",
"| I like fish | 1| dog|housé| cat-car| a| 1| [baby, sorry]|[1, 2, 3]|\n",
"| zombies| 2| cat| tv| dog-tv| b| 2|[baby 1, sorry 1]| [3, 4]|\n",
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
"\n"
]
}
],
"source": [
"df.rows.sort(\"animals\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Select by row"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------------------+---+-------+-----+-----------+------+-----+-------------+---------+\n",
"| words|num|animals|thing|two strings|filter|num 2| col_array| col_int|\n",
"+-------------------+---+-------+-----+-----------+------+-----+-------------+---------+\n",
"| I like fish | 1| dog|housé| cat-car| a| 1|[baby, sorry]|[1, 2, 3]|\n",
"+-------------------+---+-------+-----+-----------+------+-----+-------------+---------+\n",
"\n"
]
}
],
"source": [
"df.rows.select(df[\"num\"]==1).show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Chaining\n",
"\n",
".cols y .rows attributes are used to organize and encapsulate optimus functionality apart from Apache Spark Dataframe API.\n",
"\n",
"At the same time it can be helpfull when you look at the code because every line is self explained.\n",
"\n",
"The past transformations were done step by step, but this can be achieved by chaining all operations into one line of code, like the cell below. This way is much more efficient and scalable because it uses all optimization issues from the lazy evaluation approach."
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Orginal DF\n",
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
"| words|num|animals|thing| two strings|filter|num 2| col_array| col_int|\n",
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
"| I like fish | 1| dog|housé| cat-car| a| 1| [baby, sorry]|[1, 2, 3]|\n",
"| zombies| 2| cat| tv| dog-tv| b| 2|[baby 1, sorry 1]| [3, 4]|\n",
"|simpsons cat lady| 2| frog|table|eagle-tv-plus| 1| 3|[baby 2, sorry 2]|[5, 6, 7]|\n",
"| null| 3| eagle|glass| lion-pc| c| 4|[baby 3, sorry 3]| [7, 8]|\n",
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
"\n",
"New DF\n",
"+-----------+-----+-----+---------+---------+------+---------+-----------------+-------+\n",
"|two strings|thing|num 2|new_col_2|new_col_1|filter| col_int| col_array|animals|\n",
"+-----------+-----+-----+---------+---------+------+---------+-----------------+-------+\n",
"| cat-car|housé| 1|spongebob| 1| a|[1, 2, 3]| [baby, sorry]| dog|\n",
"| dog-tv| tv| 2|spongebob| 1| b| [3, 4]|[baby 1, sorry 1]| cat|\n",
"| lion-pc|glass| 4|spongebob| 1| c| [7, 8]|[baby 3, sorry 3]| eagle|\n",
"+-----------+-----+-----+---------+---------+------+---------+-----------------+-------+\n",
"\n"
]
}
],
"source": [
"print(\"Orginal DF\")\n",
"df.show()\n",
"\n",
"print(\"New DF\")\n",
"df\\\n",
" .cols.rename([('num','number')])\\\n",
" .cols.drop([\"number\",\"words\"])\\\n",
" .withColumn(\"new_col_2\", lit(\"spongebob\"))\\\n",
" .cols.append(\"new_col_1\", 1)\\\n",
" .cols.sort(order= \"desc\")\\\n",
" .rows.drop(df[\"num 2\"] == 3)\\\n",
" .show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment