FavioVazquez/intro_optimus.ipynb

## intro_optimus.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to Optimus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From command line:\n",
    "\n",
    "```\n",
    "pip install optimuspyspark\n",
    "```\n",
    "\n",
    "from a notebook you can use:\n",
    "\n",
    "```\n",
    "!pip install optimuspyspark\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A good thing to do here is to restart the kernel. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Checking the installation\n",
    "from optimus import Optimus\n",
    "op= Optimus(master=\"local\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## DataFrame creation\n",
    "\n",
    "Create a dataframe to passing a list of values for columns and rows. Unlike pandas you need to specify the column names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = op.create.df(\n",
    "    [\n",
    "        \"names\",\n",
    "        \"height(ft)\",\n",
    "        \"function\",\n",
    "        \"rank\",\n",
    "        \"weight(t)\",\n",
    "        \"japanese name\",\n",
    "        \"last position\",\n",
    "        \"attributes\"\n",
    "    ],\n",
    "    [\n",
    "        \n",
    "        (\"Optim'us\", 28.0, \"Leader\", 10, 4.3, [\"Inochi\", \"Convoy\"], \"19.442735,-99.201111\",[8.5344, 4300.0]),\n",
    "        (\"bumbl#ebéé  \", 17.5, \"Espionage\", 7, 2.0, [\"Bumble\",\"Goldback\"], \"10.642707,-71.612534\",[5.334, 2000.0]),\n",
    "        (\"ironhide&\", 26.0, \"Security\", 7, 4.0, [\"Roadbuster\"], \"37.789563,-122.400356\",[7.9248, 4000.0]),\n",
    "        (\"Jazz\",13.0, \"First Lieutenant\", 8, 1.8, [\"Meister\"], \"33.670666,-117.841553\",[3.9624, 1800.0]),\n",
    "        (\"Megatron\",None, \"None\", None, 5.7, [\"Megatron\"], None,[None,5700.0]),\n",
    "        (\"Metroplex_)^$\",300.0 , \"Battle Station\", 8, None, [\"Metroflex\"],None,[91.44, None]),\n",
    "        \n",
    "    ])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------+----------+----------------+----+---------+------------------+--------------------+----------------+\n",
      "|        names|height(ft)|        function|rank|weight(t)|     japanese name|       last position|      attributes|\n",
      "+-------------+----------+----------------+----+---------+------------------+--------------------+----------------+\n",
      "|     Optim'us|      28.0|          Leader|  10|      4.3|  [Inochi, Convoy]|19.442735,-99.201111|[8.5344, 4300.0]|\n",
      "| bumbl#ebéé  |      17.5|       Espionage|   7|      2.0|[Bumble, Goldback]|10.642707,-71.612534| [5.334, 2000.0]|\n",
      "|    ironhide&|      26.0|        Security|   7|      4.0|      [Roadbuster]|37.789563,-122.40...|[7.9248, 4000.0]|\n",
      "|         Jazz|      13.0|First Lieutenant|   8|      1.8|         [Meister]|33.670666,-117.84...|[3.9624, 1800.0]|\n",
      "|     Megatron|      null|            None|null|      5.7|        [Megatron]|                null|      [, 5700.0]|\n",
      "|Metroplex_)^$|     300.0|  Battle Station|   8|     null|       [Metroflex]|                null|        [91.44,]|\n",
      "+-------------+----------+----------------+----+---------+------------------+--------------------+----------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# View the data with plain spark\n",
    "df.show() # kinda ugly"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Creating a dataframe by passing a list of tuples specifyng the column data type. You can specify as data type an string or a Spark Datatypes. https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/package-summary.html\n",
    "\n",
    "Also you can use some Optimus predefined types:\n",
    "\n",
    "- \"str\" = StringType()\n",
    "- \"int\" = IntegerType()\n",
    "- \"float\" = FloatType()\n",
    "- \"bool\" = BoleanType()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------+----------+----------------+----+---------+------------------+--------------------+----------------+\n",
      "|        names|height(ft)|        function|rank|weight(t)|     japanese name|       last position|      attributes|\n",
      "+-------------+----------+----------------+----+---------+------------------+--------------------+----------------+\n",
      "|     Optim'us|      28.0|          Leader|  10|      4.3|  [Inochi, Convoy]|19.442735,-99.201111|[8.5344, 4300.0]|\n",
      "| bumbl#ebéé  |      17.5|       Espionage|   7|      2.0|[Bumble, Goldback]|10.642707,-71.612534| [5.334, 2000.0]|\n",
      "|    ironhide&|      26.0|        Security|   7|      4.0|      [Roadbuster]|37.789563,-122.40...|[7.9248, 4000.0]|\n",
      "|         Jazz|      13.0|First Lieutenant|   8|      1.8|         [Meister]|33.670666,-117.84...|[3.9624, 1800.0]|\n",
      "|     Megatron|      null|            None|null|      5.7|        [Megatron]|                null|      [, 5700.0]|\n",
      "|Metroplex_)^$|     300.0|  Battle Station|   8|     null|       [Metroflex]|                null|        [91.44,]|\n",
      "+-------------+----------+----------------+----+---------+------------------+--------------------+----------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "f = op.create.df(\n",
    "    [\n",
    "        (\"names\", \"str\"),\n",
    "        (\"height\", \"float\"),\n",
    "        (\"function\", \"str\"),\n",
    "        (\"rank\", \"int\"),\n",
    "    ],\n",
    "    [\n",
    "        (\"bumbl#ebéé  \", 17.5, \"Espionage\", 7),\n",
    "        (\"Optim'us\", 28.0, \"Leader\", 10),\n",
    "        (\"ironhide&\", 26.0, \"Security\", 7),\n",
    "        (\"Jazz\",13.0, \"First Lieutenant\", 8),\n",
    "        (\"Megatron\",None, \"None\", None),\n",
    "        \n",
    "    ])\n",
    "df.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating a Daframe using a pandas dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>\n",
       "    .data_type {\n",
       "        font-size: 0.8em;\n",
       "        font-weight: normal;\n",
       "    }\n",
       "\n",
       "    .column_name {\n",
       "        font-size: 1.2em;\n",
       "    }\n",
       "\n",
       "    .info_items {\n",
       "        margin: 10px 0;\n",
       "        font-size: 0.8em;\n",
       "    }\n",
       "\n",
       "    .optimus_table tr:nth-child(even) {\n",
       "        background-color: #f2f2f2 !important;\n",
       "    }\n",
       "\n",
       "    .optimus_table tr:nth-child(odd) {\n",
       "        background-color: #ffffff !important;\n",
       "    }\n",
       "\n",
       "    .optimus_table thead {\n",
       "        border-bottom: 1px solid black;\n",
       "    }\n",
       "    .optimus_table{\n",
       "        font-size: 12px;\n",
       "    }\n",
       "\n",
       "</style>\n",
       "\n",
       "\n",
       "\n",
       "\n",
       "<div class=\"info_items\">Viewing 3 of 3 rows / 4 columns</div>\n",
       "<div class=\"info_items\">1 partition(s)</div>\n",
       "\n",
       "<table class=\"optimus_table\">\n",
       "    <thead>\n",
       "    <tr>\n",
       "        \n",
       "        <th>\n",
       "            <div class=\"column_name\">names</div>\n",
       "            <div class=\"data_type\">1 (string)</div>\n",
       "            <div class=\"data_type\">\n",
       "                \n",
       "                nullable\n",
       "                \n",
       "            </div>\n",
       "        </th>\n",
       "        \n",
       "        <th>\n",
       "            <div class=\"column_name\">height</div>\n",
       "            <div class=\"data_type\">2 (double)</div>\n",
       "            <div class=\"data_type\">\n",
       "                \n",
       "                nullable\n",
       "                \n",
       "            </div>\n",
       "        </th>\n",
       "        \n",
       "        <th>\n",
       "            <div class=\"column_name\">function</div>\n",
       "            <div class=\"data_type\">3 (string)</div>\n",
       "            <div class=\"data_type\">\n",
       "                \n",
       "                nullable\n",
       "                \n",
       "            </div>\n",
       "        </th>\n",
       "        \n",
       "        <th>\n",
       "            <div class=\"column_name\">rank</div>\n",
       "            <div class=\"data_type\">4 (bigint)</div>\n",
       "            <div class=\"data_type\">\n",
       "                \n",
       "                nullable\n",
       "                \n",
       "            </div>\n",
       "        </th>\n",
       "        \n",
       "    </tr>\n",
       "\n",
       "    </thead>\n",
       "    <tbody>\n",
       "    \n",
       "    <tr>\n",
       "        \n",
       "        <td>\n",
       "            bumbl#ebéé⸱⸱\n",
       "        </td>\n",
       "        \n",
       "        <td>\n",
       "            Espionage\n",
       "        </td>\n",
       "        \n",
       "        <td>\n",
       "            17.5\n",
       "        </td>\n",
       "        \n",
       "        <td>\n",
       "            7\n",
       "        </td>\n",
       "        \n",
       "    </tr>\n",
       "    \n",
       "    <tr>\n",
       "        \n",
       "        <td>\n",
       "            Optim&#39;us\n",
       "        </td>\n",
       "        \n",
       "        <td>\n",
       "            Leader\n",
       "        </td>\n",
       "        \n",
       "        <td>\n",
       "            28.0\n",
       "        </td>\n",
       "        \n",
       "        <td>\n",
       "            10\n",
       "        </td>\n",
       "        \n",
       "    </tr>\n",
       "    \n",
       "    <tr>\n",
       "        \n",
       "        <td>\n",
       "            ironhide&amp;\n",
       "        </td>\n",
       "        \n",
       "        <td>\n",
       "            Security\n",
       "        </td>\n",
       "        \n",
       "        <td>\n",
       "            26.0\n",
       "        </td>\n",
       "        \n",
       "        <td>\n",
       "            7\n",
       "        </td>\n",
       "        \n",
       "    </tr>\n",
       "    \n",
       "    </tbody>\n",
       "</table>\n",
       "\n",
       "<div class=\"info_items\">Viewing 3 of 3 rows / 4 columns</div>\n",
       "<div class=\"info_items\">1 partition(s)</div>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "data = [(\"bumbl#ebéé  \", 17.5, \"Espionage\", 7),\n",
    "         (\"Optim'us\", 28.0, \"Leader\", 10),\n",
    "         (\"ironhide&\", 26.0, \"Security\", 7)]\n",
    "labels = [\"names\", \"height\", \"function\", \"rank\"]\n",
    "\n",
    "# Create pandas dataframe\n",
    "pdf = pd.DataFrame.from_records(data, columns=labels)\n",
    "\n",
    "df = op.create.df(pdf = pdf)\n",
    "df.table()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Columns and Rows\n",
    "\n",
    "Optimus organized operations in columns and rows. This is a little different of how pandas works in which all operations are around the pandas class. We think this approach can better help you to access and transform data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---------+------+------------+----+\n",
      "| function|height|       names|rank|\n",
      "+---------+------+------------+----+\n",
      "|Espionage|  17.5|bumbl#ebéé  |   7|\n",
      "|   Leader|  28.0|    Optim'us|  10|\n",
      "| Security|  26.0|   ironhide&|   7|\n",
      "+---------+------+------------+----+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.cols.sort().show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------------+------+---------+----+\n",
      "|       names|height| function|rank|\n",
      "+------------+------+---------+----+\n",
      "|    Optim'us|  28.0|   Leader|  10|\n",
      "|bumbl#ebéé  |  17.5|Espionage|   7|\n",
      "|   ironhide&|  26.0| Security|   7|\n",
      "+------------+------+---------+----+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.rows.sort(\"rank\").show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Selection\n",
    "\n",
    "Unlike Pandas, Spark DataFrames don't support random row access. So methods like loc in pandas are not available.\n",
    "\n",
    "Also Pandas don't handle indexes. So methods like iloc are not available."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------------+\n",
      "|       names|\n",
      "+------------+\n",
      "|bumbl#ebéé  |\n",
      "|    Optim'us|\n",
      "|   ironhide&|\n",
      "+------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Select an show an specific column\n",
    "df.cols.select(\"names\").show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------------+\n",
      "|       names|\n",
      "+------------+\n",
      "|bumbl#ebéé  |\n",
      "|    Optim'us|\n",
      "|   ironhide&|\n",
      "+------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# This works with plain Spark too\n",
    "df.select(\"names\").show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------+------+--------+----+\n",
      "|   names|height|function|rank|\n",
      "+--------+------+--------+----+\n",
      "|Optim'us|  28.0|  Leader|  10|\n",
      "+--------+------+--------+----+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Select rows from a DataFrame where a the condition is meet\n",
    "df.rows.select(df[\"rank\"]>7).show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------+------+--------+----+\n",
      "|   names|height|function|rank|\n",
      "+--------+------+--------+----+\n",
      "|Optim'us|  28.0|  Leader|  10|\n",
      "+--------+------+--------+----+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# This is something like a filter in Spark\n",
    "df.filter(df[\"rank\"]>7).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This may seem weird, why we are overriding some functions, but it was needed to add more functionalities that Spark won't handle."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------------+------+---------+----+---+\n",
      "|       names|height| function|rank| id|\n",
      "+------------+------+---------+----+---+\n",
      "|bumbl#ebéé  |  17.5|Espionage|   7|  0|\n",
      "|    Optim'us|  28.0|   Leader|  10|  1|\n",
      "|   ironhide&|  26.0| Security|   7|  2|\n",
      "+------------+------+---------+----+---+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Creating an id for a DF\n",
    "df.create_id().show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------------+------+---------+----+-------+\n",
      "|       names|height| function|rank|new_col|\n",
      "+------------+------+---------+----+-------+\n",
      "|bumbl#ebéé  |  17.5|Espionage|   7|      1|\n",
      "|    Optim'us|  28.0|   Leader|  10|      1|\n",
      "|   ironhide&|  26.0| Security|   7|      1|\n",
      "+------------+------+---------+----+-------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Creating new columns\n",
    "\n",
    "df.cols.append(\"new_col\",1).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In plain Spark to do the same you will need to do:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------------+------+---------+----+-------+\n",
      "|       names|height| function|rank|new_col|\n",
      "+------------+------+---------+----+-------+\n",
      "|bumbl#ebéé  |  17.5|Espionage|   7|      1|\n",
      "|    Optim'us|  28.0|   Leader|  10|      1|\n",
      "|   ironhide&|  26.0| Security|   7|      1|\n",
      "+------------+------+---------+----+-------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from pyspark.sql.functions import lit\n",
    "df.withColumn(\"new_col\", lit(1)).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Which is not straightforward at all. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating a simple plot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAA1QAAAEHCAYAAACp5ActAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4xLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvDW2N/gAAFTVJREFUeJzt3X+w5WddH/D3h6zR2gARd0FINmzEUE2ZKrimoFZiodMEJctMqU0K1bSMGaEwiiiNYEOK1eGHA4Ux/oiV4YdICFTpti4NtcJQLMFswACbkLINCdkQSAiEkIkQVz7945wNJzd37948Obvn3OzrNXPnnu/zfc55PufMM+ee9/0+3++p7g4AAAD330MWXQAAAMBGJVABAAAMEqgAAAAGCVQAAACDBCoAAIBBAhUAAMAggQqAe1TVnqo6fdF1HClVta2quqo2Ddz3pKq6s6qOOZzjALDcBCqAo0RVXV9VT1/Rdm5VfejAdnf//e7+wCEeZ0OHg+lzfvMDfZzu/mx3H9fdfzuHmi6sqj9c0faBoyncAmxUAhUAS2WjBjUAjk4CFQD3mD2KVVWnVdXuqrqjqr5QVa+bdvvg9Pft0yVvT6mqh1TVr1bVDVV1S1W9taoePvO4Pz3dd1tV/fsV41xYVe+uqj+sqjuSnDsd+8NVdXtV3VxVv1VVx848XlfVC6rq01X11ar6tap6XFX9n2m9l872X4fnVNVnq+qLVfXymXEeUlXnV9X/m9Z+aVU9YrrvXkfqqurkqvrgtJ4/q6qLVh51Wm2cqjojycuS/Ivp63nV/agbgAUTqAA4mDckeUN3PyzJ45JcOm3/senv46dL3j6c5Nzpz48n+e4kxyX5rSSpqlOT/HaS5yR5dJKHJzlhxVg7krw7yfFJ3p7kb5O8OMnmJE9J8rQkL1hxn3+a5AeTPDnJS5NcnOS5SbYmeUKSc1Z7Ut395u4+d0Xzjyb5e9NxLqiq75u2vyjJs5I8Ncljknw5yUWrPW6SP0ryl0m+M8mFSf7VKn3uM053/48kv5HkndPX8/undZ5+qOWXACyeQAVwdHnP9KjP7VV1eyZB52D+Jsn3VNXm7r6zuy9fo+9zkryuu6/r7juT/EqSs6dHb56d5L9194e6++4kFyTpFff/cHe/p7u/0d1/3d1Xdvfl3b2/u69P8nuZhJpZr+nuO7p7T5JPJnnfdPyvJHlvkieu7yVJkvyH6bhXJbkqyfdP238uycu7e193fz2ToPTslcsSq+qkJD+U5ILuvru7P5Rk5/0YB4ANSqACOLo8q7uPP/CT+x71mfW8JI9P8qmquqKqfnKNvo9JcsPM9g1JNiV51HTfjQd2dPddSW5bcf8bZzeq6vFV9d+r6vPTZYC/kcnRqllfmLn916tsH7dGvSt9fub2XTP3fWySP5kJoNdkcvTsUSvu/5gkX5o+t1Wf0yHGAWCDEqgAWFV3f7q7z0nyyCSvTvLuqvq7ue/RpST5XCbh44CTkuzPJOTcnOTEAzuq6u9ksizuXsOt2P6dJJ9Kcsp0yeHLktT4sxl2Y5IzZ0Nod39bd9+0ot/NSR5RVd8+07b1foyz2msKwAYgUAGwqqp6blVt6e5vJLl92vyNJLdOf3/3TPd3JHnx9MIMx+Wb5wTtz+TcqGdW1Q9PLxRxYQ4djh6a5I4kd1bV9yZ5/rye1/30u0l+vaoemyRVtaWqdqzs1N03JNmd5MKqOraqnpLkmfdjnC8k2VZV/i4DbDDeuAE4mDOS7KmqOzO5QMXZ0/N/7kry60n+YroU7slJ3pTkbZlcAfAzSb6WyQUdMj3H6UVJLsnkSM6dSW5J8vU1xv6lJP8yyVeT/H6Sd87/6a3LGzI5F+p9VfXVJJcn+YcH6fucTC6gcVuS/5hJzWs9x1nvmv6+rao+Ol4uAEdadVtlAMCRMz2CdXsmy/k+s+h6DpeqemeST3X3KxZdCwCHjyNUABx2VfXMqvr26TlYv5nkE0muX2xV81VVPzT9LqyHTL9bakeS9yy6LgAOL4EKgCNhRyYXrvhcklMyWT74YFsi8V1JPpDJksY3Jnl+d39soRUBcNhZ8gcAADDIESoAAIBBAhUAAMCgTYsaePPmzb1t27ZFDQ8AAHBQV1555Re7e8uh+i0sUG3bti27d+9e1PAAAAAHVVU3rKefJX8AAACDBCoAAIBBAhUAAMAggQoAAGDQIQNVVb2pqm6pqk8eZH9V1Ruram9VfbyqnjT/MgEAAJbPeo5QvTnJGWvsPzPJKdOf85L8zgMvCwAAYPkdMlB19weTfGmNLjuSvLUnLk9yfFU9el4FAgAALKt5nEN1QpIbZ7b3TdsAAAAe1I7oF/tW1XmZLAvMSSeddCSHXpdt5//poktYGte/6icWXQIAsEH5TPVNa32m8jp900b+7DmPI1Q3Jdk6s33itO0+uvvi7t7e3du3bNkyh6EBAAAWZx6BameSn55e7e/JSb7S3TfP4XEBAACW2iGX/FXVO5KcnmRzVe1L8ook35Ik3f27SXYleUaSvUnuSvKvD1exAAAAy+SQgaq7zznE/k7yb+dWEQAAwAYxjyV/AAAARyWBCgAAYJBABQAAMEigAgAAGCRQAQAADBKoAAAABglUAAAAgwQqAACAQQIVAADAIIEKAABgkEAFAAAwSKACAAAYJFABAAAMEqgAAAAGCVQAAACDBCoAAIBBAhUAAMAggQoAAGCQQAUAADBIoAIAABgkUAEAAAwSqAAAAAYJVAAAAIMEKgAAgEECFQAAwCCBCgAAYJBABQAAMEigAgAAGCRQAQAADBKoAAAABglUAAAAgwQqAACAQQIVAADAoHUFqqo6o6quraq9VXX+KvtPqqr3V9XHqurjVfWM+ZcKAACwXA4ZqKrqmCQXJTkzyalJzqmqU1d0+9Ukl3b3E5OcneS3510oAADAslnPEarTkuzt7uu6++4klyTZsaJPJ3nY9PbDk3xufiUCAAAsp/UEqhOS3DizvW/aNuvCJM+tqn1JdiV50WoPVFXnVdXuqtp96623DpQLAACwPOZ1UYpzkry5u09M8owkb6uq+zx2d1/c3du7e/uWLVvmNDQAAMBirCdQ3ZRk68z2idO2Wc9LcmmSdPeHk3xbks3zKBAAAGBZrSdQXZHklKo6uaqOzeSiEztX9PlskqclSVV9XyaBypo+AADgQe2Qgaq79yd5YZLLklyTydX89lTVK6vqrGm3lyT52aq6Ksk7kpzb3X24igYAAFgGm9bTqbt3ZXKxidm2C2ZuX53kR+ZbGgAAwHKb10UpAAAAjjoCFQAAwCCBCgAAYJBABQAAMEigAgAAGCRQAQAADBKoAAAABglUAAAAgwQqAACAQQIVAADAIIEKAABgkEAFAAAwSKACAAAYJFABAAAMEqgAAAAGCVQAAACDBCoAAIBBAhUAAMAggQoAAGCQQAUAADBIoAIAABgkUAEAAAwSqAAAAAYJVAAAAIMEKgAAgEECFQAAwCCBCgAAYJBABQAAMEigAgAAGCRQAQAADBKoAAAABglUAAAAg9YVqKrqjKq6tqr2VtX5B+nzU1V1dVXtqao/mm+ZAAAAy2fToTpU1TFJLkryT5LsS3JFVe3s7qtn+pyS5FeS/Eh3f7mqHnm4CgYAAFgW6zlCdVqSvd19XXffneSSJDtW9PnZJBd195eTpLtvmW+ZAAAAy2c9geqEJDfObO+bts16fJLHV9VfVNXlVXXGag9UVedV1e6q2n3rrbeOVQwAALAk5nVRik1JTklyepJzkvx+VR2/slN3X9zd27t7+5YtW+Y0NAAAwGKsJ1DdlGTrzPaJ07ZZ+5Ls7O6/6e7PJPm/mQQsAACAB631BKorkpxSVSdX1bFJzk6yc0Wf92RydCpVtTmTJYDXzbFOAACApXPIQNXd+5O8MMllSa5Jcml376mqV1bVWdNulyW5raquTvL+JL/c3bcdrqIBAACWwSEvm54k3b0rya4VbRfM3O4kvzj9AQAAOCrM66IUAAAARx2BCgAAYJBABQAAMEigAgAAGCRQAQAADBKoAAAABglUAAAAgwQqAACAQQIVAADAIIEKAABgkEAFAAAwSKACAAAYJFABAAAMEqgAAAAGCVQAAACDBCoAAIBBAhUAAMAggQoAAGCQQAUAADBIoAIAABgkUAEAAAwSqAAAAAYJVAAAAIMEKgAAgEECFQAAwCCBCgAAYJBABQAAMEigAgAAGCRQAQAADBKoAAAABglUAAAAgwQqAACAQesKVFV1RlVdW1V7q+r8Nfr9s6rqqto+vxIBAACW0yEDVVUdk+SiJGcmOTXJOVV16ir9Hprk55N8ZN5FAgAALKP1HKE6Lcne7r6uu+9OckmSHav0+7Ukr07ytTnWBwAAsLTWE6hOSHLjzPa+ads9qupJSbZ295/OsTYAAICl9oAvSlFVD0nyuiQvWUff86pqd1XtvvXWWx/o0AAAAAu1nkB1U5KtM9snTtsOeGiSJyT5QFVdn+TJSXaudmGK7r64u7d39/YtW7aMVw0AALAE1hOorkhySlWdXFXHJjk7yc4DO7v7K929ubu3dfe2JJcnOau7dx+WigEAAJbEIQNVd+9P8sIklyW5Jsml3b2nql5ZVWcd7gIBAACW1ab1dOruXUl2rWi74CB9T3/gZQEAACy/B3xRCgAAgKOVQAUAADBIoAIAABgkUAEAAAwSqAAAAAYJVAAAAIMEKgAAgEECFQAAwCCBCgAAYJBABQAAMEigAgAAGCRQAQAADBKoAAAABglUAAAAgwQqAACAQQIVAADAIIEKAABgkEAFAAAwSKACAAAYJFABAAAMEqgAAAAGCVQAAACDBCoAAIBBAhUAAMAggQoAAGCQQAUAADBIoAIAABgkUAEAAAwSqAAAAAYJVAAAAIMEKgAAgEECFQAAwKB1BaqqOqOqrq2qvVV1/ir7f7Gqrq6qj1fV/6qqx86/VAAAgOVyyEBVVcckuSjJmUlOTXJOVZ26otvHkmzv7n+Q5N1JXjPvQgEAAJbNeo5QnZZkb3df1913J7kkyY7ZDt39/u6+a7p5eZIT51smAADA8llPoDohyY0z2/umbQfzvCTvfSBFAQAAbASb5vlgVfXcJNuTPPUg+89Lcl6SnHTSSfMcGgAA4IhbzxGqm5Jsndk+cdp2L1X19CQvT3JWd399tQfq7ou7e3t3b9+yZctIvQAAAEtjPYHqiiSnVNXJVXVskrOT7JztUFVPTPJ7mYSpW+ZfJgAAwPI5ZKDq7v1JXpjksiTXJLm0u/dU1Sur6qxpt9cmOS7Ju6rqr6pq50EeDgAA4EFjXedQdfeuJLtWtF0wc/vpc64LAABg6a3ri30BAAC4L4EKAABgkEAFAAAwSKACAAAYJFABAAAMEqgAAAAGCVQAAACDBCoAAIBBAhUAAMAggQoAAGCQQAUAADBIoAIAABgkUAEAAAwSqAAAAAYJVAAAAIMEKgAAgEECFQAAwCCBCgAAYJBABQAAMEigAgAAGCRQAQAADBKoAAAABglUAAAAgwQqAACAQQIVAADAIIEKAABgkEAFAAAwSKACAAAYJFABAAAMEqgAAAAGCVQAAACDBCoAAIBB6wpUVXVGVV1bVXur6vxV9n9rVb1zuv8jVbVt3oUCAAAsm0MGqqo6JslFSc5McmqSc6rq1BXdnpfky939PUlen+TV8y4UAABg2aznCNVpSfZ293XdfXeSS5LsWNFnR5K3TG+/O8nTqqrmVyYAAMDyWU+gOiHJjTPb+6Ztq/bp7v1JvpLkO+dRIAAAwLLadCQHq6rzkpw33byzqq49kuNvAJuTfHHRRSRJWbS5kSzNvGFDMW8YYd4w4qieNz5Trc8qr9MyzJvHrqfTegLVTUm2zmyfOG1brc++qtqU5OFJblv5QN19cZKL11PY0aiqdnf39kXXwcZi3jDCvGGEecMI84YRG2nerGfJ3xVJTqmqk6vq2CRnJ9m5os/OJD8zvf3sJH/e3T2/MgEAAJbPIY9Qdff+qnphksuSHJPkTd29p6pemWR3d+9M8gdJ3lZVe5N8KZPQBQAA8KC2rnOountXkl0r2i6Yuf21JP98vqUdlSyHZIR5wwjzhhHmDSPMG0ZsmHlTVuYBAACMWc85VAAAAKxCoFqQqnpTVd1SVZ+cafuBqrq8qv6qqnZX1WmLrJHlUlVbq+r9VXV1Ve2pqp+ftj+iqv5nVX16+vs7Fl0ry2ONefPaqvpUVX28qv6kqo5fdK0sj4PNm5n9L6mqrqrNi6qR5bPWvKmqF03fc/ZU1WsWWSfLZY2/Uxvmc7ElfwtSVT+W5M4kb+3uJ0zb3pfk9d393qp6RpKXdvfpCyyTJVJVj07y6O7+aFU9NMmVSZ6V5NwkX+ruV1XV+Um+o7v/3QJLZYmsMW9OzOSKrPurJt/+Yd5wwMHmTXdfXVVbk/znJN+b5Ae7e9HfE8OSWOP95lFJXp7kJ7r761X1yO6+ZZG1sjzWmDf/KRvkc7EjVAvS3R/M5IqI92pO8rDp7Ycn+dwRLYql1t03d/dHp7e/muSaJCck2ZHkLdNub8nkTQiSHHzedPf7unv/tNvlmQQsSLLm+02SvD7JSzP5mwX3WGPePD/Jq7r769N9whT3WGPebJjPxeu6yh9HzC8kuayqfjOTsPvDC66HJVVV25I8MclHkjyqu2+e7vp8Jv8JhPtYMW9m/Zsk7zzS9bAxzM6bqtqR5KbuvqqqFloXy23F+81rk/yjqvr1JF9L8kvdfcXiqmNZrZg3G+ZzsSNUy+X5SV7c3VuTvDiT7/eCe6mq45L8lyS/0N13zO6bfqG2/xpzHwebN1X18iT7k7x9UbWxvGbnTSbz5GVJLljzThz1Vnm/2ZTkEUmenOSXk1xaEjkrrDJvNsznYoFqufxMkj+e3n5XkqU9+Y7FqKpvyeTN5u3dfWCufGG6/vjAOmRLKbiXg8ybVNW5SX4yyXPaCbWssMq8eVySk5NcVVXXZ7JM9KNV9V2Lq5Jlc5D3m31J/rgn/jLJN5K4oAn3OMi82TCfiwWq5fK5JE+d3v7HST69wFpYMtP/5v1Bkmu6+3Uzu3Zm8qaT6e//eqRrY3kdbN5U1RmZnAdzVnfftaj6WE6rzZvu/kR3P7K7t3X3tkw+JD+puz+/wFJZImv8nXpPkh+f9nl8kmOTuJgJSdacNxvmc7Gr/C1IVb0jyemZ/IfmC0lekeTaJG/I5ND415K8oLuvXFSNLJeq+tEk/zvJJzL5714yWX7zkSSXJjkpyQ1Jfqq7V17whKPUGvPmjUm+Nclt07bLu/vnjnyFLKODzZvu3jXT5/ok213ljwPWeL/5syRvSvIDSe7O5ByqP19IkSydNebNHdkgn4sFKgAAgEGW/AEAAAwSqAAAAAYJVAAAAIMEKgAAgEECFQAAwCCBCgAAYJBABQAAMEigAgAAGPT/AcRQsAOVXnEbAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 864x360 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "df.plot.hist(\"height\",10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Yes now you can plot spark DF that easy. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading external data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---+--------------------+--------------------+---------+----------+-----+----------+--------+\n",
      "| id|           firstName|            lastName|billingId|   product|price|     birth|dummyCol|\n",
      "+---+--------------------+--------------------+---------+----------+-----+----------+--------+\n",
      "|  1|                Luis|         Alvarez$$%!|      123|      Cake|   10|1980/07/07|   never|\n",
      "|  2|               André|              Ampère|      423|      piza|    8|1950/07/08|   gonna|\n",
      "|  3|               NiELS|          Böhr//((%%|      551|     pizza|    8|1990/07/09|    give|\n",
      "|  4|                PAUL|              dirac$|      521|     pizza|    8|1954/07/10|     you|\n",
      "|  5|              Albert|            Einstein|      634|     pizza|    8|1990/07/11|      up|\n",
      "|  6|             Galileo|             GALiLEI|      672|     arepa|    5|1930/08/12|   never|\n",
      "|  7|                CaRL|            Ga%%%uss|      323|      taco|    3|1970/07/13|   gonna|\n",
      "|  8|               David|          H$$$ilbert|      624|  taaaccoo|    3|1950/07/14|     let|\n",
      "|  9|            Johannes|              KEPLER|      735|      taco|    3|1920/04/22|     you|\n",
      "| 10|               JaMES|         M$$ax%%well|      875|      taco|    3|1923/03/12|    down|\n",
      "| 11|               Isaac|              Newton|      992|     pasta|    9|1999/02/15|  never |\n",
      "| 12|              Emmy%%|            Nöether$|      234|     pasta|    9|1993/12/08|   gonna|\n",
      "| 13|              Max!!!|           Planck!!!|      111|hamburguer|    4|1994/01/04|    run |\n",
      "| 14|                Fred|            Hoy&&&le|      553|    pizzza|    8|1997/06/27|  around|\n",
      "| 15|(((   Heinrich )))))|               Hertz|      116|     pizza|    8|1956/11/30|     and|\n",
      "| 16|             William|          Gilbert###|      886|      BEER|    2|1958/03/26|  desert|\n",
      "| 17|               Marie|               CURIE|      912|      Rice|    1|2000/03/22|     you|\n",
      "| 18|              Arthur|          COM%%%pton|      812|    110790|    5|1899/01/01|       #|\n",
      "| 19|               JAMES|            Chadwick|      467|      null|   10|1921/05/03|       #|\n",
      "+---+--------------------+--------------------+---------+----------+-----+----------+--------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df_csv = op.load.csv(\"foo.csv\", header=True)\n",
    "df_csv.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading data from the Web"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---+--------------------+--------------------+---------+----------+-----+----------+--------+\n",
      "| id|           firstName|            lastName|billingId|   product|price|     birth|dummyCol|\n",
      "+---+--------------------+--------------------+---------+----------+-----+----------+--------+\n",
      "|  1|                Luis|         Alvarez$$%!|      123|      Cake|   10|1980/07/07|   never|\n",
      "|  2|               André|              Ampère|      423|      piza|    8|1950/07/08|   gonna|\n",
      "|  3|               NiELS|          Böhr//((%%|      551|     pizza|    8|1990/07/09|    give|\n",
      "|  4|                PAUL|              dirac$|      521|     pizza|    8|1954/07/10|     you|\n",
      "|  5|              Albert|            Einstein|      634|     pizza|    8|1990/07/11|      up|\n",
      "|  6|             Galileo|             GALiLEI|      672|     arepa|    5|1930/08/12|   never|\n",
      "|  7|                CaRL|            Ga%%%uss|      323|      taco|    3|1970/07/13|   gonna|\n",
      "|  8|               David|          H$$$ilbert|      624|  taaaccoo|    3|1950/07/14|     let|\n",
      "|  9|            Johannes|              KEPLER|      735|      taco|    3|1920/04/22|     you|\n",
      "| 10|               JaMES|         M$$ax%%well|      875|      taco|    3|1923/03/12|    down|\n",
      "| 11|               Isaac|              Newton|      992|     pasta|    9|1999/02/15|  never |\n",
      "| 12|              Emmy%%|            Nöether$|      234|     pasta|    9|1993/12/08|   gonna|\n",
      "| 13|              Max!!!|           Planck!!!|      111|hamburguer|    4|1994/01/04|    run |\n",
      "| 14|                Fred|            Hoy&&&le|      553|    pizzza|    8|1997/06/27|  around|\n",
      "| 15|(((   Heinrich )))))|               Hertz|      116|     pizza|    8|1956/11/30|     and|\n",
      "| 16|             William|          Gilbert###|      886|      BEER|    2|1958/03/26|  desert|\n",
      "| 17|               Marie|               CURIE|      912|      Rice|    1|2000/03/22|     you|\n",
      "| 18|              Arthur|          COM%%%pton|      812|    110790|    5|1899/01/01|       #|\n",
      "| 19|               JAMES|            Chadwick|      467|      null|   10|1921/05/03|       #|\n",
      "+---+--------------------+--------------------+---------+----------+-----+----------+--------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df_url =op.load.url(\"https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/data/foo.csv\")\n",
    "df_url.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Joins, merge, concat, etc."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Optimus provides and intuitive way to concat Dataframes by columns or rows. Operations like `join` and `group` are handle using Spark directly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------------+------+---------+----+\n",
      "|       names|height| function|rank|\n",
      "+------------+------+---------+----+\n",
      "|bumbl#ebéé  |  17.5|Espionage|   7|\n",
      "|    Optim'us|  28.0|   Leader|  10|\n",
      "|   ironhide&|  26.0| Security|   7|\n",
      "+------------+------+---------+----+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# We have this dataframe\n",
    "df.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's add a new column\n",
    "df_new = op.create.df(\n",
    "    [\n",
    "        \"try\"\n",
    "    ],\n",
    "    [\n",
    "        (\"one\"),\n",
    "        (\"two\"),\n",
    "        (\"three\")  \n",
    "    ])\n",
    "\n",
    "df_new = op.append([df,df_new], \"columns\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------------+------+---------+----+-----+\n",
      "|       names|height| function|rank|  try|\n",
      "+------------+------+---------+----+-----+\n",
      "|bumbl#ebéé  |  17.5|Espionage|   7|  one|\n",
      "|    Optim'us|  28.0|   Leader|  10|  two|\n",
      "|   ironhide&|  26.0| Security|   7|three|\n",
      "+------------+------+---------+----+-----+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df_new.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- names: string (nullable = true)\n",
      " |-- height: double (nullable = true)\n",
      " |-- function: string (nullable = true)\n",
      " |-- rank: long (nullable = true)\n",
      " |-- try: string (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df_new.printSchema()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------------+------+---------+----+-------+\n",
      "|       names|height| function|rank|    try|\n",
      "+------------+------+---------+----+-------+\n",
      "|bumbl#ebéé  |  17.5|Espionage|   7|    one|\n",
      "|    Optim'us|  28.0|   Leader|  10|    two|\n",
      "|   ironhide&|  26.0| Security|   7|  three|\n",
      "|       favio|  10.0|   Leader|   1|new_row|\n",
      "+------------+------+---------+----+-------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Let's add a new row\n",
    "\n",
    "new_row = [(\"favio\", 10, \"Leader\", 1, \"new_row\")]\n",
    "df_new = df_new.rows.append(new_row).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Doing fun stuff with columns and rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's create a dataframe :)\n",
    "\n",
    "\n",
    "from pyspark.sql.types import StructType, StructField, StringType, BooleanType, IntegerType, ArrayType\n",
    "\n",
    "df = op.create.df(\n",
    "            [\n",
    "                (\"words\", \"str\", True),\n",
    "                (\"num\", \"int\", True),\n",
    "                (\"animals\", \"str\", True),\n",
    "                (\"thing\", StringType(), True),\n",
    "                (\"two strings\", StringType(), True),\n",
    "                (\"filter\", StringType(), True),\n",
    "                (\"num 2\", \"string\", True),\n",
    "                (\"col_array\",  ArrayType(StringType()), True),\n",
    "                (\"col_int\",  ArrayType(IntegerType()), True)\n",
    "\n",
    "            ]\n",
    ",\n",
    "[\n",
    "                (\"  I like     fish  \", 1, \"dog\", \"housé\", \"cat-car\", \"a\",\"1\",[\"baby\", \"sorry\"],[1,2,3]),\n",
    "                (\"    zombies\", 2, \"cat\", \"tv\", \"dog-tv\", \"b\",\"2\",[\"baby 1\", \"sorry 1\"],[3,4]),\n",
    "                (\"simpsons   cat lady\", 2, \"frog\", \"table\",\"eagle-tv-plus\",\"1\",\"3\", [\"baby 2\", \"sorry 2\"], [5,6,7]),\n",
    "                (None, 3, \"eagle\", \"glass\", \"lion-pc\", \"c\",\"4\", [\"baby 3\", \"sorry 3\"] ,[7,8])\n",
    "            ])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
      "|              words|num|animals|thing|  two strings|filter|num 2|        col_array|  col_int|\n",
      "+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
      "|  I like     fish  |  1|    dog|housé|      cat-car|     a|    1|    [baby, sorry]|[1, 2, 3]|\n",
      "|            zombies|  2|    cat|   tv|       dog-tv|     b|    2|[baby 1, sorry 1]|   [3, 4]|\n",
      "|simpsons   cat lady|  2|   frog|table|eagle-tv-plus|     1|    3|[baby 2, sorry 2]|[5, 6, 7]|\n",
      "|               null|  3|  eagle|glass|      lion-pc|     c|    4|[baby 3, sorry 3]|   [7, 8]|\n",
      "+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With Optimus you can select columns with index or name, and even combined"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------------+---+-------+-----+\n",
      "|              words|num|animals|thing|\n",
      "+-------------------+---+-------+-----+\n",
      "|  I like     fish  |  1|    dog|housé|\n",
      "|            zombies|  2|    cat|   tv|\n",
      "|simpsons   cat lady|  2|   frog|table|\n",
      "|               null|  3|  eagle|glass|\n",
      "+-------------------+---+-------+-----+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "columns = [\"words\", 1, \"animals\", 3]\n",
    "df.cols.select(columns).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Select columns with a Regex"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+---+-----+\n",
      "|num|num 2|\n",
      "+---+-----+\n",
      "|  1|    1|\n",
      "|  2|    2|\n",
      "|  2|    3|\n",
      "|  3|    4|\n",
      "+---+-----+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.cols.select(\"n.*\", regex = True).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Select all the columns of type string"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-----+-------------------+-------------+------+-----+-------+\n",
      "|num 2|              words|  two strings|filter|thing|animals|\n",
      "+-----+-------------------+-------------+------+-----+-------+\n",
      "|    1|  I like     fish  |      cat-car|     a|housé|    dog|\n",
      "|    2|            zombies|       dog-tv|     b|   tv|    cat|\n",
      "|    3|simpsons   cat lady|eagle-tv-plus|     1|table|   frog|\n",
      "|    4|               null|      lion-pc|     c|glass|  eagle|\n",
      "+-----+-------------------+-------------+------+-----+-------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.cols.select(\"*\", data_type = \"str\").show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Rename a columun"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------------+------+-------+-----+-------------+------+-----+-----------------+---------+\n",
      "|              words|number|animals|thing|  two strings|filter|num 2|        col_array|  col_int|\n",
      "+-------------------+------+-------+-----+-------------+------+-----+-----------------+---------+\n",
      "|  I like     fish  |     1|    dog|housé|      cat-car|     a|    1|    [baby, sorry]|[1, 2, 3]|\n",
      "|            zombies|     2|    cat|   tv|       dog-tv|     b|    2|[baby 1, sorry 1]|   [3, 4]|\n",
      "|simpsons   cat lady|     2|   frog|table|eagle-tv-plus|     1|    3|[baby 2, sorry 2]|[5, 6, 7]|\n",
      "|               null|     3|  eagle|glass|      lion-pc|     c|    4|[baby 3, sorry 3]|   [7, 8]|\n",
      "+-------------------+------+-------+-----+-------------+------+-----+-----------------+---------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.cols.rename('num','number').show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Rename multiple columns and uppercase all the columns¶"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
      "|              WORDS|NUM|ANIMALS|THING|  TWO STRINGS|FILTER|NUM 2|        COL_ARRAY|  COL_INT|\n",
      "+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
      "|  I like     fish  |  1|    dog|housé|      cat-car|     a|    1|    [baby, sorry]|[1, 2, 3]|\n",
      "|            zombies|  2|    cat|   tv|       dog-tv|     b|    2|[baby 1, sorry 1]|   [3, 4]|\n",
      "|simpsons   cat lady|  2|   frog|table|eagle-tv-plus|     1|    3|[baby 2, sorry 2]|[5, 6, 7]|\n",
      "|               null|  3|  eagle|glass|      lion-pc|     c|    4|[baby 3, sorry 3]|   [7, 8]|\n",
      "+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.cols.rename([('num','number'),(\"animals\",\"gods\")], str.upper).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Convert to uppercase"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
      "|              WORDS|NUM|ANIMALS|THING|  TWO STRINGS|FILTER|NUM 2|        COL_ARRAY|  COL_INT|\n",
      "+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
      "|  I like     fish  |  1|    dog|housé|      cat-car|     a|    1|    [baby, sorry]|[1, 2, 3]|\n",
      "|            zombies|  2|    cat|   tv|       dog-tv|     b|    2|[baby 1, sorry 1]|   [3, 4]|\n",
      "|simpsons   cat lady|  2|   frog|table|eagle-tv-plus|     1|    3|[baby 2, sorry 2]|[5, 6, 7]|\n",
      "|               null|  3|  eagle|glass|      lion-pc|     c|    4|[baby 3, sorry 3]|   [7, 8]|\n",
      "+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.cols.rename(str.upper).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sort columns in alphabetical order"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+-----------------+---------+------+---+-----+-----+-------------+-------------------+\n",
      "|animals|        col_array|  col_int|filter|num|num 2|thing|  two strings|              words|\n",
      "+-------+-----------------+---------+------+---+-----+-----+-------------+-------------------+\n",
      "|    dog|    [baby, sorry]|[1, 2, 3]|     a|  1|    1|housé|      cat-car|  I like     fish  |\n",
      "|    cat|[baby 1, sorry 1]|   [3, 4]|     b|  2|    2|   tv|       dog-tv|            zombies|\n",
      "|   frog|[baby 2, sorry 2]|[5, 6, 7]|     1|  2|    3|table|eagle-tv-plus|simpsons   cat lady|\n",
      "|  eagle|[baby 3, sorry 3]|   [7, 8]|     c|  3|    4|glass|      lion-pc|               null|\n",
      "+-------+-----------------+---------+------+---+-----+-----+-------------+-------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.cols.sort(order = \"asc\").show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Drop multiple columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+-----+-------------+------+-----+-----------------+---------+\n",
      "|animals|thing|  two strings|filter|num 2|        col_array|  col_int|\n",
      "+-------+-----+-------------+------+-----+-----------------+---------+\n",
      "|    dog|housé|      cat-car|     a|    1|    [baby, sorry]|[1, 2, 3]|\n",
      "|    cat|   tv|       dog-tv|     b|    2|[baby 1, sorry 1]|   [3, 4]|\n",
      "|   frog|table|eagle-tv-plus|     1|    3|[baby 2, sorry 2]|[5, 6, 7]|\n",
      "|  eagle|glass|      lion-pc|     c|    4|[baby 3, sorry 3]|   [7, 8]|\n",
      "+-------+-----+-------------+------+-----+-----------------+---------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.cols.drop([\"num\",\"words\"]).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sort by row values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
      "|              words|num|animals|thing|  two strings|filter|num 2|        col_array|  col_int|\n",
      "+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
      "|simpsons   cat lady|  2|   frog|table|eagle-tv-plus|     1|    3|[baby 2, sorry 2]|[5, 6, 7]|\n",
      "|               null|  3|  eagle|glass|      lion-pc|     c|    4|[baby 3, sorry 3]|   [7, 8]|\n",
      "|  I like     fish  |  1|    dog|housé|      cat-car|     a|    1|    [baby, sorry]|[1, 2, 3]|\n",
      "|            zombies|  2|    cat|   tv|       dog-tv|     b|    2|[baby 1, sorry 1]|   [3, 4]|\n",
      "+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.rows.sort(\"animals\").show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Select by row"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------------+---+-------+-----+-----------+------+-----+-------------+---------+\n",
      "|              words|num|animals|thing|two strings|filter|num 2|    col_array|  col_int|\n",
      "+-------------------+---+-------+-----+-----------+------+-----+-------------+---------+\n",
      "|  I like     fish  |  1|    dog|housé|    cat-car|     a|    1|[baby, sorry]|[1, 2, 3]|\n",
      "+-------------------+---+-------+-----+-----------+------+-----+-------------+---------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.rows.select(df[\"num\"]==1).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Chaining\n",
    "\n",
    ".cols y .rows attributes are used to organize and encapsulate optimus functionality apart from Apache Spark Dataframe API.\n",
    "\n",
    "At the same time it can be helpfull when you look at the code because every line is self explained.\n",
    "\n",
    "The past transformations were done step by step, but this can be achieved by chaining all operations into one line of code, like the cell below. This way is much more efficient and scalable because it uses all optimization issues from the lazy evaluation approach."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Orginal DF\n",
      "+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
      "|              words|num|animals|thing|  two strings|filter|num 2|        col_array|  col_int|\n",
      "+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
      "|  I like     fish  |  1|    dog|housé|      cat-car|     a|    1|    [baby, sorry]|[1, 2, 3]|\n",
      "|            zombies|  2|    cat|   tv|       dog-tv|     b|    2|[baby 1, sorry 1]|   [3, 4]|\n",
      "|simpsons   cat lady|  2|   frog|table|eagle-tv-plus|     1|    3|[baby 2, sorry 2]|[5, 6, 7]|\n",
      "|               null|  3|  eagle|glass|      lion-pc|     c|    4|[baby 3, sorry 3]|   [7, 8]|\n",
      "+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n",
      "\n",
      "New DF\n",
      "+-----------+-----+-----+---------+---------+------+---------+-----------------+-------+\n",
      "|two strings|thing|num 2|new_col_2|new_col_1|filter|  col_int|        col_array|animals|\n",
      "+-----------+-----+-----+---------+---------+------+---------+-----------------+-------+\n",
      "|    cat-car|housé|    1|spongebob|        1|     a|[1, 2, 3]|    [baby, sorry]|    dog|\n",
      "|     dog-tv|   tv|    2|spongebob|        1|     b|   [3, 4]|[baby 1, sorry 1]|    cat|\n",
      "|    lion-pc|glass|    4|spongebob|        1|     c|   [7, 8]|[baby 3, sorry 3]|  eagle|\n",
      "+-----------+-----+-----+---------+---------+------+---------+-----------------+-------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(\"Orginal DF\")\n",
    "df.show()\n",
    "\n",
    "print(\"New DF\")\n",
    "df\\\n",
    "    .cols.rename([('num','number')])\\\n",
    "    .cols.drop([\"number\",\"words\"])\\\n",
    "    .withColumn(\"new_col_2\", lit(\"spongebob\"))\\\n",
    "    .cols.append(\"new_col_1\", 1)\\\n",
    "    .cols.sort(order= \"desc\")\\\n",
    "    .rows.drop(df[\"num 2\"] == 3)\\\n",
    "    .show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}