@tusharvikky
Last active January 25, 2021 17:40
PySpark Setup
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "PySpark Setup",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true,
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/tusharvikky/dd1c889c90f05bf28a99306b917dde7c/pyspark-setup.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sq8U3BtmhtRx",
"colab_type": "text"
},
"source": [
"\n",
"# **Running Pyspark in Colab**\n",
"\n",
"To run spark in Colab, we need to first install all the dependencies in Colab environment i.e. Apache Spark 2.4.7 with hadoop 2.7, Java 8 and Find spark to locate the spark in the system. The tools installation can be carried out inside the Jupyter Notebook of the Colab. \n",
"Follow the steps to install the dependencies:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "lh5NCoc8fsSO",
"colab_type": "code",
"colab": {}
},
"source": [
"!apt-get install openjdk-8-jdk-headless -qq > /dev/null\n",
"!wget -q https://downloads.apache.org/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz\n",
"!tar xf spark-2.4.7-bin-hadoop2.7.tgz\n",
"!pip install -q findspark"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "ILheUROOhprv",
"colab_type": "text"
},
"source": [
"Now that you installed Spark and Java in Colab, it is time to set the environment path which enables you to run Pyspark in your Colab environment. Set the location of Java and Spark by running the following code:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "v1b8k_OVf2QF",
"colab_type": "code",
"colab": {}
},
"source": [
"import os\n",
"os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n",
"os.environ[\"SPARK_HOME\"] = \"/content/spark-2.4.7-bin-hadoop2.7\""
],
"execution_count": null,
"outputs": []
},
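{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check (not part of the original setup), you can confirm that both paths actually exist before starting Spark — a missing directory here usually means the download or extraction step failed:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import os\n",
"# Both should print True if the install and extraction succeeded\n",
"print(os.path.isdir(os.environ[\"JAVA_HOME\"]))\n",
"print(os.path.isdir(os.environ[\"SPARK_HOME\"]))"
],
"execution_count": null,
"outputs": []
},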
{
"cell_type": "markdown",
"metadata": {
"id": "KwrqMk3HiMiE",
"colab_type": "text"
},
"source": [
"Run a local spark session to test your installation:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "9_Uz1NL4gHFx",
"colab_type": "code",
"colab": {}
},
"source": [
"import findspark\n",
"findspark.init()\n",
"from pyspark.sql import SparkSession\n",
"spark = SparkSession.builder.master(\"local[*]\").getOrCreate()"
],
"execution_count": null,
"outputs": []
},
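{
"cell_type": "markdown",
"metadata": {},
"source": [
"To verify the session beyond its creation (an optional sanity check; the sample rows and column names below are purely illustrative), create a small DataFrame and run an action against it:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Illustrative data: two rows with hypothetical 'id' and 'label' columns\n",
"df = spark.createDataFrame([(1, \"alpha\"), (2, \"beta\")], [\"id\", \"label\"])\n",
"df.show()  # an action: forces the rows to be computed and printed\n",
"print(spark.version)"
],
"execution_count": null,
"outputs": []
},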
{
"cell_type": "markdown",
"metadata": {
"id": "JEb4HTRwiaJx",
"colab_type": "text"
},
"source": [
"Congrats! Your Colab is ready to run Pyspark.\n"
]
}
]
}