@tusharvikky
Last active January 25, 2021 17:40
PySpark Setup
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "PySpark Setup",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true,
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/tusharvikky/dd1c889c90f05bf28a99306b917dde7c/pyspark-setup.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sq8U3BtmhtRx",
"colab_type": "text"
},
"source": [
"\n",
"# **Running Pyspark in Colab**\n",
"\n",
"To run spark in Colab, we need to first install all the dependencies in Colab environment i.e. Apache Spark 2.4.7 with hadoop 2.7, Java 8 and Find spark to locate the spark in the system. The tools installation can be carried out inside the Jupyter Notebook of the Colab. \n",
"Follow the steps to install the dependencies:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "lh5NCoc8fsSO",
"colab_type": "code",
"colab": {}
},
"source": [
"!apt-get install openjdk-8-jdk-headless -qq > /dev/null\n",
"!wget -q https://downloads.apache.org/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz\n",
"!tar xf spark-2.4.7-bin-hadoop2.7.tgz\n",
"!pip install -q findspark"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "ILheUROOhprv",
"colab_type": "text"
},
"source": [
"Now that you installed Spark and Java in Colab, it is time to set the environment path which enables you to run Pyspark in your Colab environment. Set the location of Java and Spark by running the following code:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "v1b8k_OVf2QF",
"colab_type": "code",
"colab": {}
},
"source": [
"import os\n",
"os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n",
"os.environ[\"SPARK_HOME\"] = \"/content/spark-2.4.7-bin-hadoop2.7\""
],
"execution_count": null,
"outputs": []
},
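{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check (not part of the original setup), you can confirm that both paths actually exist before starting Spark — a missing directory here usually means the download or extraction step failed:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import os\n",
"# Both should print True if the install and extraction succeeded\n",
"print(os.path.isdir(os.environ[\"JAVA_HOME\"]))\n",
"print(os.path.isdir(os.environ[\"SPARK_HOME\"]))"
],
"execution_count": null,
"outputs": []
},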
{
"cell_type": "markdown",
"metadata": {
"id": "KwrqMk3HiMiE",
"colab_type": "text"
},
"source": [
"Run a local spark session to test your installation:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "9_Uz1NL4gHFx",
"colab_type": "code",
"colab": {}
},
"source": [
"import findspark\n",
"findspark.init()\n",
"from pyspark.sql import SparkSession\n",
"spark = SparkSession.builder.master(\"local[*]\").getOrCreate()"
],
"execution_count": null,
"outputs": []
},
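{
"cell_type": "markdown",
"metadata": {},
"source": [
"To verify the session beyond its creation (an optional sanity check; the sample rows and column names below are purely illustrative), create a small DataFrame and run an action against it:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Illustrative data: two rows with hypothetical 'id' and 'label' columns\n",
"df = spark.createDataFrame([(1, \"alpha\"), (2, \"beta\")], [\"id\", \"label\"])\n",
"df.show()  # an action: forces the rows to be computed and printed\n",
"print(spark.version)"
],
"execution_count": null,
"outputs": []
},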
{
"cell_type": "markdown",
"metadata": {
"id": "JEb4HTRwiaJx",
"colab_type": "text"
},
"source": [
"Congrats! Your Colab is ready to run Pyspark.\n"
]
}
]
}