# Creating the Spark configuration and Spark context
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("My Dataframe")
sc = SparkContext(conf = conf)
from pyspark.sql import SparkSession # the pyspark.sql module is needed to work with DataFrames
spark = SparkSession(sc) # wrap the existing SparkContext in a SparkSession
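# Note: an equivalent pattern (not used in this gist) is to build the session directly,
# which creates the underlying SparkContext for you:
# spark = SparkSession.builder.appName("My Dataframe").getOrCreate()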
myRange = spark.range(1000).toDF("number")
# myRange is a Spark DataFrame with one column containing 1,000 rows with values from 0 to 999.
# When run on a cluster, each part of this range of numbers exists on a different executor.
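# As a quick sketch (not part of the original snippet), you can check how many
# partitions Spark split the range into by inspecting the underlying RDD:
print(myRange.rdd.getNumPartitions())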
# Let's perform a transformation:
divisBy2 = myRange.where("number % 2 = 0") # where() is an alias for filter()
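# Transformations like where() are lazy: nothing is computed until an action runs.
# A minimal sketch of triggering the job with standard actions:
divisBy2.count()  # action; returns 500, the count of even numbers in 0..999
divisBy2.show(5)  # action; prints the first 5 rows of the filtered DataFrame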