Creating PySpark DataFrame from CSV in AWS S3 in EMR
# Example uses GDELT dataset found here: https://aws.amazon.com/public-datasets/gdelt/
# Column headers found here: http://gdeltproject.org/data/lookups/CSV.header.dailyupdates.txt
# Load RDD
lines = sc.textFile("s3://gdelt-open-data/events/2016*") # Loads 73,385,698 records from 2016
# Split lines into columns; change split() argument depending on delimiter, e.g. '\t'
parts = lines.map(lambda l: l.split('\t'))
# Fetch the column headers, then convert the RDD into a DataFrame
from urllib.request import urlopen  # on Python 2, use: from urllib import urlopen
headers = urlopen("http://gdeltproject.org/data/lookups/CSV.header.dailyupdates.txt").read().decode('utf-8').rstrip()
columns = headers.split('\t')
df = spark.createDataFrame(parts, columns)
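
On Spark 2.0+, the same load can be done in one step with the DataFrame reader instead of splitting an RDD by hand. A minimal sketch, assuming the same GDELT path and the columns list fetched above:

# Alternative: let Spark's CSV reader handle splitting and type inference
# (inferSchema makes an extra pass over the data to guess column types)
df = spark.read.csv("s3://gdelt-open-data/events/2016*", sep='\t', inferSchema=True)
df = df.toDF(*columns)  # apply the GDELT header names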
# Second example: load the iris dataset from S3 as an RDD
lines = sc.textFile("s3://jakechenaws/tutorials/sample_data/iris/iris.csv")
# Split lines into columns; change split() argument depending on delimiter, e.g. ','
parts = lines.map(lambda l: l.split(','))
# Convert RDD into DataFrame
df = spark.createDataFrame(parts, ['sepal_length','sepal_width','petal_length','petal_width','class'])
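
Note that createDataFrame over split text lines produces all-string columns, so numeric operations will fail until you cast. A minimal sketch of casting the four measurement columns to floats (names as defined above):

from pyspark.sql.functions import col
# Cast measurement columns from string to float; 'class' remains a string label
for c in ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']:
    df = df.withColumn(c, col(c).cast('float'))
df.printSchema()  # verify the new column types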
RohitJain88 commented Aug 17, 2019

Is this spark application running on local or in EMR?
