Last active Jun 29, 2018
Converts the GDELT Dataset in S3 to Parquet.
# Get the column names from the GDELT header file
from urllib.request import urlopen  # Python 3; the original used Python 2's urllib
html = urlopen("").read().decode().rstrip()
columns = html.split('\t')
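As a sanity check, the header split above behaves like this on a short, made-up fragment (the field names are real GDELT column names, but the string itself is illustrative, not the full header):

```python
# Illustrative fragment of the tab-separated GDELT header line
header = "GLOBALEVENTID\tSQLDATE\tMonthYear\tYear\n"
columns = header.rstrip().split('\t')
print(columns)  # ['GLOBALEVENTID', 'SQLDATE', 'MonthYear', 'Year']
```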
# Load 73,385,698 records from 2016
df1 = spark.read.option("delimiter", "\t").csv("s3://gdelt-open-data/events/2016*")
# Apply the schema by naming the columns after the header fields
df2 = df1.toDF(*columns)
# Split SQLDATE to Year, Month and Day
from pyspark.sql.functions import expr
df3 = df2.withColumn("Year", expr("substring(SQLDATE, 1, 4)")) \
    .withColumn("Month", expr("substring(SQLDATE, 5, 2)")) \
    .withColumn("Day", expr("substring(SQLDATE, 7, 2)"))
# Write to Parquet in S3, partitioned by date ("my-bucket" is a placeholder path)
df3.write.partitionBy("Year", "Month", "Day").parquet("s3://my-bucket/gdelt/parquet/")
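Note that Spark SQL's `substring` is 1-indexed, unlike Python slicing. In plain Python the same YYYYMMDD split looks like this (the sample date is made up):

```python
# SQLDATE is an integer-like string in YYYYMMDD form
sqldate = "20160629"
year = sqldate[0:4]   # substring(SQLDATE, 1, 4) in Spark SQL
month = sqldate[4:6]  # substring(SQLDATE, 5, 2)
day = sqldate[6:8]    # substring(SQLDATE, 7, 2)
print(year, month, day)  # 2016 06 29
```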