Converts the GDELT Dataset in S3 to Parquet.
# Get the column names from the GDELT header file (Python 3: urlopen lives in urllib.request and returns bytes)
from urllib.request import urlopen
header = urlopen("http://gdeltproject.org/data/lookups/CSV.header.dailyupdates.txt").read().decode("utf-8").rstrip()
columns = header.split('\t')
# Load 73,385,698 records from 2016
df1 = spark.read.option("delimiter", "\t").csv("s3://gdelt-open-data/events/2016*")
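# (Optional sanity check, not in the original gist: the header's column count
# should match the raw CSV's before the schema is applied with toDF.)
assert len(columns) == len(df1.columns), "header/CSV column count mismatch"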
# Apply the schema
df2 = df1.toDF(*columns)
# Derive Month and Day from SQLDATE (YYYYMMDD); a Year column already exists in the GDELT schema
from pyspark.sql.functions import expr
df3 = df2.withColumn("Month", expr("substring(SQLDATE, 5, 2)")).withColumn("Day", expr("substring(SQLDATE, 7, 2)"))
# Write to Parquet in S3, partitioned by Year/Month/Day
cols = ["Year", "Month", "Day"]
df3.repartition(*cols).write.mode("append").partitionBy(*cols).parquet("s3://<bucket>/gdelt/")
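# (Usage sketch, not in the original gist: reading the partitioned output back.
# Filtering on the Year/Month/Day partition columns lets Spark prune partitions
# and scan only the requested slice of the dataset.)
gdelt = spark.read.parquet("s3://<bucket>/gdelt/")
gdelt.where("Year = 2016").select("GLOBALEVENTID", "SQLDATE", "EventCode").show(5)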