Last active March 1, 2021 04:07
nmukerje/38183d0622ac2195552fd834b53fd7ee
Converts the GDELT Dataset in S3 to Parquet.
# Get the column names
from urllib.request import urlopen  # Python 3; the original used Python 2's urllib
html = urlopen("").read().decode("utf-8").rstrip()  # header-file URL omitted in the original gist
columns = html.split('\t')
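As a sanity check (a minimal sketch, not part of the original gist; the sample header line below is hypothetical), the column-name step above amounts to splitting one tab-separated line:

```python
# Hypothetical first few GDELT column names, tab-separated as in the real header file
sample_header = "GLOBALEVENTID\tSQLDATE\tMonthYear\tYear\tFractionDate\n"

# Same parsing as the gist: strip trailing whitespace, split on tabs
columns = sample_header.rstrip().split('\t')
print(columns)
# → ['GLOBALEVENTID', 'SQLDATE', 'MonthYear', 'Year', 'FractionDate']
```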
# Load 73,385,698 records from 2016
df1 ="delimiter", "\t").csv("s3://gdelt-open-data/events/2016*")
# Apply the schema: rename the raw columns to the header's column names
df2 = df1.toDF(*columns)
# Split SQLDATE (YYYYMMDD) into Year, Month and Day columns
from pyspark.sql.functions import expr
df3 = df2.withColumn("Year", expr("substring(SQLDATE, 1, 4)")) \
         .withColumn("Month", expr("substring(SQLDATE, 5, 2)")) \
         .withColumn("Day", expr("substring(SQLDATE, 7, 2)"))
# Write to Parquet in S3, partitioned by the date columns (target bucket is a placeholder)
df3.write.partitionBy("Year", "Month", "Day").parquet("s3://<your-bucket>/gdelt-parquet/")
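One detail worth noting: Spark SQL's `substring(str, pos, len)` uses 1-based positions, unlike Python's 0-based slicing. A small plain-Python sketch (no Spark required; `spark_substring` is a helper introduced here for illustration) shows the date split used above:

```python
def spark_substring(s, pos, length):
    """Mimic Spark SQL substring semantics: pos is 1-based."""
    return s[pos - 1 : pos - 1 + length]

sqldate = "20160315"  # hypothetical SQLDATE value in YYYYMMDD form
year = spark_substring(sqldate, 1, 4)   # "2016"
month = spark_substring(sqldate, 5, 2)  # "03"
day = spark_substring(sqldate, 7, 2)    # "15"
print(year, month, day)
# → 2016 03 15
```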