{
"metadata": {
"name": "Introduction to GDELT for the Hacking GDELT event"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##Subsetting GDELT\n",
"\n",
"The first approach to subsetting the GDELT data involves iterating over a list of files, opening each file, and iterating over each line in the file. This solves the problem of ingesting all 60GB+ of the GDELT data at a time. Once the subset is obtained, further analysis can be performed as usual while holding the smaller dataset in memory.\n",
"\n",
"###The Script\n",
"\n",
"The general approach for the rest of this tutorial is to present a chunk of the style of script that I use and then walk through the chunk line by line. As a caveat, it is possible to make this basic script more complex using command-line arguments and providing more arguments to the `process_gdelt` function presented below. This is not a particularly hard step to make, however, so I'll leave this as an exercise for those who are interested. I'm more than willing to help if you become stuck, though. Finally, there are some slightly more advanced techniques presented in order to process the files in parallel, but these can be ignored if you so desire. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import datetime\n",
"import glob\n",
"import pp\n",
"\n",
"def process_gdelt(in_file, action_country):\n",
" \"\"\"\"\n",
" Function to subset a file of the GDELT data based on a the ActionGeo_CountryCode variable.\n",
" \n",
" Parameters\n",
" ----------\n",
" \n",
" in_file: String.\n",
" Filepath for a given file containing some set of the GDELT data.\n",
" \n",
" action_country: String.\n",
" ISO-Alpha 2 country code of interest for action geolocation.\n",
" \n",
" Returns\n",
" -------\n",
" \n",
" data_out: String.\n",
" GDELT subset generated by the function. Columns are separated by tabs,\n",
" with lines terminated by '\\n' characters.\n",
" \"\"\"\n",
" data = open(in_file, 'r')\n",
" output = list()\n",
" for line in data:\n",
" line = line.replace('\\n', '')\n",
" split = line.split('\\t')\n",
" if len(split) < 58:\n",
" split.append('')\n",
" actiongeo_countrycode = split[51]\n",
" if actiongeo_countrycode == action_country:\n",
" output.append('\\t'.join(split))\n",
" data_out = '\\n'.join(output)\n",
" return data_out"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###`process_gdelt`\n",
"\n",
"The `process_gdelt` function assumes that you are passing it a filepath to a specific set of the GDELT data. It's possible to rework this so that you are iterating over a list of filepaths within the function, but, as will be seen below, processing one file at a time within the function allows for parallel processing.\n",
"\n",
"The first step in the `process_gdelt` function is to open the file containing the GDELT data. This creates a file object, which is very lightweight in terms of memory. The function then uses the `for line in data` convention to lazily iterate over the lines in the file object. Each line is cleaned (removing excess `\\n` characters) and split on the tabs. Since the data contains a different number of columns depending on the year, a simple check of the length of the file is performed (`if len(split) < 58`) and an empty field appended if necessary to ensure consistency. Then the field of interest is defined, in this case column 51: `ActionGeo_CountryCode`. The function then checks to see if this column matches the desired value and if so, joins the fields using tabs and appends the line to the temporary holding list. Finally, this holding list is converted to a string, with each line separated by a `\\n` character, and returned."
]
},
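{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before moving on to the full script, here is a minimal usage sketch of calling `process_gdelt` on a single file. The filename `20130926.export.CSV` is hypothetical; substitute any GDELT daily export file you have downloaded."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Minimal usage sketch. The filename below is hypothetical; point it at\n",
"# any GDELT daily export file on disk.\n",
"sample_file = '20130926.export.CSV'\n",
"\n",
"subset = process_gdelt(sample_file, 'PK')\n",
"\n",
"# The function returns a single string: tab-separated fields, with lines\n",
"# separated by '\\n' characters.\n",
"if subset:\n",
"    lines = subset.split('\\n')\n",
"    print 'Matched {} events'.format(len(lines))\n",
"    print lines[0][:80]\n",
"else:\n",
"    print 'No matching events found'"
],
"language": "python",
"metadata": {},
"outputs": []
},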
{
"cell_type": "code",
"collapsed": false,
"input": [
"if __name__ == '__main__':\n",
" print 'Running...'\n",
" gdelt_paths = glob.glob('*.CSV') + glob.glob('*.csv')\n",
"\n",
" servers = ()\n",
" job_server = pp.Server(ppservers=servers)\n",
" print 'Submitting jobs to parallel server...'\n",
" time1 = datetime.datetime.now()\n",
" print 'Time: {:02.0f}:{:02.0f}.{:02.0f}'.format(time1.hour,\n",
" time1.minute,\n",
" time1.second)\n",
" jobs = [job_server.submit(process_gdelt, (path, 'PK',), (), ()) for path in\n",
" gdelt_subset_paths]\n",
"\n",
" results = list()\n",
" for job in jobs:\n",
" results.append(job())\n",
"\n",
" final_data = '\\n'.join(results)\n",
" print 'Writing data...'\n",
" time2 = datetime.datetime.now()\n",
" print 'Time: {:02.0f}:{:02.0f}.{:02.0f}'.format(time2.hour,\n",
" time2.minute,\n",
" time2.second)\n",
" with open('./results/gdelt_pakistan.csv', 'w') as f:\n",
" f.write(final_data)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###Running the script\n",
"\n",
"With the primary function defined, it is possible to move to running the script. The first step is to obtain a list of all the filepaths to files that contain the GDELT data. To do so, `glob` is used to pull in any files that end in `.CSV` or `.csv`. It is important to note that this assumes that the directory the GDELT data is stored in is clean and does not contain any other CSV files. The next step (`ppservers = (); job_server = pp.Server(ppservers=ppservers)`) creates the servers for the parallel processing. In this case, the empty tuple indicates that a server should be created for each available processor core. The next substantive line (`jobs = [job_server.submit(process_gdelt, (path, 'PK',), (), ()) for path in gdelt_subset_paths]`) schedules the filepaths for processing on the separate CPU cores. This line reads simply \"submit a job to the `job_server` to process each path in the list of filepaths using the `process_gdelt` function with arguments of the filepath and the country code 'PK'.\" The empty tuples indicate that no additional libraries are needed, and that the `process_gdelt` function does not rely on any other functions. \n",
"\n",
"Once the jobs are submitted, it is necessary to iterate over the `jobs` list and call each job. This is important to note since the job objects are lazily evaluated. In other words, they do not perform their work until told to do so. Thus, the code\n",
"\n",
"```\n",
"results = list()\n",
"for job in jobs:\n",
" results.append(job())\n",
"```\n",
"\n",
"iterates over each job object, causes the object to run, and appends the output to the `results` list. This gives a list of lists, with the inner lists containing the tab and `\\n` separated strings. The `results` list is then joined using `\\n` characters (`'\\n'.join(results)`) and written to a file.\n",
"\n",
"*Non-parallel*\n",
"\n",
"In case you don't want to process the files in parallel, or don't wish to install the `parallel python` library, the files can be processed in serial using the following code."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"if __name__ == '__main__':\n",
" print 'Running...'\n",
" gdelt_paths = glob.glob('*.CSV') + glob.glob('*.csv')\n",
"\n",
" time1 = datetime.datetime.now()\n",
" print 'Time: {:02.0f}:{:02.0f}.{:02.0f}'.format(time1.hour,\n",
" time1.minute,\n",
" time1.second)\n",
" results = list()\n",
" for path in gdelt_paths:\n",
" print 'Processing {}'.format(path)\n",
" result = process_gdelt(path, 'PK')\n",
" results.append(result)\n",
"\n",
" final_data = '\\n'.join(results)\n",
" print 'Writing data...'\n",
" time2 = datetime.datetime.now()\n",
" print 'Time: {:02.0f}:{:02.0f}.{:02.0f}'.format(time2.hour,\n",
" time2.minute,\n",
" time2.second)\n",
" with open('./results/gdelt_pakistan.csv', 'w') as f:\n",
" f.write(final_data)"
],
"language": "python",
"metadata": {},
"outputs": []
},
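{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A `multiprocessing` alternative*\n",
"\n",
"If you want parallelism without installing `pp`, the standard library's `multiprocessing` module can fill the same role. The sketch below is my own variant, not part of the original script: it reuses `process_gdelt` from above, and the module-level `process_pk` wrapper exists because `Pool.map` passes a single argument to the worker and needs a picklable, module-level function. Running this as a standalone script is the most reliable way to use `multiprocessing`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import glob\n",
"import multiprocessing\n",
"\n",
"def process_pk(path):\n",
"    # Module-level wrapper (my addition) so the function pickles cleanly\n",
"    # for Pool.map; it fixes the country code argument to 'PK'.\n",
"    return process_gdelt(path, 'PK')\n",
"\n",
"if __name__ == '__main__':\n",
"    gdelt_paths = glob.glob('*.CSV') + glob.glob('*.csv')\n",
"\n",
"    # By default, Pool starts one worker process per available core.\n",
"    pool = multiprocessing.Pool()\n",
"    results = pool.map(process_pk, gdelt_paths)\n",
"    pool.close()\n",
"    pool.join()\n",
"\n",
"    final_data = '\\n'.join(results)\n",
"    with open('./results/gdelt_pakistan.csv', 'w') as f:\n",
"        f.write(final_data)"
],
"language": "python",
"metadata": {},
"outputs": []
},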
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###Further Analysis\n",
"\n",
"Once you have obtained a subset of the data, you can use whatever tools fit naturally into your workflow. For my work, I make use of `pandas`, `pandasql`, and `statsmodels`/`scikit-learn` to perform my analyses. To give a brief rundown of the `pandas` and `pandasql`, I'll read in the subset created above and perform some basic exploratory analysis on the data. It is possible to modify the above script to return a list of lists, which `pandas` can turn into a `DataFrame` object directly, but for now I will make use of the CSV file generated earlier."
]
},
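{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick aside, here is a minimal sketch of the list-of-lists route just mentioned. `process_gdelt_rows` is a hypothetical variant of `process_gdelt` that returns split rows rather than a joined string, which `pandas` can turn into a `DataFrame` directly; the rest of this section sticks with the CSV file."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"\n",
"def process_gdelt_rows(in_file, action_country):\n",
"    # Hypothetical variant of process_gdelt that returns a list of field\n",
"    # lists instead of a single tab/newline-joined string.\n",
"    rows = list()\n",
"    with open(in_file, 'r') as data:\n",
"        for line in data:\n",
"            split = line.rstrip('\\n').split('\\t')\n",
"            if len(split) < 58:\n",
"                split.append('')\n",
"            if split[51] == action_country:\n",
"                rows.append(split)\n",
"    return rows\n",
"\n",
"# pandas builds a DataFrame straight from the list of lists; `names` is\n",
"# the column list defined in the next cell, and the filename is again\n",
"# hypothetical.\n",
"# gdelt = pd.DataFrame(process_gdelt_rows('20130926.export.CSV', 'PK'),\n",
"#                      columns=names)"
],
"language": "python",
"metadata": {},
"outputs": []
},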
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"from pandasql import sqldf\n",
"\n",
"names = ['GLOBALEVENTID', 'SQLDATE', 'MonthYear', 'Year', 'FractionDate',\n",
"'Actor1Code', 'Actor1Name', 'Actor1CountryCode', 'Actor1KnownGroupCode',\n",
"'Actor1EthnicCode', 'Actor1Religion1Code', 'Actor1Religion2Code', 'Actor1Type1Code',\n",
"'Actor1Type2Code','Actor1Type3Code','Actor2Code','Actor2Name','Actor2CountryCode',\n",
"'Actor2KnownGroupCode','Actor2EthnicCode','Actor2Religion1Code','Actor2Religion2Code',\n",
"'Actor2Type1Code','Actor2Type2Code','Actor2Type3Code','IsRootEvent','EventCode',\n",
"'EventBaseCode','EventRootCode','QuadClass','GoldsteinScale','NumMentions',\n",
"'NumSources','NumArticles','AvgTone','Actor1Geo_Type','Actor1Geo_FullName',\n",
"'Actor1Geo_CountryCode','Actor1Geo_ADM1Code','Actor1Geo_Lat','Actor1Geo_Long',\n",
"'Actor1Geo_FeatureID','Actor2Geo_Type','Actor2Geo_FullName','Actor2Geo_CountryCode',\n",
"'Actor2Geo_ADM1Code','Actor2Geo_Lat','Actor2Geo_Long','Actor2Geo_FeatureID',\n",
"'ActionGeo_Type','ActionGeo_FullName','ActionGeo_CountryCode','ActionGeo_ADM1Code',\n",
"'ActionGeo_Lat','ActionGeo_Long','ActionGeo_FeatureID','DATEADDED','SOURCEURL']\n",
"\n",
"gdelt = pd.read_csv('./results/gdelt_pakistan.csv', sep='\\t', header=False, names=names, \n",
" dtype={'EventCode': object, 'EventBaseCode': object, 'EventRootCode': object},\n",
" encoding='utf-8')\n",
"\n",
"query = \"\"\"\n",
"SELECT\n",
"SQLDATE, ActionGeo_FullName, count(*) as event_count\n",
"FROM\n",
"gdelt\n",
"WHERE\n",
"EventRootCode == '19'\n",
"AND\n",
"MonthYear >= 201308\n",
"GROUP BY\n",
"SQLDATE, ActionGeo_FullName;\n",
"\"\"\"\n",
"\n",
"subset = sqldf(query, globals())\n",
"\n",
"print subset.head()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" SQLDATE ActionGeo_FullName event_count\n",
"0 20130801 Abbasi Shaheed Hospital, Sindh, Pakistan 1\n",
"1 20130801 Abbottabad, North-West Frontier, Pakistan 3\n",
"2 20130801 Abdul Wahid, Balochistan, Pakistan 1\n",
"3 20130801 Adiala, Punjab, Pakistan 1\n",
"4 20130801 Ahsanabad, Punjab, Pakistan 1\n"
]
}
],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###`pandas` and `pandasql`\n",
"\n",
"Using the `pandas` and `pandasql` libraries makes this type of exploratory data analysis extremely easy. The data is read in using the `read_csv` function from `pandas`. It is important to define the data type for some of the columns, since `pandas` will strip the leading zeros from the CAMEO codes. This can create difficulties since those zeros are rather important.\n",
"\n",
"With the data read in, it is simply a matter of defining a SQL query using standard syntax and then calling the `sqldf` function with the query as an argument along with `globals()`, which tells `sqldf` to look in the global namespace for the appropriate table (or `DataFrame` as is the case here). This function returns a `DataFrame`, which looks about as one would expect."
]
},
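{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the leading-zero point concrete, here is a small demonstration with made-up values: parsed with the default dtypes, the CAMEO code `010` silently becomes the integer `10`, while declaring the column as `object` preserves the string."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from StringIO import StringIO\n",
"\n",
"import pandas as pd\n",
"\n",
"# Tiny made-up sample: the CAMEO event code '010' keeps its meaning only\n",
"# when read as a string.\n",
"sample = StringIO('EventCode\\n010\\n190\\n')\n",
"\n",
"as_int = pd.read_csv(sample)\n",
"print as_int['EventCode'].tolist()   # [10, 190] -- leading zero lost\n",
"\n",
"sample.seek(0)\n",
"as_str = pd.read_csv(sample, dtype={'EventCode': object})\n",
"print as_str['EventCode'].tolist()   # ['010', '190'] -- preserved"
],
"language": "python",
"metadata": {},
"outputs": []
},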
{
"cell_type": "code",
"collapsed": false,
"input": [
"from IPython.core.display import HTML\n",
"#Making the CSS styling nice for the IPython notebook\n",
"\n",
"def css_styling():\n",
" styles = open(\"./styles/custom.css\", \"r\").read()\n",
" return HTML(styles)\n",
"css_styling()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<style>\n",
" @font-face {\n",
" font-family: \"Computer Modern\";\n",
" src: url('http://mirrors.ctan.org/fonts/cm-unicode/fonts/otf/cmunss.otf');\n",
" }\n",
" div.cell{\n",
" width:800px;\n",
" margin-left:16% !important;\n",
" margin-right:auto;\n",
" }\n",
" h1 {\n",
" font-family: Helvetica, serif;\n",
" }\n",
" h4{\n",
" margin-top:12px;\n",
" margin-bottom: 3px;\n",
" }\n",
" div.text_cell_render{\n",
" font-family: Computer Modern, \"Helvetica Neue\", Arial, Helvetica, Geneva, sans-serif;\n",
" line-height: 145%;\n",
" font-size: 130%;\n",
" width:800px;\n",
" margin-left:auto;\n",
" margin-right:auto;\n",
" }\n",
" .CodeMirror{\n",
" font-family: \"Source Code Pro\", source-code-pro,Consolas, monospace;\n",
" }\n",
" .prompt{\n",
" display: None;\n",
" }\n",
" .text_cell_render h5 {\n",
" font-weight: 300;\n",
" font-size: 22pt;\n",
" color: #4057A1;\n",
" font-style: italic;\n",
" margin-bottom: .5em;\n",
" margin-top: 0.5em;\n",
" display: block;\n",
" }\n",
" \n",
" .warning{\n",
" color: rgb( 240, 20, 20 )\n",
" } \n",
"</style>\n",
"<script>\n",
" MathJax.Hub.Config({\n",
" TeX: {\n",
" extensions: [\"AMSmath.js\"]\n",
" },\n",
" tex2jax: {\n",
" inlineMath: [ ['$','$'], [\"\\\\(\",\"\\\\)\"] ],\n",
" displayMath: [ ['$$','$$'], [\"\\\\[\",\"\\\\]\"] ]\n",
" },\n",
" displayAlign: 'center', // Change this to 'center' to center equations.\n",
" \"HTML-CSS\": {\n",
" styles: {'.MathJax_Display': {\"margin\": 4}}\n",
" }\n",
" });\n",
"</script>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 1,
"text": [
"<IPython.core.display.HTML at 0x10afb4610>"
]
}
],
"prompt_number": 1
}
],
"metadata": {}
}
]
}