## Create a job

>`curl -v -X POST -u <USER>@idibon.com "https://historical.gnip.com/accounts/idibon/jobs.json" -d '{"publisher":"twitter","streamType":"track","dataFormat":"activity-streams","fromDate":"201501010000","toDate":"201506010001","title":"arabizi_testing","rules":[{"value":"3omry","tag":"3omry"},{"value":"3ala","tag":"3ala"},{"value":"enta","tag":"enta"},{"value":"ma3ak","tag":"ma3ak"}]}'`
The same command with the JSON expanded is below. For some (shoddy) official documentation and a general overview of the process, check out GNIP's API Documentation. But this gist will be far more useful.
```
curl -v -X POST -u <USER>@idibon.com "https://historical.gnip.com/accounts/idibon/jobs.json" -d '{
  "publisher": "twitter",
  "streamType": "track",
  "dataFormat": "activity-streams",
  "fromDate": "201501010000",
  "toDate": "201506010001",
  "title": "arabizi_testing",
  "rules": [
    { "value": "3omry", "tag": "3omry" },
    { "value": "3ala", "tag": "3ala" },
    { "value": "enta", "tag": "enta" },
    { "value": "ma3ak", "tag": "ma3ak" }
  ]
}'
```
This returns a JSON object. **You must grab the UUID from this response.** There are other ways to get it (e.g., from the full job list, shown below), but this is by far the easiest.
```
{
    "title": "arabizi_testing",
    "account": "idibon",
    "publisher": "twitter",
    "streamType": "track",
    "format": "activity_streams",
    "fromDate": "201501010000",
    "toDate": "201506010001",
    "requestedBy": "nick@idibon.com",
    "requestedAt": "2015-06-15T22:17:20Z",
    "status": "opened",
    "statusMessage": "Waiting on quote from Gnip.",
    "jobURL": "https://historical.gnip.com:443/accounts/idibon/publishers/twitter/historical/track/jobs/f3ehpx383h.json"
}
```
The UUID is the last path component of the `jobURL`, minus the `.json` extension -- here, `f3ehpx383h`.
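If you're scripting this, here's a minimal sketch for pulling the UUID out of that response -- just pipe the curl output into it. (This is my own helper, not part of GNIP's tooling.)

```
# extract_uuid.py -- a minimal sketch, not part of GNIP's tooling.
# Usage: curl ... | python extract_uuid.py
import json
import sys

response = json.load(sys.stdin)
# The UUID is the filename portion of jobURL, minus ".json".
print(response["jobURL"].rsplit("/", 1)[-1].rsplit(".", 1)[0])
```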
## Check your job's status

GNIP doesn't generate the quote instantly, so check back on the job using that UUID:

>`curl -sS -u <USER>@idibon.com https://historical.gnip.com/accounts/idibon/jobs/<JOB_UUID>.json | python -m json.tool`
{ "title": "arabizi_testing", "account": "idibon", "publisher": "twitter", "streamType": "track", "format": "activity_streams", "fromDate": "201501010000", "toDate": "201506010001", "requestedBy": "nick@idibon.com", "requestedAt": "2015-06-15T22:17:20Z", "status": "quoted", "statusMessage": "Job quoted and awaiting customer acceptance.", "jobURL": "https://historical.gnip.com:443/accounts/idibon/publishers/twitter/historical/track/jobs/f3ehpx383h.json", "quote": { "estimatedActivityCount": 787000, "estimatedDurationHours": "13.0", "estimatedFileSizeMb": "459.11", "expiresAt": "2015-06-22T22:21:36Z" }, "percentComplete": 0 }
## Accept or reject the quote given
>`curl -v -X PUT -u <USER>@idibon.com "https://historical.gnip.com/accounts/idibon/publishers/twitter/historical/track/jobs/<JOB_UUID>.json" -d '{"status":"accept"}'`
Then you'll receive this (the example below happens to be from a different, one-week job):
```
{
    "title": "arabizi_one_week",
    "account": "idibon",
    "publisher": "twitter",
    "streamType": "track",
    "format": "activity_streams",
    "fromDate": "201501010000",
    "toDate": "201501080001",
    "requestedBy": "razi@idibon.com",
    "requestedAt": "2015-06-15T23:17:10Z",
    "status": "accepted",
    "statusMessage": "Job accepted and ready to be queued.",
    "jobURL": "https://historical.gnip.com:443/accounts/idibon/publishers/twitter/historical/track/jobs/4jpv1n222f.json",
    "quote": {
        "estimatedActivityCount": 38000,
        "estimatedDurationHours": "1.0",
        "estimatedFileSizeMb": "21.91",
        "expiresAt": "2015-06-22T23:18:19Z"
    },
    "acceptedBy": "razi@idibon.com",
    "acceptedAt": "2015-06-15T23:23:34Z"
}
```
## List all jobs

You can also pull the full list of jobs on the account (handy if you didn't save a UUID):

>`curl -sS -u <USER>@idibon.com https://historical.gnip.com/accounts/idibon/jobs`
{ "jobs": [ { "uuid": "hj0z9e606a", "title": "arabizi_test", "jobURL": "https://historical.gnip.com:443/accounts/idibon/publishers/twitter/historical/track/jobs/hj0z9e606a.json", "status": "rejected", "publisher": "twitter", "streamType": "track", "fromDate": "201501010000", "toDate": "201505010001", "percentComplete": 0, "expiresAt": "2015-06-22T21:12:31Z" }, { "uuid": "g4ftcv73kk", "title": "arabizi_testing", "jobURL": "https://historical.gnip.com:443/accounts/idibon/publishers/twitter/historical/track/jobs/g4ftcv73kk.json", "status": "rejected", "publisher": "twitter", "streamType": "track", "fromDate": "201501010000", "toDate": "201506010001", "percentComplete": 0, "expiresAt": "2015-06-22T21:16:24Z" }, { "uuid": "f3ehpx383h", "title": "arabizi_testing", "jobURL": "https://historical.gnip.com:443/accounts/idibon/publishers/twitter/historical/track/jobs/f3ehpx383h.json", "status": "quoted", "publisher": "twitter", "streamType": "track", "fromDate": "201501010000", "toDate": "201506010001", "percentComplete": 0, "expiresAt": "2015-06-22T22:21:36Z" } ], "delivered": { "jobCount": 0, "jobDaysRun": 0, "activityCount": 0, "period": "trial", "since": "2015-06-10T17:35:18Z" } }
## Download files once they're ready
```
curl -sS -u <USER>:<PASSWORD> https://historical.gnip.com/accounts/idibon/publishers/twitter/historical/track/jobs/<JOB_ID>/results.csv | xargs -P 8 -t -n2 curl -o
```

(`results.csv` is a list of filename/URL pairs, so `xargs -n2` hands each pair to `curl -o <filename> <URL>`, running eight downloads at a time.)
Alternatively, GNIP provides a Bash script that it says can handle interrupted downloads; I haven't tried it yet.

Currently, I can download with Nick's account (admin) but not my own (user), even though I made the job myself. We're talking to GNIP about that.
If you choose to take `results.csv` manually, you'll get something that looks like this:

```
'20150101-20150108_4jpv1n222f_2015_01_01_00_00_activities.json.gz' 'https://s3-us-west-1.amazonaws.com/archive.replay.snapshots/snapshots/twitter/track/activity_streams/idibon/2015/06/15/20150101-20150108_4jpv1n222f/2015/01/01/00/00_activities.json.gz?AWSAccessKeyId=AKIAIQSGGQQJWT3OTJSQ&Expires=1437003266&Signature=csk8qfwf%2BdTAGF9q9XPAhVBnvSk%3D'
'20150101-20150108_4jpv1n222f_2015_01_01_00_10_activities.json.gz' 'https://s3-us-west-1.amazonaws.com/archive.replay.snapshots/snapshots/twitter/track/activity_streams/idibon/2015/06/15/20150101-20150108_4jpv1n222f/2015/01/01/00/10_activities.json.gz?AWSAccessKeyId=AKIAIQSGGQQJWT3OTJSQ&Expires=1437003266&Signature=xN0GDoNp50zjHj3d%2F%2Fq7%2B2UfRJs%3D'
'20150101-20150108_4jpv1n222f_2015_01_01_00_20_activities.json.gz' 'https://s3-us-west-1.amazonaws.com/archive.replay.snapshots/snapshots/twitter/track/activity_streams/idibon/2015/06/15/20150101-20150108_4jpv1n222f/2015/01/01/00/20_activities.json.gz?AWSAccessKeyId=AKIAIQSGGQQJWT3OTJSQ&Expires=1437003266&Signature=t3iLPdxGBa2%2BuELa7Je3hRwtMUk%3D'
'20150101-20150108_4jpv1n222f_2015_01_01_00_30_activities.json.gz' 'https://s3-us-west-1.amazonaws.com/archive.replay.snapshots/snapshots/twitter/track/activity_streams/idibon/2015/06/15/20150101-20150108_4jpv1n222f/2015/01/01/00/30_activities.json.gz?AWSAccessKeyId=AKIAIQSGGQQJWT3OTJSQ&Expires=1437003266&Signature=qEeHDs7cHPTtU%2FcbIUjNALqCxhY%3D'
```
Huh, weird, that's tab-separated instead of comma-separated. Thanks, GNIP! It's the filenames followed by their S3 locations, so now you can download those files. Alternatively, if you used the full command above, you'll now have a whole bunch of `.json.gz` files (one for every ten minutes of data) in your local directory.
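If you'd rather script the manual route than use the `xargs` one-liner, here's a minimal downloader sketch (it assumes the `requests` library and a locally saved `results.csv` in the quoted format shown above):

```
# Sketch: download each (filename, URL) pair from a saved results.csv.
# Not a robust parser -- it just relies on the quoting shown above.
import shlex

import requests

with open("results.csv") as f:
    tokens = shlex.split(f.read())  # strips the single quotes

# Tokens alternate: filename, URL, filename, URL, ...
for filename, url in zip(tokens[::2], tokens[1::2]):
    print("downloading", filename)
    with open(filename, "wb") as out:
        out.write(requests.get(url).content)
```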
## Unzipping and playing with the data
Run `find . -name '*.json.gz' -exec gunzip {} +` to unzip all of the files in the local directory. Note that this will not leave the original `.gz` files -- they will be replaced by their unzipped versions. A plain `gunzip *.json.gz` may fail with an "argument list too long" error: because the data is split into ten-minute slices, a long query can produce a huge number of files, and `find ... -exec ... +` sidesteps the shell's argument limit.
Your JSON files now contain a series of JSON activity objects, separated (sometimes) by newlines, with a trailing JSON info object; a parsing sketch follows the example below. Here's what a single activity looks like:
```
{
    "id": "tag:search.twitter.com,2005:550494145895096320",
    "objectType": "activity",
    "actor": {
        "objectType": "person",
        "id": "id:twitter.com:465687760",
        "link": "http://www.twitter.com/NouranFaisal",
        "displayName": "Nouran Faisal",
        "postedTime": "2012-01-16T16:44:49.000Z",
        "image": "https://pbs.twimg.com/profile_images/464141975080689666/RkCDUpQN_normal.jpeg",
        "summary": null,
        "links": [
            {
                "href": null,
                "rel": "me"
            }
        ],
        "friendsCount": 99,
        "followersCount": 259,
        "listedCount": 1,
        "statusesCount": 7998,
        "twitterTimeZone": "Athens",
        "verified": false,
        "utcOffset": "7200",
        "preferredUsername": "NouranFaisal",
        "languages": [
            "en"
        ],
        "favoritesCount": 136
    },
    "verb": "share",
    "postedTime": "2015-01-01T03:30:06.000Z",
    "generator": {
        "displayName": "Twitter for iPhone",
        "link": "http://twitter.com/download/iphone"
    },
    "provider": {
        "objectType": "service",
        "displayName": "Twitter",
        "link": "http://www.twitter.com"
    },
    "link": "http://twitter.com/NouranFaisal/statuses/550494145895096320",
    "body": "RT @donnamaherr: Howa 3amtan akeed inshallah hatkoon a happy new year tool ma enta msh feeha.",
    "object": {
        "id": "tag:search.twitter.com,2005:550077998737948672",
        "objectType": "activity",
        "actor": {
            "objectType": "person",
            "id": "id:twitter.com:904035835",
            "link": "http://www.twitter.com/donnamaherr",
            "displayName": "Donna.",
            "postedTime": "2012-10-25T14:38:12.000Z",
            "image": "https://pbs.twimg.com/profile_images/540882702363525120/DCNfXLEH_normal.jpeg",
            "summary": "Instagram: donnamaherr",
            "links": [
                {
                    "href": null,
                    "rel": "me"
                }
            ],
            "friendsCount": 163,
            "followersCount": 12772,
            "listedCount": 36,
            "statusesCount": 12081,
            "twitterTimeZone": null,
            "verified": false,
            "utcOffset": null,
            "preferredUsername": "donnamaherr",
            "languages": [
                "en"
            ],
            "location": {
                "objectType": "place",
                "displayName": "Maadi"
            },
            "favoritesCount": 3367
        },
        "verb": "post",
        "postedTime": "2014-12-30T23:56:29.000Z",
        "generator": {
            "displayName": "Twitter for iPhone",
            "link": "http://twitter.com/download/iphone"
        },
        "provider": {
            "objectType": "service",
            "displayName": "Twitter",
            "link": "http://www.twitter.com"
        },
        "link": "http://twitter.com/donnamaherr/statuses/550077998737948672",
        "body": "Howa 3amtan akeed inshallah hatkoon a happy new year tool ma enta msh feeha.",
        "object": {
            "objectType": "note",
            "id": "object:search.twitter.com,2005:550077998737948672",
            "summary": "Howa 3amtan akeed inshallah hatkoon a happy new year tool ma enta msh feeha.",
            "link": "http://twitter.com/donnamaherr/statuses/550077998737948672",
            "postedTime": "2014-12-30T23:56:29.000Z"
        },
        "favoritesCount": 23,
        "twitter_entities": {
            "hashtags": [],
            "trends": [],
            "urls": [],
            "user_mentions": [],
            "symbols": []
        },
        "twitter_filter_level": "low",
        "twitter_lang": "en"
    },
    "favoritesCount": 0,
    "twitter_entities": {
        "hashtags": [],
        "trends": [],
        "urls": [],
        "user_mentions": [
            {
                "screen_name": "donnamaherr",
                "name": "Donna.",
                "id": 904035835,
                "id_str": "904035835",
                "indices": [
                    3,
                    15
                ]
            }
        ],
        "symbols": []
    },
    "twitter_filter_level": "medium",
    "twitter_lang": "en",
    "retweetCount": 52,
    "gnip": {
        "matching_rules": [
            {
                "value": "enta",
                "tag": "enta"
            }
        ],
        "klout_score": 17,
        "language": {
            "value": "en"
        }
    }
}
```
And the final line:

```
{"info":{"message":"Replay Request Completed","sent":"2015-06-15T23:27:55+00:00","activity_count":6}}
```