## Create a job

>`curl -v -X POST -u <USER>@idibon.com "https://historical.gnip.com/accounts/idibon/jobs.json" -d '{"publisher":"twitter","streamType":"track","dataFormat":"activity-streams","fromDate":"201501010000","toDate":"201506010001","title":"arabizi_testing","rules":[{"value":"3omry","tag":"3omry"},{"value":"3ala","tag":"3ala"},{"value":"enta","tag":"enta"},{"value":"ma3ak","tag":"ma3ak"}]}'`
The same command with the JSON expanded is below. For some (shoddy) official documentation and a general overview of the process, check out GNIP's API Documentation. But this gist will be far more useful.
```
curl -v -X POST -u <USER>@idibon.com "https://historical.gnip.com/accounts/idibon/jobs.json" -d '{
  "publisher": "twitter",
  "streamType": "track",
  "dataFormat": "activity-streams",
  "fromDate": "201501010000",
  "toDate": "201506010001",
  "title": "arabizi_testing",
  "rules": [
    { "value": "3omry", "tag": "3omry" },
    { "value": "3ala", "tag": "3ala" },
    { "value": "enta", "tag": "enta" },
    { "value": "ma3ak", "tag": "ma3ak" }
  ]
}'
```
This returns a JSON object. **You must grab the UUID from this response.** There are other ways to get it (e.g., from the full job list, shown below), but this is by far the easiest.
```
{
    "title": "arabizi_testing",
    "account": "idibon",
    "publisher": "twitter",
    "streamType": "track",
    "format": "activity_streams",
    "fromDate": "201501010000",
    "toDate": "201506010001",
    "requestedBy": "nick@idibon.com",
    "requestedAt": "2015-06-15T22:17:20Z",
    "status": "opened",
    "statusMessage": "Waiting on quote from Gnip.",
    "jobURL": "https://historical.gnip.com:443/accounts/idibon/publishers/twitter/historical/track/jobs/f3ehpx383h.json"
}
```
The UUID is the last path component of the `jobURL`, minus the `.json` extension -- here, `f3ehpx383h`.
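If you're scripting this, here's a minimal sketch for pulling the UUID out of that response -- just pipe the curl output into it. (This is my own helper, not part of GNIP's tooling.)

```
# extract_uuid.py -- a minimal sketch, not part of GNIP's tooling.
# Usage: curl ... | python extract_uuid.py
import json
import sys

response = json.load(sys.stdin)
# The UUID is the filename portion of jobURL, minus ".json".
print(response["jobURL"].rsplit("/", 1)[-1].rsplit(".", 1)[0])
```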
## Check your job's status

GNIP doesn't generate the quote instantly, so check back on the job using that UUID:

>`curl -sS -u <USER>@idibon.com https://historical.gnip.com/accounts/idibon/jobs/<JOB_UUID>.json | python -m json.tool`
{ "title": "arabizi_testing", "account": "idibon", "publisher": "twitter", "streamType": "track", "format": "activity_streams", "fromDate": "201501010000", "toDate": "201506010001", "requestedBy": "nick@idibon.com", "requestedAt": "2015-06-15T22:17:20Z", "status": "quoted", "statusMessage": "Job quoted and awaiting customer acceptance.", "jobURL": "https://historical.gnip.com:443/accounts/idibon/publishers/twitter/historical/track/jobs/f3ehpx383h.json", "quote": { "estimatedActivityCount": 787000, "estimatedDurationHours": "13.0", "estimatedFileSizeMb": "459.11", "expiresAt": "2015-06-22T22:21:36Z" }, "percentComplete": 0 }
## Accept or reject the quote given
>`curl -v -X PUT -u <USER>@idibon.com "https://historical.gnip.com/accounts/idibon/publishers/twitter/historical/track/jobs/<JOB_UUID>.json" -d '{"status":"accept"}'`
Then you'll receive this (the example below happens to be from a different, one-week job):
```
{
    "title": "arabizi_one_week",
    "account": "idibon",
    "publisher": "twitter",
    "streamType": "track",
    "format": "activity_streams",
    "fromDate": "201501010000",
    "toDate": "201501080001",
    "requestedBy": "razi@idibon.com",
    "requestedAt": "2015-06-15T23:17:10Z",
    "status": "accepted",
    "statusMessage": "Job accepted and ready to be queued.",
    "jobURL": "https://historical.gnip.com:443/accounts/idibon/publishers/twitter/historical/track/jobs/4jpv1n222f.json",
    "quote": {
        "estimatedActivityCount": 38000,
        "estimatedDurationHours": "1.0",
        "estimatedFileSizeMb": "21.91",
        "expiresAt": "2015-06-22T23:18:19Z"
    },
    "acceptedBy": "razi@idibon.com",
    "acceptedAt": "2015-06-15T23:23:34Z"
}
```
## List all jobs

You can also pull the full list of jobs on the account (handy if you didn't save a UUID):

>`curl -sS -u <USER>@idibon.com https://historical.gnip.com/accounts/idibon/jobs`
{ "jobs": [ { "uuid": "hj0z9e606a", "title": "arabizi_test", "jobURL": "https://historical.gnip.com:443/accounts/idibon/publishers/twitter/historical/track/jobs/hj0z9e606a.json", "status": "rejected", "publisher": "twitter", "streamType": "track", "fromDate": "201501010000", "toDate": "201505010001", "percentComplete": 0, "expiresAt": "2015-06-22T21:12:31Z" }, { "uuid": "g4ftcv73kk", "title": "arabizi_testing", "jobURL": "https://historical.gnip.com:443/accounts/idibon/publishers/twitter/historical/track/jobs/g4ftcv73kk.json", "status": "rejected", "publisher": "twitter", "streamType": "track", "fromDate": "201501010000", "toDate": "201506010001", "percentComplete": 0, "expiresAt": "2015-06-22T21:16:24Z" }, { "uuid": "f3ehpx383h", "title": "arabizi_testing", "jobURL": "https://historical.gnip.com:443/accounts/idibon/publishers/twitter/historical/track/jobs/f3ehpx383h.json", "status": "quoted", "publisher": "twitter", "streamType": "track", "fromDate": "201501010000", "toDate": "201506010001", "percentComplete": 0, "expiresAt": "2015-06-22T22:21:36Z" } ], "delivered": { "jobCount": 0, "jobDaysRun": 0, "activityCount": 0, "period": "trial", "since": "2015-06-10T17:35:18Z" } }
## Download files once they're ready
```
curl -sS -u <USER>:<PASSWORD> https://historical.gnip.com/accounts/idibon/publishers/twitter/historical/track/jobs/<JOB_ID>/results.csv | xargs -P 8 -t -n2 curl -o
```

(`results.csv` is a list of filename/URL pairs, so `xargs -n2` hands each pair to `curl -o <filename> <URL>`, running eight downloads at a time.)
Alternatively, GNIP provides a Bash script that it says can handle interrupted downloads; I haven't tried it yet.

Currently, I can download with Nick's account (admin) but not my own (user), even though I made the job myself. We're talking to GNIP about that.
If you choose to take `results.csv` manually, you'll get something that looks like this:

```
'20150101-20150108_4jpv1n222f_2015_01_01_00_00_activities.json.gz' 'https://s3-us-west-1.amazonaws.com/archive.replay.snapshots/snapshots/twitter/track/activity_streams/idibon/2015/06/15/20150101-20150108_4jpv1n222f/2015/01/01/00/00_activities.json.gz?AWSAccessKeyId=AKIAIQSGGQQJWT3OTJSQ&Expires=1437003266&Signature=csk8qfwf%2BdTAGF9q9XPAhVBnvSk%3D'
'20150101-20150108_4jpv1n222f_2015_01_01_00_10_activities.json.gz' 'https://s3-us-west-1.amazonaws.com/archive.replay.snapshots/snapshots/twitter/track/activity_streams/idibon/2015/06/15/20150101-20150108_4jpv1n222f/2015/01/01/00/10_activities.json.gz?AWSAccessKeyId=AKIAIQSGGQQJWT3OTJSQ&Expires=1437003266&Signature=xN0GDoNp50zjHj3d%2F%2Fq7%2B2UfRJs%3D'
'20150101-20150108_4jpv1n222f_2015_01_01_00_20_activities.json.gz' 'https://s3-us-west-1.amazonaws.com/archive.replay.snapshots/snapshots/twitter/track/activity_streams/idibon/2015/06/15/20150101-20150108_4jpv1n222f/2015/01/01/00/20_activities.json.gz?AWSAccessKeyId=AKIAIQSGGQQJWT3OTJSQ&Expires=1437003266&Signature=t3iLPdxGBa2%2BuELa7Je3hRwtMUk%3D'
'20150101-20150108_4jpv1n222f_2015_01_01_00_30_activities.json.gz' 'https://s3-us-west-1.amazonaws.com/archive.replay.snapshots/snapshots/twitter/track/activity_streams/idibon/2015/06/15/20150101-20150108_4jpv1n222f/2015/01/01/00/30_activities.json.gz?AWSAccessKeyId=AKIAIQSGGQQJWT3OTJSQ&Expires=1437003266&Signature=qEeHDs7cHPTtU%2FcbIUjNALqCxhY%3D'
```
Huh, weird, that's tab-separated instead of comma-separated. Thanks, GNIP! It's the filenames followed by their S3 locations, so now you can download those files. Alternatively, if you used the full command above, you'll now have a whole bunch of `.json.gz` files (one for every ten minutes of data) in your local directory.
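If you'd rather script the manual route than use the `xargs` one-liner, here's a minimal downloader sketch (it assumes the `requests` library and a locally saved `results.csv` in the quoted format shown above):

```
# Sketch: download each (filename, URL) pair from a saved results.csv.
# Not a robust parser -- it just relies on the quoting shown above.
import shlex

import requests

with open("results.csv") as f:
    tokens = shlex.split(f.read())  # strips the single quotes

# Tokens alternate: filename, URL, filename, URL, ...
for filename, url in zip(tokens[::2], tokens[1::2]):
    print("downloading", filename)
    with open(filename, "wb") as out:
        out.write(requests.get(url).content)
```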
## Unzipping and playing with the data
Run `find . -name '*.json.gz' -exec gunzip {} +` to unzip all of the files in the local directory. Note that this will not leave the original `.gz` files -- they will be replaced by their unzipped versions. A plain `gunzip *.json.gz` may fail with an "argument list too long" error: because the data is split into ten-minute slices, a long query can produce a huge number of files, and `find ... -exec ... +` sidesteps the shell's argument limit.
Your JSON files now contain a series of JSON activity objects, separated (sometimes) by newlines, with a trailing JSON info object; a parsing sketch follows the example below. Here's what a single activity looks like:
```
{
    "id": "tag:search.twitter.com,2005:550494145895096320",
    "objectType": "activity",
    "actor": {
        "objectType": "person",
        "id": "id:twitter.com:465687760",
        "link": "http://www.twitter.com/NouranFaisal",
        "displayName": "Nouran Faisal",
        "postedTime": "2012-01-16T16:44:49.000Z",
        "image": "https://pbs.twimg.com/profile_images/464141975080689666/RkCDUpQN_normal.jpeg",
        "summary": null,
        "links": [
            {
                "href": null,
                "rel": "me"
            }
        ],
        "friendsCount": 99,
        "followersCount": 259,
        "listedCount": 1,
        "statusesCount": 7998,
        "twitterTimeZone": "Athens",
        "verified": false,
        "utcOffset": "7200",
        "preferredUsername": "NouranFaisal",
        "languages": [
            "en"
        ],
        "favoritesCount": 136
    },
    "verb": "share",
    "postedTime": "2015-01-01T03:30:06.000Z",
    "generator": {
        "displayName": "Twitter for iPhone",
        "link": "http://twitter.com/download/iphone"
    },
    "provider": {
        "objectType": "service",
        "displayName": "Twitter",
        "link": "http://www.twitter.com"
    },
    "link": "http://twitter.com/NouranFaisal/statuses/550494145895096320",
    "body": "RT @donnamaherr: Howa 3amtan akeed inshallah hatkoon a happy new year tool ma enta msh feeha.",
    "object": {
        "id": "tag:search.twitter.com,2005:550077998737948672",
        "objectType": "activity",
        "actor": {
            "objectType": "person",
            "id": "id:twitter.com:904035835",
            "link": "http://www.twitter.com/donnamaherr",
            "displayName": "Donna.",
            "postedTime": "2012-10-25T14:38:12.000Z",
            "image": "https://pbs.twimg.com/profile_images/540882702363525120/DCNfXLEH_normal.jpeg",
            "summary": "Instagram: donnamaherr",
            "links": [
                {
                    "href": null,
                    "rel": "me"
                }
            ],
            "friendsCount": 163,
            "followersCount": 12772,
            "listedCount": 36,
            "statusesCount": 12081,
            "twitterTimeZone": null,
            "verified": false,
            "utcOffset": null,
            "preferredUsername": "donnamaherr",
            "languages": [
                "en"
            ],
            "location": {
                "objectType": "place",
                "displayName": "Maadi"
            },
            "favoritesCount": 3367
        },
        "verb": "post",
        "postedTime": "2014-12-30T23:56:29.000Z",
        "generator": {
            "displayName": "Twitter for iPhone",
            "link": "http://twitter.com/download/iphone"
        },
        "provider": {
            "objectType": "service",
            "displayName": "Twitter",
            "link": "http://www.twitter.com"
        },
        "link": "http://twitter.com/donnamaherr/statuses/550077998737948672",
        "body": "Howa 3amtan akeed inshallah hatkoon a happy new year tool ma enta msh feeha.",
        "object": {
            "objectType": "note",
            "id": "object:search.twitter.com,2005:550077998737948672",
            "summary": "Howa 3amtan akeed inshallah hatkoon a happy new year tool ma enta msh feeha.",
            "link": "http://twitter.com/donnamaherr/statuses/550077998737948672",
            "postedTime": "2014-12-30T23:56:29.000Z"
        },
        "favoritesCount": 23,
        "twitter_entities": {
            "hashtags": [],
            "trends": [],
            "urls": [],
            "user_mentions": [],
            "symbols": []
        },
        "twitter_filter_level": "low",
        "twitter_lang": "en"
    },
    "favoritesCount": 0,
    "twitter_entities": {
        "hashtags": [],
        "trends": [],
        "urls": [],
        "user_mentions": [
            {
                "screen_name": "donnamaherr",
                "name": "Donna.",
                "id": 904035835,
                "id_str": "904035835",
                "indices": [
                    3,
                    15
                ]
            }
        ],
        "symbols": []
    },
    "twitter_filter_level": "medium",
    "twitter_lang": "en",
    "retweetCount": 52,
    "gnip": {
        "matching_rules": [
            {
                "value": "enta",
                "tag": "enta"
            }
        ],
        "klout_score": 17,
        "language": {
            "value": "en"
        }
    }
}
```
And the final line:

```
{"info":{"message":"Replay Request Completed","sent":"2015-06-15T23:27:55+00:00","activity_count":6}}
```