These instructions will help you better analyze the IRS 990 public dataset. The first thing you'll want to do is to read through the documentation over at Amazon. There's a ~108MB index file called index.json.gz that contains metadata describing the entire corpus.
To download the index.json.gz metadata file, run the following command (the -O flag saves the file under its remote name instead of writing the compressed bytes to standard output):
    curl -O https://s3.amazonaws.com/irs-form-990/index.json.gz
Once the download finishes, extract the contents with:
    gunzip index.json.gz
To take a peek at the extracted contents, use:
    head index.json
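If you would rather inspect the file from a script, the same peek can be sketched in Python. This is a minimal sketch, not part of the original instructions: the helper names peek_gzip and load_index are hypothetical, and it assumes the file is UTF-8 encoded JSON. It reads the compressed file directly, so it is an alternative to the gunzip step rather than a follow-up to it.

```python
import gzip
import json


def peek_gzip(path, n_chars=300):
    """Return the first n_chars characters of a gzip-compressed text file.

    Only the requested prefix is decompressed, so this stays fast even
    for a large index. (Hypothetical helper; UTF-8 is an assumption.)
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return fh.read(n_chars)


def load_index(path):
    """Parse the whole gzip-compressed JSON file into Python objects.

    This loads the entire document into memory, which may take a while
    and a fair amount of RAM for a file of this size.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)
```

Calling peek_gzip("index.json.gz") on the downloaded file returns the first 300 characters as a string, which is usually enough to see how the records are laid out before committing to a full load_index call.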
Looking at the index.json file, you'll notice that it holds a JSON structure stored as plain text. It contains an array of JSON objects that look like the following:
{"EIN": "721221647", "SubmittedOn": "2016-02-05", "TaxPeriod": "201412", "DLN": "93493309001115", "LastUpdated": "2016-03-21T17:2