Accessing data older than 90 days from OpenAQ

Currently, only data from the last 90 days is available via the OpenAQ API. However, much more data is available on OpenAQ, through a variety of access mechanisms. Note also that work is under way to bring back an API mechanism for accessing data older than 90 days; details here.

If you're looking to query across all the data, or to export the data (or a subset of it), the easiest way to do that currently is with a service like Amazon Athena. I'll provide directions below, but at a high level, this lets you run any query you'd like (written in SQL) against the entire dataset. I'll also provide some sample queries so you can see what's possible.

On to the directions!

  1. You will need to create an AWS account if you don't currently have one; you can start this process at https://aws.amazon.com/free/. Note that while an AWS account does not cost any money, running a query using Athena will. At current OpenAQ volume sizes, querying the whole dataset costs ~$0.01.

  2. Log in to your AWS account and go to the Athena console at https://console.aws.amazon.com/athena/home?region=us-east-1. Please make sure you're in the US East (N. Virginia) region as this will make your queries faster and save OpenAQ money on data transfer. You can confirm this by looking at the region name in the top right of the console.

  3. There is a lot more documentation on Athena here, but you should now be looking at the main console, where you can enter any query. First, though, you'll need to create a table on top of the OpenAQ data. To do this, copy and paste the table definition from the openaq.ddl file that's also in this gist, then click Run Query. In the Results pane at the bottom of the window, you should see Query Successful.

  4. That's it! Now you're ready to make any query you'd like. I'll provide some examples below to get you started.

Example queries

A simple query to just get all the data for a particular location (or city or country)

SELECT *
FROM openaq
WHERE location='US Diplomatic Post: Hanoi'

The average of all pm25 measurements for Baruun 4 zam location in Mongolia

SELECT avg(value) AS "avg_pm25"
FROM openaq
WHERE parameter='pm25'
        AND location='Baruun 4 zam';
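
In the same vein, a small sketch that counts how many measurements each parameter has at that location (same table and location as above):

```sql
-- Count measurements per parameter for a single location.
SELECT parameter, count(*) AS num_measurements
FROM openaq
WHERE location = 'Baruun 4 zam'
GROUP BY parameter
ORDER BY num_measurements DESC;
```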

The number of measurements over 35 µg/m³ in India

SELECT count(*) AS "num_over"
FROM openaq
WHERE parameter='pm25'
        AND value > 35
        AND country = 'IN'

Top 10 highest pm25 measurements sorted by highest and then date

SELECT *
FROM openaq
WHERE parameter='pm25'
ORDER BY value DESC, date.utc DESC
LIMIT 10

Athena also has geospatial support, which means you can do things like ask for all the measurements around a given latitude and longitude:

SELECT *
FROM openaq
WHERE ST_WITHIN(ST_POINT(coordinates.longitude, coordinates.latitude), ST_BUFFER(ST_POINT(4.773,52.428), 0.1));

Notes

  • The great thing about this mechanism is that you don't need to duplicate any data; you simply query with Athena and it runs against the latest data that OpenAQ maintains.

  • Athena has the full power of SQL behind it, so you can write any query you want; you're not limited by an API that OpenAQ creates. More details on SQL in general here.

  • If you're looking to use the data from these queries outside of visual inspection in Athena, you can click the 'Download CSV' button at the top of the Results pane. The CSV results will also be stored in an Amazon S3 bucket in your account, which you can programmatically access if you'd like.

  • Note that the underlying data being pointed to may change its structure as we figure out the most performant way to make the data available. We'll announce any changes, and if that happens, you'd just need to copy/paste an updated openaq.ddl file to recreate your table.

CREATE EXTERNAL TABLE `openaq`(
  `date` struct<utc:string,local:string> COMMENT 'from deserializer',
  `parameter` string COMMENT 'from deserializer',
  `location` string COMMENT 'from deserializer',
  `value` float COMMENT 'from deserializer',
  `unit` string COMMENT 'from deserializer',
  `city` string COMMENT 'from deserializer',
  `attribution` array<struct<name:string,url:string>> COMMENT 'from deserializer',
  `averagingperiod` struct<unit:string,value:float> COMMENT 'from deserializer',
  `coordinates` struct<latitude:float,longitude:float> COMMENT 'from deserializer',
  `country` string COMMENT 'from deserializer',
  `sourcename` string COMMENT 'from deserializer',
  `sourcetype` string COMMENT 'from deserializer',
  `mobile` string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://openaq-fetches/realtime-gzipped'
TBLPROPERTIES (
  'transient_lastDdlTime'='1518373755')
@TomasDob commented Jun 13, 2020

Really appreciate the advice, @jflasher.

Sorry to bother again. It is quite tough being new to this subject.
How do I convert a string date into a date object, and then truncate it to the day, to make sure things get averaged properly?

SELECT parameter, avg(value), count(value) as num_measurements
FROM openaq
WHERE location = 'Brugge' and date.local between '2020-01-01' and '2020-05-31'
GROUP BY date_trunc('day', from_iso8601_date(date.local)), parameter

@jflasher (Author) commented Jun 13, 2020

@TomasDob commented Jun 13, 2020

Unfortunately I get this error

line 3:50: extraneous input '‘' expecting {'(', '?', 'add', 'all', 'some', 'any', 'at', 'no', 'exists', 'null', 'true', 'false', 'substring', 'position', 'tinyint', 'smallint', 'integer', 'date', 'time', 'timestamp', 'interval', 'year', 'month', 'day', 'hour', 'minute', 'second', 'zone', 'current_date', 'current_time', 'current_timestamp', 'localtime', 'localtimestamp', 'extract', 'case', 'filter', 'over', 'partition', 'range', 'rows', 'preceding', 'following', 'current', 'row', 'schema', 'comment', 'view', 'replace', 'grant', 'revoke', 'privileges', 'public', 'option', 'explain', 'analyze', 'format', 'type', 'text', 'graphviz', 'logical', 'distributed', 'validate', 'cast', 'try_cast', 'show', 'tables', 'views', 'schemas', 'catalogs', 'columns', 'column', 'use', 'partitions', 'functions', 'to', 'system', 'bernoulli', 'poissonized', 'tablesample', 'array', 'map', 'set', 'reset', 'session', 'data', 'start', 'transaction', 'commit', 'rollback', 'work', 'isolation', 'level', 'serializable', 'repeatable', 'committed', 'uncommitted', 'read', 'write', 'only', 'call', 'input', 'output', 'cascade', 'restrict', 'including', 'excluding', 'properties', 'function', 'lambda_invoke', 'returns', 'sagemaker_invoke_endpoint', 'normalize', 'nfd', 'nfc', 'nfkd', 'nfkc', 'if', 'nullif', 'coalesce', '+', '-', string, binary_literal, integer_value, decimal_value, identifier, digit_identifier, quoted_identifier, backquoted_identifier, double_precision} (service: amazonathena; status code: 400; error code: invalidrequestexception; request id: 37099eeb-1bcd-416a-9170-7489a4daeaff)

@TomasDob commented Jun 13, 2020

Thanks a lot, @jflasher.
I was able to figure it out: using '2020-01-01' instead of ‘2020-01-01’ got the query running.

However, I always get 0 results.

@jflasher (Author) commented Jun 13, 2020

@TomasDob yep, looks like the quote got messed up somewhere along the way. I have updated the query I posted originally to be a bit more complete and to fix that issue.

You're getting 0 results because Brugge is not a location listed within our system. You can check the map (https://openaq.org/#/map) to see what the locations are called within the system; these are all set by the source providers of the data.

@TomasDob commented Jun 13, 2020

Thanks, @jflasher.
My bad about the location.

However, when trying to look up any location I get an error,
e.g. CH0031A (which would be Bern, Switzerland):

Your query has the following error(s):

INVALID_FUNCTION_ARGUMENT: Invalid format: "2020-01-04T11:00:00+01:00" is malformed at "T11:00:00+01:00"

This query ran against the "polution" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 42ac6db5-faa1-4b53-8f43-4dfd652493a0.

SELECT parameter, avg(value), count(value) as num_measurements
FROM openaq
WHERE location = 'CH0031A' and date.local between '2020-01-01' and '2020-05-31'
GROUP BY date_trunc('day', from_iso8601_date(date.local)), parameter
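
For what it's worth, the error above suggests from_iso8601_date won't accept a full timestamp with a time and offset like 2020-01-04T11:00:00+01:00. A sketch of a likely fix is to parse with from_iso8601_timestamp instead (the same function the later queries in this thread use), then truncate:

```sql
-- from_iso8601_timestamp accepts the offset suffix (+01:00)
-- that from_iso8601_date rejects.
SELECT parameter,
       avg(value) AS average,
       count(value) AS num_measurements
FROM openaq
WHERE location = 'CH0031A'
  AND date.local BETWEEN '2020-01-01' AND '2020-05-31'
GROUP BY date_trunc('day', from_iso8601_timestamp(date.local)), parameter
```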

@jflasher (Author) commented Jun 13, 2020

@TomasDob commented Jun 13, 2020

Great. It worked
Really appreciate your help!!!

@jflasher (Author) commented Jun 13, 2020

@c21nm commented Jun 17, 2020

Is it possible to take daily averages instead of hourly ones through Amazon Athena? I'm new to SQL and I don't know how to do it.

@jflasher (Author) commented Jun 17, 2020

Hi @c21nm, the code I provide in https://gist.github.com/jflasher/573525aff9a5d8a966e5718272ceb25a#gistcomment-3339482 does do daily averages. Also cc'ing @sruti here who is working on some averaging tools that may be of interest.

@amalie-create commented Jun 19, 2020

Hi, I am new to SQL. How can I get hourly averages instead of daily averages like in your example @jflasher?
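
A sketch of one way to get hourly averages, adapting the daily-average pattern used elsewhere in this thread by changing the date_trunc unit (the country, date range, and value filter here are placeholders):

```sql
-- Same shape as the daily-average query, but truncating to the hour.
SELECT date_format(date_trunc('hour', from_iso8601_timestamp(date.local)), '%Y-%m-%d %H:00') AS hour,
       location,
       parameter,
       avg(value) AS average,
       count(value) AS num_measurements
FROM openaq
WHERE country = 'NL'
  AND date.local BETWEEN '2020-01-16' AND '2020-05-17'
  AND value >= 0
GROUP BY date_trunc('hour', from_iso8601_timestamp(date.local)), location, parameter
ORDER BY hour ASC, parameter
```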

@jflasher (Author) commented Jun 19, 2020

@TomasDob commented Jun 28, 2020

Hey guys,
One more question.
Any idea how to deal with missing data? Right now the days with missing values are just skipped and not provided at all.
I would like the table to provide all daily values; if a value is missing, it should state a sentinel (e.g. YYY or 995) instead of just skipping that date.

appreciate your help
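
One way to fill in missing days, sketched under the assumption that Athena's Presto functions sequence and UNNEST are available: generate a calendar of days, LEFT JOIN the daily averages onto it, and use COALESCE to substitute a sentinel (995 here, per the comment above) where no data exists. The location, parameter, and date range are placeholders:

```sql
-- Build a calendar of all days in the range, then left-join the daily
-- averages so missing days still appear, with 995 as the sentinel value.
WITH days AS (
  SELECT date_format(d, '%Y-%m-%d') AS day
  FROM UNNEST(sequence(date '2020-01-01', date '2020-05-31', interval '1' day)) AS t(d)
),
daily AS (
  SELECT date_format(date_trunc('day', from_iso8601_timestamp(date.local)), '%Y-%m-%d') AS day,
         avg(value) AS average
  FROM openaq
  WHERE location = 'CH0031A'
    AND parameter = 'pm25'
    AND value >= 0
  GROUP BY 1
)
SELECT days.day, COALESCE(daily.average, 995) AS average
FROM days
LEFT JOIN daily ON days.day = daily.day
ORDER BY days.day
```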

@jflasher (Author) commented Jun 30, 2020

@RocketD0g commented Aug 11, 2020

An update: Data are now accessible from the OpenAQ API going back >2 years instead of 90 days.

@adjanip commented Oct 5, 2020

@jflasher I am trying to combine the advice you gave to TomasDob and scand363, but I am not getting the right results.
I would like to get the average daily value (for all 7 variables) for all individual monitor locations in the Netherlands between January 16, 2020 and May 16, 2020.

I ran:

SELECT date_format(date_trunc('day', from_iso8601_timestamp(date.local)), '%Y-%m-%d') as day, parameter, avg(value) as average, count(value) as num_measurements
FROM openaq
WHERE country = 'NL' and date.local between '2020-01-16' and '2020-05-17'
GROUP BY date_trunc('day', from_iso8601_timestamp(date.local)), parameter
ORDER BY day asc, parameter

Which gives me correctly

  • all 7 variables
  • the correct dates

but incorrectly

  • only one value for each variable (I assume the national average of that variable for the whole country)
  • All the daily values are negative for all of the dates

@maelle commented Oct 12, 2020

@RocketD0g @jflasher where is the new data availability documented, btw? https://docs.openaq.org/#api-Measurements-GetV1Measurements still mentions "Currently the API only supports a rolling 90 day archive of the data." I'd like to update ropenaq (a bit late, sorry).

@RocketD0g commented Oct 12, 2020

Hey @maelle! I've made a PR for the changes, but I don't believe they've been merged yet. Here's some documentation, though: https://medium.com/@openaq/openaq-extends-api-and-download-tool-access-from-90-days-to-2-years-3697540c85a3

@maelle commented Oct 12, 2020

awesome, thank you!

@jflasher (Author) commented Oct 12, 2020

@adjanip I believe the below will get you what you're looking for. I am filtering out values less than 0 (often error codes) which can erroneously affect the results. I am also grouping by location.

SELECT date_format(date_trunc('day', from_iso8601_timestamp(date.local)), '%Y-%m-%d') as day, location, parameter, avg(value) as average, count(value) as num_measurements
FROM openaq
WHERE country = 'NL' and date.local between '2020-01-16' and '2020-05-17' and value >= 0
GROUP BY date_trunc('day', from_iso8601_timestamp(date.local)), parameter, location
ORDER BY day asc, parameter

@Ester452 commented Oct 13, 2020

Hi! I get the error below (screenshot attached) when copying and pasting the openaq.ddl.

For my master's thesis I need the daily levels of the main air pollutants (pm25, pm10, NO2, CO) for the last 10 or 15 years for some Italian regions. I'm completely new to coding and really have no idea how to formulate that! If anyone could help, I'd be really glad.
Thank you!

@jflasher (Author) commented Oct 15, 2020

Hi @Ester452, it looks like you need to give your output bucket a more unique name so that it can be created. You may want to ask any further questions over on the OpenAQ Slack channels (https://openaq-slackin.herokuapp.com/); this gist is not seen by most of the community.

@sruti commented Oct 16, 2020

@Ester452 that slack link doesn't work anymore, use this one instead: https://join.slack.com/t/openaq/shared_invite/zt-gq14aew7-CVdp131g7TR7o9iiXIVDLw

@alfredocambera commented Dec 9, 2020

You can also use AWS Glue with the S3 bucket and it will create the DDL for the table.

@MaxPowerWasTaken commented Apr 19, 2021

Hi Jeff,

I landed here from the recommendation at this blog post to use your ddl file to query OpenAQ data in Athena.

However, I notice this ddl query ends with:

TBLPROPERTIES (
  'transient_lastDdlTime'='1518373755')

...which is from about three years ago, so I'm guessing I don't want to use it unchanged.

Would it be better to delete that TBLPROPERTIES statement altogether, or to keep it and sub in some more recent timestamp?

Thanks!

@jflasher (Author) commented Apr 19, 2021

Hi @MaxPowerWasTaken, you should be fine to delete it or leave it as is. It's just internal metadata; I don't believe it matters in this case.

@MaxPowerWasTaken commented Apr 21, 2021

Hi thanks @jflasher.

When I ran this ddl query (without the TBLPROPERTIES statement) I got the following error:
FAILED: ParseException line 5:0 cannot recognize input near 'ROW' 'FORMAT' 'SERDE'

I am using the new Athena engine v2, so maybe that has something to do with it. In any case, here's how I got around that error, in case it's helpful to anyone else:

I cataloged the S3 data using a Glue crawler to register it as a Hive table, just as the DDL statement would. Once Glue cataloged the table, I was also able to use Athena to show the DDL statement that would create the table as cataloged by Glue.

Running (in Athena) SHOW CREATE TABLE openaq_gz_realtime_gzipped; yields:

CREATE EXTERNAL TABLE `openaq_gz_realtime_gzipped`(
  `date` struct<utc:string,local:string> COMMENT 'from deserializer', 
  `parameter` string COMMENT 'from deserializer', 
  `location` string COMMENT 'from deserializer', 
  `value` double COMMENT 'from deserializer', 
  `unit` string COMMENT 'from deserializer', 
  `city` string COMMENT 'from deserializer', 
  `attribution` array<struct<name:string,url:string>> COMMENT 'from deserializer', 
  `averagingperiod` struct<value:double,unit:string> COMMENT 'from deserializer', 
  `coordinates` struct<latitude:double,longitude:double> COMMENT 'from deserializer', 
  `country` string COMMENT 'from deserializer', 
  `sourcename` string COMMENT 'from deserializer', 
  `sourcetype` string COMMENT 'from deserializer', 
  `mobile` boolean COMMENT 'from deserializer')
PARTITIONED BY ( 
  `partition_0` string)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
  'paths'='attribution,averagingPeriod,city,coordinates,country,date,location,mobile,parameter,sourceName,sourceType,unit,value') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://openaq-fetches/realtime-gzipped/'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0', 
  'CrawlerSchemaSerializerVersion'='1.0', 
  'UPDATED_BY_CRAWLER'='OpenAQ-realtime-gz-crawler', 
  'averageRecordSize'='832', 
  'classification'='json', 
  'compressionType'='gzip', 
  'objectCount'='150401', 
  'recordCount'='26197990', 
  'sizeKey'='20582896694', 
  'typeOfData'='file')

@ShamamaAfnan commented Nov 1, 2021

Hi, the openaq ddl ran successfully, but I got the error below (screenshot attached) while running the example queries:

@cszc commented Nov 4, 2021

Hi all. There were two things I had trouble figuring out, so I thought I'd share here in case it's helpful to anyone.

  • First, I wanted to partition the table to speed up queries and limit the GBs scanned. That wasn't super straightforward because the partitions aren't in a Hive-compatible format, i.e. day=yyyy-MM-dd. Instead, I had to use partition projection with a custom partition template; you can see the projection definition below in TBLPROPERTIES. This dramatically improved performance.
  • Second, I kept running into JSON errors similar to this one. By adding with serdeproperties ( 'ignore.malformed.json'='true' ), I was able to ignore the errant JSON so that my query would finish. The final DDL I used is below. Edit: @ShamamaAfnan looks like the same problem you ran into.
CREATE EXTERNAL TABLE IF NOT EXISTS `openaq` (
  `date` struct<utc:string,local:string> COMMENT 'from deserializer', 
  `parameter` string COMMENT 'from deserializer', 
  `location` string COMMENT 'from deserializer', 
  `value` float COMMENT 'from deserializer', 
  `unit` string COMMENT 'from deserializer', 
  `city` string COMMENT 'from deserializer', 
  `attribution` array<struct<name:string,url:string>> COMMENT 'from deserializer', 
  `averagingperiod` struct<unit:string,value:float> COMMENT 'from deserializer', 
  `coordinates` struct<latitude:float,longitude:float> COMMENT 'from deserializer', 
  `country` string COMMENT 'from deserializer', 
  `sourcename` string COMMENT 'from deserializer', 
  `sourcetype` string COMMENT 'from deserializer', 
  `mobile` string COMMENT 'from deserializer')
PARTITIONED BY (day string)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
with serdeproperties ( 'ignore.malformed.json'='true' )
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://openaq-fetches/realtime-gzipped'
TBLPROPERTIES (
'transient_lastDdlTime'='1518373755',
'projection.enabled' = 'true',
'projection.day.type' = 'date',
'projection.day.range' = '2017-01-01,2021-11-01',
'projection.day.format' = 'yyyy-MM-dd',
'storage.location.template' = 's3://openaq-fetches/realtime-gzipped/${day}/'
)
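
With the projection above in place, filtering on the day partition column should restrict the scan to the matching S3 prefixes. A sketch (the country, parameter, and date range are placeholders):

```sql
-- Only the 2020-05 day prefixes are scanned, thanks to partition projection.
SELECT parameter, avg(value) AS average, count(value) AS num_measurements
FROM openaq
WHERE day BETWEEN '2020-05-01' AND '2020-05-31'
  AND country = 'IN'
  AND parameter = 'pm25'
  AND value >= 0
GROUP BY parameter
```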
