@jflasher
Created February 23, 2018 14:56
Accessing data older than 90 days from OpenAQ

Currently, only data from the last 90 days is available via the OpenAQ API. However, there is much more data available on OpenAQ and a variety of different access mechanisms. Note also that work is under way to bring back to the API a mechanism for accessing data older than 90 days; details here.

If you're looking to query across all the data or easily export it (or a subset of it), the easiest way to do that currently is with a service like Amazon Athena. I'll provide directions on how to do that below, but at a high level, this lets you run any query you'd like (written in SQL) over the entire dataset. I'll also provide some sample queries so you can see what's possible.

On to the directions!

  1. You will need to create an AWS account if you don't already have one; you can start this process at https://aws.amazon.com/free/. Note that while an AWS account does not cost any money, running a query using Athena will. At current OpenAQ volume sizes, querying the whole dataset costs ~$0.01.

  2. Log in to your AWS account and go to the Athena console at https://console.aws.amazon.com/athena/home?region=us-east-1. Please make sure you're in the US East (N. Virginia) region as this will make your queries faster and save OpenAQ money on data transfer. You can confirm this by looking at the region name in the top right of the console.

  3. There is a lot more documentation on Athena here, but you should now be looking at the main console, where you can enter any query. First, though, you'll need to create a table on top of the OpenAQ data. To do this, copy and paste the table definition from the openaq.ddl file that's also in this gist (shown at the bottom of this page) and click Run Query. In the Results pane at the bottom of the window, you should see Query Successful.

  4. That's it, now you're ready to make any query you'd like! I'll provide some examples below to get you started.

Example queries

A simple query to just get all the data for a particular location (or city or country)

SELECT *
FROM openaq
WHERE location='US Diplomatic Post: Hanoi'

The average of all pm25 measurements for the Baruun 4 zam location in Mongolia

SELECT avg(value) AS "avg_pm25"
FROM openaq
WHERE parameter='pm25'
        AND location='Baruun 4 zam';

The number of measurements over 35 µg/m³ in India

SELECT count(*) AS "num_over"
FROM openaq
WHERE parameter='pm25'
        AND value > 35
        AND country = 'IN'

Top 10 highest pm25 measurements, sorted by value and then date

SELECT *
FROM openaq
WHERE parameter='pm25'
ORDER BY value DESC, date.utc DESC LIMIT 10

Athena also has geospatial support, which means you can do things like ask for all the measurements around a certain latitude and longitude

SELECT *
FROM openaq
WHERE ST_WITHIN(ST_POINT(coordinates.longitude, coordinates.latitude), ST_BUFFER(ST_POINT(4.773,52.428), 0.1));

Notes

  • The great thing about this mechanism is that you don't need to duplicate any data; you simply query with Athena, and it runs against the latest data that OpenAQ maintains.

  • Athena has the full power of SQL behind it, so you can write any query you want; you're not limited by an API that OpenAQ creates. More details on SQL in general here.

  • If you're looking to use the data from these queries outside of visual inspection in Athena, you can click the 'Download CSV' button at the top of the Results pane. The CSV results will also be stored in an Amazon S3 bucket on your account, which you can access programmatically if you'd like.

  • Note that the underlying data pointed to here may change its structure as we figure out the most performant way to make the data available. We'll announce any changes, and if that happens, you'd just need to copy/paste an updated openaq.ddl-type file to update your table.

CREATE EXTERNAL TABLE `openaq`(
`date` struct<utc:string,local:string> COMMENT 'from deserializer',
`parameter` string COMMENT 'from deserializer',
`location` string COMMENT 'from deserializer',
`value` float COMMENT 'from deserializer',
`unit` string COMMENT 'from deserializer',
`city` string COMMENT 'from deserializer',
`attribution` array<struct<name:string,url:string>> COMMENT 'from deserializer',
`averagingperiod` struct<unit:string,value:float> COMMENT 'from deserializer',
`coordinates` struct<latitude:float,longitude:float> COMMENT 'from deserializer',
`country` string COMMENT 'from deserializer',
`sourcename` string COMMENT 'from deserializer',
`sourcetype` string COMMENT 'from deserializer',
`mobile` string COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://openaq-fetches/realtime-gzipped'
TBLPROPERTIES (
'transient_lastDdlTime'='1518373755')
@jflasher commented Jun 13, 2020 via email

ghost commented Jun 17, 2020

Is it possible to take daily averages instead of hourly ones through Amazon Athena? I'm new to SQL and I don't know how to do it.

@jflasher (Author)

Hi @c21nm, the code I provide in https://gist.github.com/jflasher/573525aff9a5d8a966e5718272ceb25a#gistcomment-3339482 does do daily averages. Also cc'ing @sruti here who is working on some averaging tools that may be of interest.
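For reference, a minimal sketch of that kind of daily-average query (the location, parameter, and value filter here are example choices, not necessarily those in the linked comment):

SELECT date_format(date_trunc('day', from_iso8601_timestamp(date.local)), '%Y-%m-%d') AS day,
       avg(value) AS avg_pm25,
       count(value) AS num_measurements
FROM openaq
WHERE parameter = 'pm25'
        AND location = 'US Diplomatic Post: Hanoi'
        AND value >= 0  -- negative values are often error codes
GROUP BY 1
ORDER BY day;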

@amalie-create

Hi, I am new to SQL. How can I get hourly averages instead of daily averages like in your example @jflasher?

@jflasher commented Jun 19, 2020 via email
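As a rough sketch (assuming the same openaq table as above, with an example location), swapping date_trunc('day', ...) for date_trunc('hour', ...) in a daily-average query yields hourly averages:

SELECT date_format(date_trunc('hour', from_iso8601_timestamp(date.local)), '%Y-%m-%d %H:00') AS hour,
       avg(value) AS avg_pm25
FROM openaq
WHERE parameter = 'pm25'
        AND location = 'US Diplomatic Post: Hanoi'  -- example location
        AND value >= 0
GROUP BY 1
ORDER BY hour;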

@TomasDob

Hey guys,
One more question: any idea how to deal with missing data? Right now the days with missing values are just skipped and not provided at all.
I would like the table to provide all daily values; if a value is missing it should show a placeholder (e.g. YYY or 995) instead of just skipping that date.

Appreciate your help.

@jflasher commented Jun 30, 2020 via email
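As a rough sketch of one way to handle this in Athena/Presto (not necessarily the approach from the email reply): generate a calendar of days with sequence/UNNEST, LEFT JOIN the daily averages onto it, and use coalesce to substitute a sentinel value for missing days. The location, date range, and sentinel (-999) are example choices:

WITH calendar AS (
    SELECT CAST(d AS varchar) AS day
    FROM UNNEST(sequence(date '2020-01-01', date '2020-03-31', interval '1' day)) AS t(d)
),
daily AS (
    SELECT date_format(date_trunc('day', from_iso8601_timestamp(date.local)), '%Y-%m-%d') AS day,
           avg(value) AS avg_pm25
    FROM openaq
    WHERE parameter = 'pm25'
            AND location = 'Baruun 4 zam'
            AND value >= 0
    GROUP BY 1
)
SELECT calendar.day,
       coalesce(daily.avg_pm25, -999) AS avg_pm25  -- -999 marks days with no measurements
FROM calendar
LEFT JOIN daily
    ON calendar.day = daily.day
ORDER BY calendar.day;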

@RocketD0g

An update: Data are now accessible from the OpenAQ API going back >2 years instead of 90 days.


adjanip commented Oct 5, 2020

@jflasher I am trying to combine the advice you gave to TomasDob and scand363, but I am not getting the right thing.
I would like to get the average daily value (for all 7 variables) for all individual monitor locations in the Netherlands between January 16, 2020 and May 16, 2020.

I ran:

SELECT date_format(date_trunc('day', from_iso8601_timestamp(date.local)), '%Y-%m-%d') as day, parameter, avg(value) as average, count(value) as num_measurements
FROM openaq
WHERE country = 'NL' and date.local between '2020-01-16' and '2020-05-17'
GROUP BY date_trunc('day', from_iso8601_timestamp(date.local)), parameter
ORDER BY day asc, parameter

Which gives me correctly

  • all 7 variables
  • the correct dates

but incorrectly

  • only one value for each variable (I assume the national average of that variable for the whole country)
  • All the daily values are negative for all of the dates


maelle commented Oct 12, 2020

@RocketD0g @jflasher where is the new data availability documented, btw? https://docs.openaq.org/#api-Measurements-GetV1Measurements still mentions "Currently the API only supports a rolling 90 day archive of the data." I'd like to update ropenaq (a bit late, sorry).

@RocketD0g

Hey @maelle! I've made a PR for the changes, but I don't think they've been added in yet. Here's some documentation though: https://medium.com/@openaq/openaq-extends-api-and-download-tool-access-from-90-days-to-2-years-3697540c85a3


maelle commented Oct 12, 2020

awesome, thank you!

@jflasher (Author)

@adjanip I believe the below will get you what you're looking for. I am filtering out values less than 0 (often error codes) which can erroneously affect the results. I am also grouping by location.

SELECT date_format(date_trunc('day', from_iso8601_timestamp(date.local)), '%Y-%m-%d') as day, location, parameter, avg(value) as average, count(value) as num_measurements
FROM openaq
WHERE country = 'NL' and date.local between '2020-01-16' and '2020-05-17' and value >= 0
GROUP BY date_trunc('day', from_iso8601_timestamp(date.local)), parameter, location
ORDER BY day asc, parameter

@Ester452

Hi! I get this error attached below when copying and pasting the openaq.ddl.
[error screenshot]

For my master's thesis I need the daily values of the main air pollutants (pm25, pm10, NO2, CO) over the last 10 or 15 years for some Italian regions. I'm completely new to coding and I really have no idea how to formulate that! If anyone could help I'd be really, really glad.
Thank you!

@jflasher (Author)

Hi @Ester452, it looks like you need to give your output bucket a more unique name so that it can be created. You may want to ask any questions over on the OpenAQ Slack channels (https://openaq-slackin.herokuapp.com/), as this gist is not seen by most of the community.


sruti commented Oct 16, 2020

@Ester452 that slack link doesn't work anymore, use this one instead: https://join.slack.com/t/openaq/shared_invite/zt-gq14aew7-CVdp131g7TR7o9iiXIVDLw

@alfredocambera

You can also use AWS Glue with the S3 bucket and it will create the DDL for the table.


MaxPowerWasTaken commented Apr 19, 2021

Hi Jeff,

I landed here from the recommendation at this blog post to use your ddl file to query OpenAQ data in Athena.

However, I notice this ddl query ends with:

TBLPROPERTIES (
  'transient_lastDdlTime'='1518373755')

...which is about three years ago, so I'm guessing I don't want to use that unchanged.

Would it be better to delete that TBLPROPERTIES statement altogether, or to keep it and sub in some more recent timestamp?

Thanks!

@jflasher (Author)

Hi @MaxPowerWasTaken, you should be good to delete it or leave it as is. It's just internal metadata; I don't believe it matters in this case.

@MaxPowerWasTaken

Hi thanks @jflasher.

When I ran this ddl query (without the TBLPROPERTIES statement) I got the following error:
FAILED: ParseException line 5:0 cannot recognize input near 'ROW' 'FORMAT' 'SERDE'

I am using the new Athena engine v2, so maybe that has something to do with it. In any case, here's how I got around that error, in case it's helpful to anyone else:

I cataloged the s3 data by using a glue crawler to register it as a hive table, just as the ddl statement would. Once glue cataloged the table, I was also able to use Athena to show the ddl statement that would create the table as cataloged by glue.

Running (in Athena) SHOW CREATE TABLE openaq_gz_realtime_gzipped; yields:

CREATE EXTERNAL TABLE `openaq_gz_realtime_gzipped`(
  `date` struct<utc:string,local:string> COMMENT 'from deserializer', 
  `parameter` string COMMENT 'from deserializer', 
  `location` string COMMENT 'from deserializer', 
  `value` double COMMENT 'from deserializer', 
  `unit` string COMMENT 'from deserializer', 
  `city` string COMMENT 'from deserializer', 
  `attribution` array<struct<name:string,url:string>> COMMENT 'from deserializer', 
  `averagingperiod` struct<value:double,unit:string> COMMENT 'from deserializer', 
  `coordinates` struct<latitude:double,longitude:double> COMMENT 'from deserializer', 
  `country` string COMMENT 'from deserializer', 
  `sourcename` string COMMENT 'from deserializer', 
  `sourcetype` string COMMENT 'from deserializer', 
  `mobile` boolean COMMENT 'from deserializer')
PARTITIONED BY ( 
  `partition_0` string)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
  'paths'='attribution,averagingPeriod,city,coordinates,country,date,location,mobile,parameter,sourceName,sourceType,unit,value') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://openaq-fetches/realtime-gzipped/'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0', 
  'CrawlerSchemaSerializerVersion'='1.0', 
  'UPDATED_BY_CRAWLER'='OpenAQ-realtime-gz-crawler', 
  'averageRecordSize'='832', 
  'classification'='json', 
  'compressionType'='gzip', 
  'objectCount'='150401', 
  'recordCount'='26197990', 
  'sizeKey'='20582896694', 
  'typeOfData'='file')

@ShamamaAfnan

Hi, the openaq DDL ran successfully, but I got this error while running the example queries:
[error screenshot]


cszc commented Nov 4, 2021

Hi All. There were two things that I had trouble figuring out, so thought I'd share here in case it's helpful to anyone.

  • First, I wanted to be able to partition the table to speed up queries and limit GBs scanned. That wasn't super straightforward, because the S3 prefixes aren't in a Hive-compatible format (i.e. day=yyyy-MM-dd). Instead, I had to use partition projection with a custom partition template. You can see the projection definition below in the TBLPROPERTIES. This dramatically improved performance.
  • Secondly, I kept running into JSON errors similar to this one. By adding with serdeproperties ( 'ignore.malformed.json'='true' ), I was able to ignore the errant JSON so that my query would finish. The final DDL that I used is below. edit: @ShamamaAfnan Looks like the same problem you ran into
CREATE EXTERNAL TABLE IF NOT EXISTS `openaq` (
  `date` struct<utc:string,local:string> COMMENT 'from deserializer', 
  `parameter` string COMMENT 'from deserializer', 
  `location` string COMMENT 'from deserializer', 
  `value` float COMMENT 'from deserializer', 
  `unit` string COMMENT 'from deserializer', 
  `city` string COMMENT 'from deserializer', 
  `attribution` array<struct<name:string,url:string>> COMMENT 'from deserializer', 
  `averagingperiod` struct<unit:string,value:float> COMMENT 'from deserializer', 
  `coordinates` struct<latitude:float,longitude:float> COMMENT 'from deserializer', 
  `country` string COMMENT 'from deserializer', 
  `sourcename` string COMMENT 'from deserializer', 
  `sourcetype` string COMMENT 'from deserializer', 
  `mobile` string COMMENT 'from deserializer')
PARTITIONED BY (day string)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
with serdeproperties ( 'ignore.malformed.json'='true' )
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://openaq-fetches/realtime-gzipped'
TBLPROPERTIES (
'transient_lastDdlTime'='1518373755',
'projection.enabled' = 'true',
'projection.day.type' = 'date',
'projection.day.range' = '2017-01-01,2021-11-01',
'projection.day.format' = 'yyyy-MM-dd',
'storage.location.template' = 's3://openaq-fetches/realtime-gzipped/${day}/'
)
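As an illustration (a sketch assuming the projected table above), filtering on the day partition column keeps Athena from scanning the whole bucket and only reads the matching s3://openaq-fetches/realtime-gzipped/<day>/ prefixes:

SELECT parameter,
       avg(value) AS average,
       count(value) AS num_measurements
FROM openaq
WHERE day BETWEEN '2021-01-01' AND '2021-01-31'
        AND country = 'NL'
        AND value >= 0
GROUP BY parameter;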

@kirazcloud

Hi,
If you get a "row is not a valid JSON object" error like
Row is not a valid JSON Object - JSONException: Unterminated string at 201 [character 202 line 1]
try adding 'ignore.malformed.json'='true' to the serde properties:
https://stackoverflow.com/questions/71102999/ignore-malformated-json-from-aws-athena-query

@jflasher (Author)

cc/ @russbiggs so he knows this exists.

@russbiggs

OpenAQ will be deprecating this bucket and access method in the near future. An updated bucket and explanation of how to access the data archive via AWS Athena and the AWS CLI is available at docs.openaq.org/docs/accessing-openaq-archive-data
