Accessing Cloud-Based Hubverse Data

Cloud-Based Hubs

The Hubverse is in the process of making hub data available via publicly accessible AWS S3 buckets.

Cloud-based hubs "mirror" the data stored in a hub's GitHub repository and provide a few advantages for data consumers:

  • No need to clone a repository to access data
  • Cloud-based model-output files are in parquet format, which is easier to work with and more performant

The examples here use the CDC's FluSight Forecast Hub, which is available in the following S3 bucket:

s3://cdcepi-flusight-forecast-hub

The sections below provide examples of accessing cloud-based hub data via DuckDB, Python, and R.

Alternatively, you can use the AWS CLI to copy the files from S3 to your local machine for use with your favorite data analysis software.

List the available directories in the hub's S3 bucket:

aws s3 ls s3://cdcepi-flusight-forecast-hub/ --no-sign-request
                           PRE auxiliary-data/
                           PRE hub-config/
                           PRE model-metadata/
                           PRE model-output/
                           PRE raw/
                           PRE target-data/

Copy files from a specific directory (in this case, model-output) to a local machine:

aws s3 cp s3://cdcepi-flusight-forecast-hub/model-output/ . --recursive --no-sign-request
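If you only need one model's files, you can narrow the copy with the CLI's --exclude/--include filters. This is a sketch; the model directory name (UMass-flusion) is just an example, so substitute the model you want:

aws s3 cp s3://cdcepi-flusight-forecast-hub/model-output/ . --recursive --exclude "*" --include "UMass-flusion/*" --no-sign-request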

Using SQL to access S3-based model-output files

Setup

Install the DuckDB CLI: https://duckdb.org/docs/installation/?version=stable&environment=cli

Query the model-output parquet files directly from S3

  1. Open the DuckDB CLI:

    duckdb
  2. Install the httpfs extension that allows cloud-based data access:

    INSTALL httpfs;
  3. Override the http_keep_alive setting to prevent errors when accessing many S3 files simultaneously:

    SET http_keep_alive=false;

Ad-hoc query

Run a regular SQL query, specifying the S3 location in the FROM clause (see DuckDB's read_parquet documentation for more information):

SELECT reference_date, target, count(*) as num_submissions
FROM read_parquet('s3://cdcepi-flusight-forecast-hub/model-output/*/*.parquet', filename=true, union_by_name=true)
WHERE location = 'US' and output_type = 'quantile'
GROUP BY ALL;

Note: to query data for a specific model, it is faster to add the model_id to the S3 path in the FROM clause than to add the model_id to the WHERE clause (because the database engine reads far less data from S3).

For example:

SELECT reference_date, target, count(*) as num_submissions
FROM read_parquet('s3://cdcepi-flusight-forecast-hub/model-output/UMass-flusion/*.parquet', filename=true, union_by_name=true)
WHERE location = 'US' and output_type = 'quantile'
GROUP BY ALL;

RECOMMENDED: Creating an in-memory table for faster interactive exploration

Pulling large chunks of data from S3 can take a while. If you plan to run repeated queries, you can create an in-memory table so you only need to pull the data from S3 once.

CREATE TABLE model_output AS
SELECT *
FROM read_parquet('s3://cdcepi-flusight-forecast-hub/model-output/*/*.parquet', filename=true, union_by_name=true); 

Then you can query the in-memory model_output table directly:

SELECT reference_date, target, count(*) as num_submissions
FROM model_output
WHERE location = 'US' and output_type = 'quantile'
GROUP BY ALL;

Saving query output

Detailed information: https://duckdb.org/docs/api/cli/dot_commands#output-writing-results-to-a-file

For example, saving query results to a .csv:

.mode csv
.once model_output_data.csv

SELECT reference_date, target, count(*) as num_submissions
-- this example uses a previously-created in-memory table
FROM model_output
WHERE location = 'US' and output_type = 'quantile'
GROUP BY ALL;

Additional information about using the DuckDB CLI: https://duckdb.org/docs/api/cli/overview.html
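
If you prefer to run these queries from a script rather than the CLI, DuckDB's Python API accepts the same SQL. The following is a minimal sketch, assuming the duckdb and pandas Python packages are installed:

import duckdb

# Connect to an in-memory DuckDB database
con = duckdb.connect()

# Same setup as the CLI steps above: httpfs enables reading directly from S3,
# and disabling http_keep_alive avoids errors when reading many S3 files at once
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
con.execute("SET http_keep_alive = false;")

# Run the ad-hoc query shown above and return the result as a pandas dataframe
query = """
    SELECT reference_date, target, count(*) AS num_submissions
    FROM read_parquet(
        's3://cdcepi-flusight-forecast-hub/model-output/*/*.parquet',
        filename=true, union_by_name=true)
    WHERE location = 'US' AND output_type = 'quantile'
    GROUP BY ALL
"""
num_submissions = con.execute(query).df()
print(num_submissions.head())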

Using Python to access S3-based model-output files

pyarrow and pandas

The Python ecosystem isn't as forgiving as hubData and DuckDB about handling parquet files with different schemas (open issue).

For now, you can get model-output data into a pandas dataframe by passing an explicit schema to the pandas read_parquet method:

import pandas as pd
import pyarrow.parquet as pq

# Create a pyarrow table from a single model-output file that has a str datatype for both output_type_id and location
mo_table = pq.read_table('s3://cdcepi-flusight-forecast-hub/model-output/UMass-trends_ensemble/2024-05-04-UMass-trends_ensemble.parquet')

# Pass the above table's schema into the pandas read_parquet method
mo = pd.read_parquet('s3://cdcepi-flusight-forecast-hub/model-output/', schema=mo_table.schema)
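
From there, you can filter and summarize the dataframe with standard pandas operations. For example, a sketch that mirrors the US quantile query used in the DuckDB examples above:

# Filter to US-level quantile forecasts
us_quantiles = mo[(mo["location"] == "US") & (mo["output_type"] == "quantile")]

# Count submissions by reference_date and target
summary = (
    us_quantiles
    .groupby(["reference_date", "target"])
    .size()
    .reset_index(name="num_submissions")
)
print(summary.head())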

Using R to access S3-based model-output files

hubData package

hubData is a hubverse-maintained R package that presents model-output data as a tibble.

To access the model-output data:

library(dplyr)
library(hubData)

hub_path_cloud <- s3_bucket('cdcepi-flusight-forecast-hub/')
data_cloud <- connect_hub(hub_path_cloud, file_format="parquet") %>% collect()

filtered <- data_cloud %>%
  filter(output_type == "quantile", location == "US")

To access target data:

library(aws.s3)
target_data <- aws.s3::s3read_using(read.csv, object="s3://cdcepi-flusight-forecast-hub/target-data/target-hospital-admissions.csv")