
@atrisovic
Last active February 20, 2023 15:40
Form to document new analytic data on FASSE

Step 1: Check analytic data

Is the data you need already on FASSE? Check out the catalog here: https://nsaph.info/analytic.html#analytic-data

If it is not, see step 2.

Step 2: Fill in the form below and add it in the comments here.

The format of the form goes like this:

* - key_name 
  - value 

Below is the form for analytic data documentation with key_names. Fill in the value fields or choose between the options.

One dataset should correspond to one form. If your dataset is split into multiple files of the same format (e.g., admissions_2011.fst, admissions_2012.fst, etc.), it is fine to complete one form.

* - dataset_name
  - a meaningful name (not filename)
* - dataset_author
  - Name Surname
* - date_created
  - Jun 15 2022
* - data_source
  - MedPar (admissions), MBSF (denominator), Medicaid MAX, other (specify)
* - spatial_coverage
  - US
* - spatial_resolution
  - zipcode, city, county, state
* - temporal_coverage
  - 1999-2016
* - temporal_resolution
  - daily, monthly, annually
* - description
  - Write in free text what (if any) processing was done to the data sources. Were there any selections (cuts), data quality checks and aggregations?
* - rce_location
  - `~/shared_space/TEXT`
* - fasse_location
  - `/n/dominici_nsaph_l3/projects/analytic/TEXT`

Optional fields (choose as applicable):

* - publication (if this data was used in publication)
  - URL
* - GitHub repository/directory on how the data was processed
  - URL
* - exposures
  - What were the air pollution/exposure data sources used to create this data file? 
* - confounders
  - What were the confounder data sources used to create this dataset?
* - meterological
  - What were the meteorological data sources used to create this data file?
* - other
  - What other data sources were used to create this data?
* - size
  - 1.2 GB
* - files
```
   ├── dataset_2011.fst
   ├── ...
   └── dataset_2016.fst
```
* - header (see in R with str(dat))
```
QID  : Factor 
ADATE: Date
year : num  
```
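
A filled-in form can also be generated from a script rather than typed by hand. The following is a minimal Python sketch, assuming the field values are kept in a dictionary; the function name `render_form` and the example values are hypothetical and not part of any NSAPH tooling.

```python
def render_form(fields):
    """Render key/value pairs in the two-line `* - key` / `  - value` format."""
    lines = []
    for key, value in fields.items():
        lines.append(f"* - {key}")
        lines.append(f"  - {value}")
    return "\n".join(lines)


# Hypothetical example values, for illustration only.
example = {
    "dataset_name": "a meaningful name (not filename)",
    "spatial_resolution": "zipcode",
    "temporal_coverage": "1999-2016",
}
print(render_form(example))
```

The printed text can then be pasted into the comments or into the handbook entry.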

Embed the form here to get the JupyterBook (NSAPH handbook) entry for nsaph.info/analytic.html:

`````{dropdown} 1. Meaningful dataset name
```{list-table}
:header-rows: 0

COPY AND PASTE THE FORM HERE

```
`````
@JochemKlompmaker

DATASETS1

    • dataset_name
    • aggregated ADRD cohort Medicare
    • dataset_author
    • Jochem Klompmaker
    • date_created
    • February 2022
    • data_source
    • MedPar (admissions), MBSF (denominator)
    • spatial_coverage
    • US
    • spatial_resolution
    • zipcode
    • temporal_coverage
    • 2000-2016
    • temporal_resolution
    • annually
    • processing_description
    • Denominator file linked with hospitalization data and merged with confounders and exposures (NDVI, blue space, park cover, NO2, PM2.5, ozone, temperature, humidity). Person records were aggregated by zip code, year and individual demographics
    • rce_location
    • ~/shared_space/ci3_health_data/medicare/gen_admission/1999_2016/Klompmaker/merged_data/alz2/
    • fasse_location
    • /n/dominici_nsaph_l3/projects/nature_adrd_poisson/data/
    • files
      ├── aggregate_ALZ.fst
      ├── aggregate_excl_1yr_hosp_ALZ.fst
      ├── aggregate_ALZ_65yrs.fst
      ├── aggregate_ALZ_75yrs.fst
      └── aggregate_ALZ_85yrs.fst

DATASETS2

    • dataset_name
    • aggregated PD cohort Medicare
    • dataset_author
    • Jochem Klompmaker
    • date_created
    • February 2022
    • data_source
    • MedPar (admissions), MBSF (denominator)
    • spatial_coverage
    • US
    • spatial_resolution
    • zipcode
    • temporal_coverage
    • 2000-2016
    • temporal_resolution
    • annually
    • processing_description
    • Denominator file linked with hospitalization data and merged with confounders and exposures (NDVI, blue space, park cover, NO2, PM2.5, ozone, temperature, humidity). Person records were aggregated by zip code, year and individual demographics
    • rce_location
    • ~/shared_space/ci3_health_data/medicare/gen_admission/1999_2016/Klompmaker/merged_data/par2/
    • fasse_location
    • /n/dominici_nsaph_l3/projects/nature_adrd_poisson/data/
    • files
      ├── aggregate_PAR.fst
      ├── aggregate_excl_1yr_hosp_PAR.fst
      ├── aggregate_PAR_65yrs.fst
      ├── aggregate_PAR_75yrs.fst
      └── aggregate_PAR_85yrs.fst

@wxwx1993

wxwx1993 commented Oct 18, 2022

    • dataset_name
    • Daily County Level Heatwave Associated Hospitalizations
    • dataset_author
    • Ben Sabath
    • date_created
    • July 10, 2020
    • data_source
    • MedPar (admissions), MBSF (denominator), Medicaid MAX, other (specify)
    • spatial_coverage
    • US
    • spatial_resolution
    • county
    • temporal_coverage
    • 2006-2016, 1999-2016
    • temporal_resolution
    • daily
    • processing_description
    • FIPS code, race, sex, age, and dual eligibility were determined for each
      case based on the information in the patient summary file for that
      individual in the year of their admission. The denominator for each
      observation is calculated monthly and contains all individuals who are
      eligible for Fee for Service (FFS) hospitalization coverage and have not
      died prior to that month. The CCS codes included were 2, 50, 55, 114, 157, 159, and 244.
      ICD processing was done using the ICD package (Wasey 2018). The author of
      this package asks that it be cited in papers using data that was created
      using the package.
    • rce_location
    • ~/shared_space/ci3_health_data/medicare/heat_related
    • fasse_location
    • /n/dominici_nsaph_l3/projects/analytic/heat_related
    • publication (if this data was used in publication)
    • URL
    • GitHub repository/directory on how the data was processed
    • URL
    • size
    • 6.7 GB

@atrisovic
Author

Hi All,

All of the following have been moved today:

"~/shared_space/ci3_health_data/medicaid/cvd/2010_2011/desouza-2",
"~/shared_space/ci3_health_data/medicare/gen_admission/1999_2016/Klompmaker/merged_data/par2/",
"~/shared_space/ci3_health_data/medicare/gen_admission/1999_2016/Klompmaker/merged_data/alz2/",
"~/shared_space/ci3_health_data/medicare/gen_admission/1999_2016/Klompmaker/merged_data/cbv2/",
"~/shared_space/ci3_health_data/medicare/gen_admission/1999_2016/Klompmaker/merged_data/chd2/",
"~/shared_space/ci3_health_data/medicare/gen_admission/1999_2016/Klompmaker/merged_data/cvd2/",
"~/shared_space/ci3_health_data/medicare/heat_related"

I updated the handbook - please all have a look at the latest entries at https://nsaph.info/analytic.html and make sure your data looks good. If you'd like to add more info (more documentation is always better), please do it directly on GitHub as a pull request (https://github.com/NSAPH/handbook).

@JochemKlompmaker I added relative data paths (for all 5 datasets) to your workspaces (temperature_cvd_poisson and nature_ADRD_poisson), so you should be able to access the data directly from there.

@atrisovic
Author

@macork

macork commented Oct 20, 2022

    • dataset_name
    • Aggregated 2000-2016 Medicare Mortality Data with PM2.5 Exposure and ZIP code level variables
    • dataset_author
    • Xiao Wu, Ben Sabath
    • date_created
    • 2020
    • data_source
    • Medicaid, Exposure Data, Census Data
    • spatial_coverage
    • US
    • spatial_resolution
    • zipcode
    • temporal_coverage
    • 2000-2016
    • temporal_resolution
    • Annually
    • processing_description
    • See Xiao’s paper for processing description.
    • rce_location
    • ~/shared_space/ci3_mic6949/input_data/aggregate_data.RDS
    • fasse_location
    • /n/dominici_nsaph_l3/projects/ERC_Simulation/Medicare_data/aggregate_medicare_data_2000to2016
    • publication (if this data was used in publication)

@kateburrows

kateburrows commented Oct 20, 2022

    • dataset_name
    • Daily Florida Hospitalization Counts by Zip
    • dataset_author
    • Ben Sabath, Kate Burrows
    • date_created
    • February 07 2020
    • data_source
    • MedPar (admissions), MBSF (denominator)
    • spatial_coverage
    • Florida
    • spatial_resolution
    • zipcode
    • temporal_coverage
    • 1999-2016
    • temporal_resolution
    • daily
    • processing_description
    • Denominator file linked with hospitalization data. This is the raw unprocessed data.
    • rce_location
    • ~/shared_space/ci3_health_data/medicare/gen_admission/1999_2016/burrows/cache_data
    • fasse_location
    • /n/dominici_nsaph_l3/projects/tc-hospitalization_disparities-poisson

@kevinleec

    • dataset_name
    • IHD medicare hospitalizations (2005)
    • dataset_author
    • Cory Zigler
    • date_created
    • Oct 4 2018
    • data_source
    • MedPar (admissions)
    • spatial_coverage
    • US
    • spatial_resolution
    • zipcode
    • temporal_coverage
    • 2005
    • temporal_resolution
    • annually
    • processing_description
    • N/A
    • rce_location
    • ~/shared_space/ci3_analysis/zigler_lab/projects/BipartiteInterference_GPS/BipartiteInterference_GPS/Data/out.zip_pp.rda
    • fasse_location
    • /n/dominici_nsaph_l3/projects/emissions-ihd-bipartite

@atrisovic
Author

@kateburrows and @kevinleec, is there a git repository for your data, i.e., one showing how it was created?

@seulkeeheo

seulkeeheo commented Oct 21, 2022

    • dataset_name
    • Whanhee Lee’s data for hospitalization for kidney diseases
    • dataset_author
    • Ana Trisovic
    • date_created
    • Jun 15 2022
    • data_source
    • MedPar (admissions), MBSF (denominator)
    • spatial_coverage
    • US
    • spatial_resolution
    • zipcode
    • temporal_coverage
    • 2000-2016
    • temporal_resolution
    • annually
    • processing_description
    • Special modifications for the kidney diseases for numerators and denominators (people at risk) by Whanhee Lee.
    • rce_location
    • ~/shared_space/whanhee_revisions/data/final.csv
    • fasse_location
    • /n/dominici_nsaph_l3/projects/whanhee_kidney/final.csv

@kateburrows

kateburrows commented Oct 23, 2022

@atrisovic no, there is not a repository for this dataset. Thanks for checking

@kevinleec

@atrisovic Not to my knowledge, no!

@atrisovic
Author

atrisovic commented Oct 24, 2022

@seulkeeheo the data was already on FASSE, so I documented it here: https://nsaph.info/analytic.html#hospitalizations-for-kidney-disease-and-comorbidities. Thanks for filling in the form :)

@macork I don't see this folder in the shared space: ~/shared_space/ci3_mic6949
Are you sure this is the correct dir? Can you send me the full path?

@yycome

yycome commented Oct 25, 2022

    • dataset_name
    • ZIP code-level PM2.5, PM2.5 components, ozone, and NO2 in the contiguous US
    • dataset_author
    • Yaguang Wei
    • date_created
    • Oct 19, 2022
    • data_source
    • Gridded PM2.5, PM2.5 components, ozone, and NO2; Esri ZIP code area and point files; U.S. ZIP code database.
    • spatial_coverage
    • US
    • spatial_resolution
    • zipcode
    • temporal_coverage
    • 2000-2016 for PM2.5, ozone, and NO2; 2000-2019 for PM2.5 components.
    • temporal_resolution
    • daily, annually
    • processing_description
    • For general ZIP Codes with a polygon representation, we estimated their pollution levels by averaging the predictions of grid cells whose centroids lie inside the polygon of that ZIP Code. For other ZIP Codes, such as Post Offices or large-volume single customers, we treated them as a single point and predicted their pollution levels by assigning the predictions of the nearest grid cell.
    • These are updated ZIP code-level predictions. We filled in the missing values for grids, and added about 200 zip codes that are missing in the Esri files each year. The geographic information for the additional zip codes is extracted from US ZIP code database.
    • rce_location
    • NA
    • fasse_location
    • To be uploaded
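
The assignment rule in processing_description above can be illustrated with a minimal Python sketch. It is not the code used to produce this dataset: it assumes planar coordinates, simple polygons given as vertex lists, and grid cells given as hypothetical `(x, y, value)` tuples, and all function names are made up for illustration.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon (list of (x, y) vertices)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def polygon_zip_mean(polygon, cells):
    """Average the predictions of grid cells whose centroids fall inside the polygon."""
    values = [v for (cx, cy, v) in cells if point_in_polygon(cx, cy, polygon)]
    return sum(values) / len(values) if values else None


def point_zip_value(px, py, cells):
    """Assign the prediction of the grid cell whose centroid is nearest to the point."""
    nearest = min(cells, key=lambda c: (c[0] - px) ** 2 + (c[1] - py) ** 2)
    return nearest[2]


# Hypothetical grid cells: (centroid x, centroid y, predicted value).
cells = [(0.5, 0.5, 10.0), (1.5, 0.5, 20.0), (5.0, 5.0, 30.0)]
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(polygon_zip_mean(square, cells))
print(point_zip_value(4.0, 4.0, cells))
```

A polygon that contains no grid-cell centroid returns `None` in this sketch; how such cases were handled in the real data is not stated here.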

@macork

macork commented Oct 25, 2022

@atrisovic Hmm yeah, I'm sure the directory is ~/shared_space/ci3_mic6949/input_data/aggregate_data.RDS. Is it possible that this directory is not public?

@atrisovic
Author

Moved:

  • ~/shared_space/whanhee_revisions/data/final.csv
  • ~/shared_space/ci3_analysis/zigler_lab/projects/BipartiteInterference_GPS/BipartiteInterference_GPS/Data
  • ZIP code-level PM2.5, PM2.5 components, ozone, and NO2 in the contiguous US
  • ~/shared_space/ci3_health_data/medicare/gen_admission/1999_2016/burrows/cache_data

To go:

  • ~/shared_space/ci3_mic6949/input_data/aggregate_data.RDS

@lhenneman

    • dataset_name
    • coal pm2.5 source impacts
    • dataset_author
    • Lucas Henneman
    • date_created
    • Sep 14 2022
    • data_source
    • HyADS exposure modeling
    • spatial_coverage
    • US
    • spatial_resolution
    • zipcode
    • temporal_coverage
    • 1999-2020
    • temporal_resolution
    • annually
    • processing_description
    • NA
    • rce_location
    • /nfs/home/H/henneman/shared_space/ci3_nsaph/LucasH/disperseR/main/output/zips_model.lm.cv_single_poly
    • fasse_location
    • /n/dominici_nsaph_l3/projects/analytic/coal_exposure_pm25
    • publication (if this data was used in publication)
    • NA
    • exposures
    • This was created with the HyADS model using emissions from EPA's CAMD database
    • confounders
    • What were the confounder data sources used to create this dataset?
    • meterological
    • NOAA/NCAR reanalysis data
    • other
    • What other data sources were used to create this data?
    • size
    • 1.2 GB
    • files
   ├── zips_pm25_total_1999.fst
   ├── ...
   ├── zips_pm25_total_2020.fst
   ├── zips_pm25_byunit_1999.fst
   ├── ...
   └── zips_pm25_byunit_2020.fst

@seulkeeheo

    • dataset_name
    • Whanhee Lee’s data for hospitalization for kidney diseases
    • dataset_author
    • Ana Trisovic
    • date_created
    • Jun 10 2022
    • data_source
    • MedPar (admissions), MBSF (denominator)
    • spatial_coverage
    • US
    • spatial_resolution
    • zipcode
    • temporal_coverage
    • 2000-2016
    • temporal_resolution
    • annually
    • processing_description
    • Special modifications for the kidney diseases for numerators and denominators (people at risk) by Whanhee Lee.
    • rce_location
    • ~/shared_space/whanhee_revisions/data/final_JUL10.csv
    • fasse_location
    • /n/dominici_nsaph_l3/projects/whanhee_kidney/final_JUL10.csv

@atrisovic
Author

@seulkeeheo

seulkeeheo commented Nov 1, 2022

> Hey @seulkeeheo,
> This is the same file. It was already transferred. See https://nsaph.info/analytic.html#hospitalizations-for-kidney-disease-and-comorbidities

Hi @atrisovic, thanks for confirming! Since the data file name was different, I was not sure. I will now move on to deleting the files in RCE.

@daniellebraun

@atrisovic if the data file name is different, I don't think it's the same file. The documentation you refer to is for final.csv, and @seulkeeheo is asking about final_JUL10.csv.

@atrisovic
Author

Hi @daniellebraun I renamed it. It is for sure the same.

@daniellebraun

I'm a bit confused/worried about reproducibility: is the file called final.csv or final_JUL10.csv?

@daniellebraun

And in your git it's actually "data/final_backup.csv".
Now I'm even more worried...

@seulkeeheo

seulkeeheo commented Nov 1, 2022

> And in your git it's actually "data/final_backup.csv".
> Now I'm even more worried...

Sorry. I am also worried, because I found several files that look like the dataset Whanhee used in his latest analysis. So far, I have only found 'final.csv' and 'final_JUL10.csv' in Whanhee's RCE folder.

@atrisovic
Author

Hi @seulkeeheo and @daniellebraun, there is nothing to be concerned about. If needed, I could clean up the RCE folders.
I renamed the file "final_JUL10.csv" to "final.csv" before storing it in /analytic and documenting it in the catalog.

@daniellebraun

In order to be able to reproduce the pipeline, the file names should match the ones in your Python code, which uses final_backup.csv, not final.csv or final_JUL10.csv. Renaming files by hand is horrible practice and will create a lot of problems down the line. Why would you rename the file? It also creates issues if Whanhee's code relies on final_JUL10.csv. This is still VERY concerning. And it seems that the RCE folder contains both final.csv and final_JUL10.csv, so did you move final_JUL10.csv and then rename it?
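
One way to check whether two differently named files have identical content is to compare checksums. The following is a minimal Python sketch, assuming both copies are readable from the same machine; the helper names `sha256_of` and `same_content` are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files have byte-for-byte identical content."""
    return sha256_of(path_a) == sha256_of(path_b)
```

For example, `same_content("final.csv", "final_JUL10.csv")` would return `True` only if the rename left the bytes unchanged. Recording the digest next to the catalog entry would let anyone repeat the check later.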

@marissachilds

  • dataset_name
    • Predicted daily smoke PM2.5 over the Contiguous US, 2006 - 2020
  • dataset_author
    • Marissa Childs
  • date_created
    • October 24, 2020
  • data_source
    • other (exposure predictions)
  • spatial_coverage
    • Contiguous US
  • spatial_resolution
    • originally 10 km; aggregated to ZCTA, census tract, and county by area- and population-weighted averages
  • temporal_coverage
    • 2006 - 2020
  • temporal_resolution
    • daily
  • processing_description
    • none
  • rce_location
    • ??
  • fasse_location
    • ??
  • publication (if this data was used in publication)
  • GitHub repository/directory on how the data was processed
  • exposures
    • PM2.5 from smoke
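
The area- and population-weighted averaging named under spatial_resolution can be sketched in a few lines of Python. This is an illustration under assumed inputs, not the code used for this dataset; the function name and the numbers are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted average of grid-cell values; weights may be areas or populations."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total


# Hypothetical grid cells overlapping one county.
smoke_pm25 = [2.0, 4.0]       # predicted values per cell
overlap_area = [1.0, 1.0]     # area of each cell inside the county
population = [100, 300]       # people living in each cell

print(weighted_mean(smoke_pm25, overlap_area))
print(weighted_mean(smoke_pm25, population))
```

With these made-up numbers, the population-weighted value is pulled toward the more populated cell, which is why the two aggregations can differ for the same unit.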

@atrisovic
Author

@lhenneman and @macork your data has now been transferred and documented at https://nsaph.info/analytic.html

~/shared_space/ci3_mic6949/input_data/aggregate_data.RDS
/nfs/home/H/henneman/shared_space/ci3_nsaph/LucasH/disperseR/main/output/zips_model.lm.cv_single_poly

@danielmork

danielmork commented Jan 11, 2023

    • dataset_name
    • Space weather data
    • dataset_author
    • Carolina L Zilli Vieira
    • date_created
    • Oct 17 2022
    • data_source
    • NASA (solar and geomagnetic activity parameters), DAAC NASA (solar radiation), BARTOL Neutron Station (neutrons)
    • Is this all from the same source? In either case URLs are needed.
    • spatial_coverage
    • Global UTC converted to local time
    • spatial_resolution
    • zipcode, city, county, state
    • Did you download the data in all these resolutions, or did you aggregate it?
    • temporal_coverage
    • 1996-2022
    • temporal_resolution
    • daily, monthly, annually
    • Same as spatial resolution? Is this the original format or derived?
    • processing_description
    • raw data converted to local time
    • Carolina's email suggests the data is not processed?
    • fasse_location
    • /n/dominici_nsaph_l3/exposures/solar_activity
    • size
    • TBD

@vieiraclz

Data URLs:
Solar activity data: https://omniweb.gsfc.nasa.gov/form/dx1.html
Neutron data: https://neutronm.bartol.udel.edu/
Solar radiation: https://daac.ornl.gov/

Yes, we processed the data by converting it from UTC to US time zones. This source does not provide spatially resolved data, so we converted the global UTC data to US local time and then assigned the local-time-zone data to counties. The numbers change a little by location, depending on the time zone.
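
A conversion of this kind can be sketched in Python as below. This is only an illustration, not the processing code for this dataset: it assumes fixed standard-time offsets and ignores daylight saving time, and the names `STANDARD_OFFSETS` and `to_local` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed standard-time offsets in hours from UTC (assumption: no daylight saving time).
STANDARD_OFFSETS = {"Eastern": -5, "Central": -6, "Mountain": -7, "Pacific": -8}


def to_local(utc_dt, zone):
    """Convert a timezone-aware UTC datetime to the given US zone's standard time."""
    local_tz = timezone(timedelta(hours=STANDARD_OFFSETS[zone]))
    return utc_dt.astimezone(local_tz)


# One observation stamped in UTC lands on a different local hour in each zone.
obs = datetime(2020, 1, 1, 3, 0, tzinfo=timezone.utc)
for zone in STANDARD_OFFSETS:
    print(zone, to_local(obs, zone).isoformat())
```

Because an observation early in the UTC day falls on the previous calendar day in US local time, daily values built from local time can differ slightly between time zones.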

We provided daily data, which can be aggregated to monthly and annual data.

Please let me know if anything is still unclear.
Carolina
