@zheins
Last active August 23, 2019 03:04
GSoC 2019: ETL pipeline development for TCGA data from GDC Portal

ETL pipeline development for TCGA data from GDC Portal

Presentation Slides:

https://drive.google.com/open?id=1x99OQp9IIniSfEB5qC9lVWWtfw7ZHheW

GitHub Repository:

https://github.com/cBioPortal/gdc-et-pipeline

Commits:

  • https://github.com/cBioPortal/gdc-et-pipeline/commit/dcbbf40fc0b6dc6d2dc0d331d688308ea05e41f7
  • https://github.com/cBioPortal/gdc-et-pipeline/commit/dcfb059dd943509b9a563ea56b8f62b61dcd71f1
  • https://github.com/cBioPortal/gdc-et-pipeline/commit/58efdf1a98980861b05603a2f7795820a8c8d867
  • https://github.com/cBioPortal/gdc-et-pipeline/commit/a37b123c10ea534a964000eaf34b9b3c2fdd010c
  • https://github.com/cBioPortal/gdc-et-pipeline/commit/46967455fc2e2188a35918c49116a07e13e9387e
  • https://github.com/cBioPortal/gdc-et-pipeline/commit/ceea119e1f29107325985d108562577376ee1a10
  • https://github.com/cBioPortal/gdc-et-pipeline/commit/0eaf558d2b135bd4a74a3b940f730ce5da475bd0
  • https://github.com/cBioPortal/gdc-et-pipeline/commit/e7b0d87c7f795ac20ded697b929d6a0506a4a121
  • https://github.com/cBioPortal/gdc-et-pipeline/commit/e58311a357a4d3034824ff17d17f736f09606413

All commits were merged to the master branch of the repo.

Description of work:

This project expanded on an existing pipeline developed by a previous GSoC student, adding new Spring Batch readers, processors, and writers, along with the corresponding pipeline logic, for CNA and mRNA expression data. It also updated and improved the existing steps, including switching to the GDC GraphQL endpoint to fetch clinical data and the relevant file and sample metadata.
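
To give a rough idea of what fetching case metadata over GraphQL involves, here is a minimal Python sketch. It is not code from the pipeline (which is built on Spring Batch). The endpoint URL, the query shape (`explore`, `cases`, `hits`), and the helper names `build_cases_payload` and `post_graphql` are assumptions made for illustration and should be checked against the GDC API documentation.

```python
import json
import urllib.request

# Assumption: public GDC GraphQL URL; verify against the GDC API docs.
GDC_GRAPHQL_URL = "https://api.gdc.cancer.gov/v0/graphql"

# Illustrative query only; the field names are assumptions, not the
# query used by the pipeline.
CASES_QUERY = """
query Cases($filters: FiltersArgument, $first: Int, $offset: Int) {
  explore {
    cases {
      hits(filters: $filters, first: $first, offset: $offset) {
        total
        edges { node { case_id submitter_id } }
      }
    }
  }
}
"""


def build_cases_payload(project_id, first=100, offset=0):
    """Build the JSON body for one page of cases in a project (hypothetical helper)."""
    filters = {
        "op": "in",
        "content": {"field": "cases.project.project_id", "value": [project_id]},
    }
    return {
        "query": CASES_QUERY,
        "variables": {"filters": filters, "first": first, "offset": offset},
    }


def post_graphql(payload, url=GDC_GRAPHQL_URL):
    """POST the payload as JSON and return the decoded response (hypothetical helper)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A caller would page through results by increasing `offset` in steps of `first`, for example `post_graphql(build_cases_payload("TCGA-BRCA", first=100, offset=0))`.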

I also fixed issues with the mutation step to allow variants to be properly annotated via Genome Nexus. Genome Nexus currently only supports GRCh37, while the GDC uses GRCh38.

Completed:

  • Updated manifest and clinical steps to use GDC GraphQL endpoint
  • Updated mutation step to annotate properly through Genome Nexus
  • Added two new steps for transformation of CNA and Expression Data
  • Pipeline successfully runs to completion, transforming all four currently supported data types (Clinical, Mutation, CNA, Expression)

Another major success of the project was getting the pipeline running on a Memorial Sloan Kettering server. The project had stagnated after its initial development, so having a working pipeline on cBioPortal organization servers will hopefully motivate further development and use.

TODO:

  • Run the pipeline-generated files through the cBioPortal validator tool
  • Annotate mutations in bulk via Genome Nexus
  • Dynamic, external calls to clinical data dictionary for clinical field metadata
  • Testing, unit + integration
  • Download data files dynamically, if desired, using the GDC API instead of the data download tool
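
As a starting point for the last TODO item, the sketch below shows one way to fetch a single file by UUID over HTTP rather than through the download tool. It is written in Python for brevity and is not part of the pipeline. The base URL of the data endpoint and the helper names `data_url` and `download_file` are assumptions; controlled-access files would also need an authentication token, which is not handled here.

```python
import shutil
import urllib.parse
import urllib.request

# Assumption: GDC data endpoint base URL; verify against the GDC API docs.
GDC_DATA_URL = "https://api.gdc.cancer.gov/data"


def data_url(file_id):
    """Return the download URL for one file UUID (hypothetical helper)."""
    return f"{GDC_DATA_URL}/{urllib.parse.quote(file_id, safe='')}"


def download_file(file_id, destination):
    """Stream an open-access file to disk without loading it into memory."""
    with urllib.request.urlopen(data_url(file_id)) as response, \
            open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination
```

The file UUIDs would come from the manifest step's metadata, so the download could run per file inside the pipeline instead of as a separate manual step.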