@pgroves
Last active October 15, 2020 04:55
Q for Gen3 Forum on data-dictionary for VCF

Hello Everyone,

I am working on a small demo project that will hopefully include a Gen3 server run from compose-services. The plan is to pull data stored in a VCF from Gen3, do some analysis, and upload the output as a data type that I define in a new dictionary. If all goes well, we would go after funding for a more permanent installation later on.

First, here is the current state of the world around the data dictionaries (DD), as far as I can tell:

  • compose-services is configured by URL to point to a DD stored in S3 [1]. This does not contain the VCF definition. Is this meant to be kept up to date? It has some SHAs in the '_settings', but I don't know whether those identify a git commit, or in which repo.
  • compose-services can optionally use a DD stored as files in the same directory, and there is one checked into the repo [2]. This version also does not contain VCF and is at least 14 months old.
  • The 'main site' [3] shows a DD with a VCF, but is there a URL like [1] that I can configure my instance to pull from? Where are the source yaml files for this version?
  • The uc-cdis/datadictionary repo [4] has no VCF.yaml, but does have some related types like 'submitted_somatic_mutation'. I am surprised that what I see at gen3.datacommons.io/dd does not match these files, but I don't know if they are actually supposed to match.
  • The nci-gdc/gdcdictionary [5] has even more mutation-related types, and they have data_format=VCF, but nothing actually named 'VCF' like what I see in the dictionary browser in [3].
  • I can also see that the uc-cdis and nci-gdc GitHub repos have diverged greatly (the fork is "111 commits ahead, 719 commits behind NCI-GDC:develop").
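
For anyone who wants to repeat the "does this dictionary mention VCF?" check from the list above, here is a rough Python sketch. It assumes the artifact at [1] is a single JSON object whose top-level keys name the individual schema entries; that layout is an assumption on my part, and the helper names (`mentions`, `entries_mentioning`, `load_dictionary`) are made up for illustration rather than part of any Gen3 tooling.

```python
import json
from urllib.request import urlopen

# URL [1] from this post; whether it is kept up to date is part of the question.
SCHEMA_URL = "https://s3.amazonaws.com/dictionary-artifacts/datadictionary/develop/schema.json"


def mentions(value, needle):
    """Return True if needle occurs (case-insensitively) anywhere in a nested JSON value."""
    if isinstance(value, str):
        return needle.lower() in value.lower()
    if isinstance(value, dict):
        return any(mentions(k, needle) or mentions(v, needle) for k, v in value.items())
    if isinstance(value, list):
        return any(mentions(item, needle) for item in value)
    return False  # numbers, booleans, None


def entries_mentioning(dictionary, needle="vcf"):
    """List top-level entries whose name or contents mention needle.

    Assumes (hypothetically) that each top-level key is one schema entry.
    """
    return sorted(
        name
        for name, body in dictionary.items()
        if mentions(name, needle) or mentions(body, needle)
    )


def load_dictionary(url=SCHEMA_URL):
    """Download and parse a dictionary artifact (needs network access)."""
    with urlopen(url) as response:
        return json.load(response)


if __name__ == "__main__":
    print(entries_mentioning(load_dictionary()))
```

The same `entries_mentioning` call can be pointed at a dictionary built from the yaml files in [2], [4], or [5] once they are loaded into a dict, which makes it easier to see which source mentions VCF only as a `data_format` value and which, if any, defines a node named for it.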

At this point I'd be interested to hear whether there is a simple story about what matters and what doesn't in the list above.

Otherwise, my real question is: what's the most up-to-date set of schemas that you would recommend to start with if I need to represent the simplest possible mutations that will originally be in a VCF file?

Thank you for any assistance.

[1] https://s3.amazonaws.com/dictionary-artifacts/datadictionary/develop/schema.json

[2] https://github.com/uc-cdis/compose-services/tree/master/datadictionary/gdcdictionary/schemas

[3] https://gen3.datacommons.io/dd

[4] https://github.com/uc-cdis/datadictionary/tree/develop/gdcdictionary/schemas

[5] https://github.com/NCI-GDC/gdcdictionary/tree/develop/gdcdictionary/schemas
