@bwalsh
Last active August 1, 2019 21:21

CDA Entity and Property Intersection

To inform discussions of entity mapping and harmonization, we surveyed the existing gdc, pdc, and crdc schemas.

Methodology: import the relevant schemas, then apply minor synonyms (e.g. Case renamed to Subject).
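
The field match percentages reported in this survey are not defined here. As a minimal sketch, assuming a match is the share of field names common to all compared projects relative to the union of their field names, the calculation could look like the following. The function name `field_match_pct` and the sample field lists are hypothetical and only illustrate the idea.

```python
def field_match_pct(*field_lists):
    """Percent of field names shared by all projects, relative to the union of their fields.

    Assumption: this intersection-over-union measure is one plausible reading
    of "field match %"; the survey does not state its exact formula.
    """
    field_sets = [set(fields) for fields in field_lists]
    union = set.union(*field_sets)
    if not union:
        return 0
    shared = set.intersection(*field_sets)
    return round(100 * len(shared) / len(union))


# Hypothetical field lists for one entity as defined by two projects
gdc_sample = ["id", "sample_type", "state", "tumor_code"]
pdc_sample = ["id", "sample_type", "status"]

print(field_match_pct(gdc_sample, pdc_sample))
```

Comparing names only means that fields with the same meaning but different names (for example `state` and `status`) count as mismatches, which is one reason synonyms are applied before comparing.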

Projects with entities in common (field match %):

gdc,pdc,crdc

  • Aliquot (13%)
  • Demographic (29%)
  • Diagnosis (29%)
  • Program (9%)
  • Project (3%)
  • Publication (0%)
  • Sample (44%)
  • Subject (6%)

pdc,crdc

  • Protocol (42%)
  • Study (6%)
  • StudyRunMetadata (25%)

gdc,crdc

  • AlignedReads (22%)
  • AlignedReadsIndex (25%)
  • AlignmentCocleaningWorkflow (26%)
  • AlignmentWorkflow (26%)
  • CopyNumberEstimate (24%)
  • CopyNumberSegment (20%)
  • CopyNumberVariationWorkflow (14%)
  • FollowUp (6%)
  • GeneExpression (22%)
  • ProteinExpression (28%)
  • ReadGroup (80%)
  • ReadGroupQc (61%)
  • RnaExpressionWorkflow (29%)
  • SubmittedAlignedReads (25%)
  • SubmittedGenomicProfile (25%)
  • SubmittedUnalignedReads (28%)
  • Treatment (21%)

gdc,pdc

  • Clinical (9%)
  • File (7%)

Projects with unique entities:

crdc

  • Acknowledgement
  • AggregatedGenotypingArray
  • CoreMetadataCollection
  • DrugAttribute
  • DrugResponse
  • GenotypingArray
  • GenotypingArrayWorkflow
  • Keyword
  • MirnaMicroarray
  • MrnaMicroarray
  • MzmlProteinMassSpectrometry
  • OncomapAssay
  • OncomapPanel
  • ProteomicWorkflow
  • PsmProteinMassSpectrometry
  • RawProteinMassSpectrometry
  • SubmittedMethylation
  • SummaryDrugResponse
  • TangentCopyNumber

pdc

  • AliquotRunMetadata
  • Biospecimen
  • CasePerFile
  • ClinicalMetadata
  • ExperimentProjects
  • ExperimentType
  • ExperimentalMetadata
  • FileCount
  • FileMetadata
  • FilePerStudy
  • Filter
  • FilterElement
  • Gene
  • GeneStudySpectralCount
  • Paginated
  • Pagination
  • PdcDataStats
  • Ptm
  • QuantitiveData
  • Query
  • SearchRecord
  • Spectral_count
  • StudyExperimentalDesign
  • Sunburst
  • WorkflowMetadata

gdc

  • AggregatedSomaticMutation
  • AnalysisMetadata
  • Analyte
  • AnnotatedSomaticMutation
  • Annotation
  • Archive
  • BiospecimenSupplement
  • Center
  • ClinicalSupplement
  • CopyNumberLiftoverWorkflow
  • DataFormat
  • DataSubtype
  • DataType
  • ExperimentMetadata
  • ExperimentalStrategy
  • Exposure
  • FamilyHistory
  • FilteredCopyNumberSegment
  • GenomicProfileHarmonizationWorkflow
  • GermlineMutationCallingWorkflow
  • MaskedSomaticMutation
  • MethylationArrayHarmonizationWorkflow
  • MethylationBetaValue
  • MethylationLiftoverWorkflow
  • MirnaExpression
  • MirnaExpressionWorkflow
  • MolecularTest
  • PathologyReport
  • Platform
  • Portion
  • RawMethylationArray
  • RunMetadata
  • SimpleGermlineVariation
  • SimpleSomaticMutation
  • Slide
  • SlideImage
  • SomaticAggregationWorkflow
  • SomaticAnnotationWorkflow
  • SomaticCopyNumberWorkflow
  • SomaticMutationCallingWorkflow
  • SomaticMutationIndex
  • StructuralVariantCallingWorkflow
  • StructuralVariation
  • SubmittedGenotypingArray
  • SubmittedMethylationBetaValue
  • SubmittedTangentCopyNumber
  • Tag
  • TissueSourceSite

version info

  • gdc
$ git remote -v
origin	https://github.com/NCI-GDC/gdcdictionary (fetch)
origin	https://github.com/NCI-GDC/gdcdictionary (push)
$ git status
On branch develop
Your branch is up to date with 'origin/develop'.

nothing to commit, working tree clean
$ git log -1 | head -1
commit efef28495bbbdb11efc6580f9fc06d6d68d6a3bd
  • crdc
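# see https://github.com/uc-cdis/cdis-manifest  <project>/
import requests

def to_camel_case(name):
    # Stand-in for a helper that is not shown; assumed to turn 'aligned_reads' into 'AlignedReads'
    return ''.join(part.capitalize() for part in name.split('_'))

r = requests.get("https://s3.amazonaws.com/dictionary-artifacts/dcfdictionary/3.1.2/schema.json")
crdc = r.json()

# Map each entity name to its property names, skipping '$ref' entries
crdc_entities = {}
for key, e in crdc.items():
    if 'id' in e and 'properties' in e:
        crdc_entities[to_camel_case(e['id'])] = [p for p in e['properties'].keys() if p != '$ref']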

  • pdc
import requests

# Introspect the PDC GraphQL schema for type names and field names
r = requests.get('https://pdc.esacinc.com/graphql?query={ __schema { types { name kind fields { name } } } }')
pdc = r.json()

# Transform: keep object types, skip introspection types (leading '_'), drop 'UI' from type names
pdc_entities = {
    t['name'].replace('UI', ''): [f['name'] for f in t['fields']]
    for t in pdc['data']['__schema']['types']
    if t['kind'] == 'OBJECT' and not t['name'].startswith('_')
}

# Apply synonyms: Case becomes Subject, Experiment becomes Study
pdc_entities['Subject'] = pdc_entities.pop('Case')
pdc_entities['Study'] = pdc_entities.pop('Experiment')
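
Once each project has a dictionary of entity names to field names, the "entities in common" and "unique entities" groupings can be derived by recording which projects define each entity. The following is a sketch under that assumption; `group_by_projects` and the toy input are hypothetical, and the entity names are borrowed from the lists in this survey only to keep the example recognizable.

```python
from collections import defaultdict


def group_by_projects(project_entities):
    """Map each tuple of project names to the sorted entities defined by exactly those projects."""
    owners = defaultdict(set)
    for project, entities in project_entities.items():
        for entity in entities:
            owners[entity].add(project)

    groups = defaultdict(list)
    for entity, projects in owners.items():
        groups[tuple(sorted(projects))].append(entity)
    return {projects: sorted(entities) for projects, entities in groups.items()}


# Toy input: project -> {entity name: [field names]} (field lists left empty for brevity)
project_entities = {
    "gdc": {"Sample": [], "Clinical": [], "Slide": []},
    "pdc": {"Sample": [], "Clinical": [], "Protocol": [], "Gene": []},
    "crdc": {"Sample": [], "Protocol": [], "Keyword": []},
}

groups = group_by_projects(project_entities)
for projects, entities in sorted(groups.items(), key=lambda kv: -len(kv[0])):
    print(",".join(projects), entities)
```

Groups keyed by a single project correspond to that project's unique entities; groups keyed by two or three projects are the shared ones.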

Sample mapping: Subject

crdc.subject pdc.case gdc.case notes
breed harmonize to ontology
days_to_lost_to_followup days_to_lost_to_followup
disease_type disease_type disease_type harmonize to ontology
id case_id id
index_date index_date
lost_to_followup lost_to_followup
primary_site primary_site primary_site harmonize to ontology
project_id normalize to subject.project
species harmonize to ontology
state case_status state
studies normalize to subject.study
submitter_id case_submitter_id submitter_id
tissue_source_site_code tissue_source_sites
aliquot_id normalize to subject.sample.aliquot
aliquot_status normalize to subject.sample.aliquot
aliquot_submitter_id normalize to subject.sample.aliquot
program_name normalize to subject.project.program
project_name normalize to subject.project
sample_id normalize to subject.sample
sample_status normalize to subject.sample
sample_submitter_id normalize to subject.sample
sample_type normalize to subject.sample
batch_id
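
Rows of the mapping that fill all three project columns suggest a direct rename for PDC case fields. As a sketch, assuming those rows read crdc, pdc, gdc in column order, the renames could be applied to a record as shown here. `PDC_CASE_TO_SUBJECT`, `harmonize_pdc_case`, and the sample record are hypothetical names and data for illustration; ontology harmonization and the "normalize to ..." rows are not handled.

```python
# PDC case field -> harmonized Subject field, from the rows with all three columns filled
PDC_CASE_TO_SUBJECT = {
    "case_id": "id",
    "case_status": "state",
    "case_submitter_id": "submitter_id",
}


def harmonize_pdc_case(record):
    """Return a copy of a PDC case record with mapped field names; other fields pass through unchanged."""
    return {PDC_CASE_TO_SUBJECT.get(name, name): value for name, value in record.items()}


# Hypothetical PDC case record
record = {
    "case_id": "c-1",
    "case_status": "qualified",
    "disease_type": "Breast Invasive Carcinoma",
}

print(harmonize_pdc_case(record))
```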