malisas/g2p_issue_discussion.rst Secret

## g2p_issue_discussion.rst

      
    Roadmap


Guidance

1) Feature's featureType field does not require term
Ensembl's server:
All fields in featureType (which is of type OntologyTerm) are present, e.g.:
"featureType": {
  "sourceName": "Sequence Ontology",
  "id": "SO:0001060",
  "sourceVersion": "",
  "term": "sequence_variant"
}

OHSU server:
The term is not stored, e.g. in the case below for sequence_alteration:
"featureType": {
  "id": "http://purl.obolibrary.org/obo/SO_0001059",
  "sourceName": "SO",
  "sourceVersion": null,
  "term": null
}

Conclusion: It is acceptable to only store the id and not the term of a featureType OntologyTerm for a Feature, as the id is the identifying attribute and is sufficient. Term may be useful for creating GFF files from responses.
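
If the id alone identifies a featureType, a client comparing records from the two servers has to cope with the two id styles shown above (a CURIE such as "SO:0001060" and an OBO PURL). The following is a minimal sketch of such a comparison, not part of either server. The helper names `normalize_term_id` and `same_feature_type` are hypothetical, and the Ensembl-style record for SO:0001059 is constructed here for illustration only.

```python
OBO_PURL_PREFIX = "http://purl.obolibrary.org/obo/"


def normalize_term_id(term_id):
    """Return a CURIE such as 'SO:0001059' for either id style (assumed convention)."""
    if term_id.startswith(OBO_PURL_PREFIX):
        local = term_id[len(OBO_PURL_PREFIX):]
        prefix, sep, number = local.partition("_")
        if sep:
            return f"{prefix}:{number}"
    # Anything else (already a CURIE, or an unknown URL) is returned unchanged.
    return term_id


def same_feature_type(a, b):
    """Compare two featureType dicts by id only; term and sourceName are ignored."""
    return normalize_term_id(a["id"]) == normalize_term_id(b["id"])


# Illustrative record in the Ensembl style (not taken from a real response).
ensembl_style = {
    "sourceName": "Sequence Ontology",
    "id": "SO:0001059",
    "sourceVersion": "",
    "term": "sequence_alteration",
}
# Record in the OHSU style, as shown above: no term stored.
ohsu_style = {
    "id": "http://purl.obolibrary.org/obo/SO_0001059",
    "sourceName": "SO",
    "sourceVersion": None,
    "term": None,
}

print(same_feature_type(ensembl_style, ohsu_style))
```

Because the comparison never reads `term`, a server that stores only the id remains interoperable under this scheme.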

Issues

1) String Searches are undefined
SearchGenotypePhenotypeRequest currently accepts a string (i.e. not an array of strings) as one of the possible query values for genotype, phenotype, and evidence. What types of strings are acceptable, and which fields of a FeaturePhenotypeAssociation are searched against?
Ensembl server:
Note: The Ensembl server is currently entirely string-search based; the strings are largely placeholders for the other GA4GH data types:
{ "feature": "rs6920220",  "phenotype": "http://www.ebi.ac.uk/efo/EFO_0003767", "evidence" : "", "pageSize": 10 }

For the genotype portion of the query:
Currently, SNP external-identifier strings (e.g. "rs6920220") are matched against the X? field of a Feature (as opposed to using the ExternalIdentifierQuery data structure).
This is a self-acknowledged placeholder: "I am regarding rs6920220 as a placeholder for an Ensembl GA4GH feature id - I would expect the string to be a GA4GH feature id as this is consistent with the rest of the API.
I see people querying to extract features on a region then using the feature ids to extract associations, or extracting variant  annotation and looking up the associations for overlapping features by id."
Note that other people working on G2P have been discussing the use of featureIds as input to SearchGenotypePhenotypeRequest, although they envision featureId input as a replacement for the array of fully expanded Features in GenomicFeatureQuery rather than for the singleton string search.
For the phenotype portion of the query:
The URL of an Experimental Factor Ontology (EFO) term was used for phenotype, "http://www.ebi.ac.uk/efo/EFO_0003767".
Once again, this is a placeholder for an OntologyTermQuery.
"Again I would expect the string to be the PhenotypeInstance id."
OHSU server:
For the phenotype portion of the query:
Currently, simple strings are matched to phenotype labels, e.g.:
<http://ohsu.edu/cgd/5c895709> a OBO:OMIM_606764 ;
    rdfs:label "GIST with sensitivity to therapy" ;
    OBO:BFO_0000159 <http://ohsu.edu/cgd/sensitivity> .

Thoughts from Ensembl:
"I would be very wary of disease labels to look up associations as they look like simple text descriptions - it's hard to tell if no results means the phenotype string wasn't recognised or there aren't any associations. A phenotype search by label/synonym followed by an association search by id feels cleaner."
Conclusion:
Current guidance on string queries is "Each query can be made against a string, an array of external identifiers (such as for gene or SNP ids), ontology term ids, or full feature/phenotype/evidence objects."
Implementation experience shows that while external identifiers, ontology terms and objects are fairly specific, a simple string is open to interpretation.
String searches appear to be moving away from a "Google"-style search across all fields of a FeaturePhenotypeAssociation and towards a match against one specific label or field.
Multiple parties want featureId as an input in some form, although it is unclear whether it should be used for a string search or as a GenomicFeatureQuery field. Note also that the variant_annotation group is already returning featureIds (as opposed to entire Feature objects) as part of SearchVariantAnnotationsResponse.
Further investigation revealed that two different GA4GH servers referring to the same genomic location will return two different featureId values (confirm!). Care must be taken to ensure the featureId does not become yet another synonym.
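
The two-step pattern suggested in the Ensembl comment (a phenotype search by label, followed by an association search by id) can be sketched as below. This is an in-memory illustration only, not either server's implementation. All function names, the status strings, the association record and the second phenotype are hypothetical; only the first phenotype id and label come from the OHSU example above.

```python
# Toy data. The second phenotype and the association are made up for illustration.
PHENOTYPES = [
    {"id": "http://ohsu.edu/cgd/5c895709",
     "label": "GIST with sensitivity to therapy"},
    {"id": "example:phenotype-without-associations",
     "label": "hypothetical phenotype with no associations"},
]
ASSOCIATIONS = [
    {"id": "example:association-1",
     "phenotypeId": "http://ohsu.edu/cgd/5c895709",
     "featureId": "example:feature-1"},
]


def search_phenotypes_by_label(text):
    """Step 1: case-insensitive substring match on the phenotype label."""
    needle = text.casefold()
    return [p for p in PHENOTYPES if needle in p["label"].casefold()]


def search_associations_by_phenotype_id(phenotype_id):
    """Step 2: exact match on the phenotype id."""
    return [a for a in ASSOCIATIONS if a["phenotypeId"] == phenotype_id]


def find_associations(label_text):
    """Return (status, associations) so an unrecognised label is
    distinguishable from a recognised phenotype with no associations."""
    matches = search_phenotypes_by_label(label_text)
    if not matches:
        return "phenotype_not_recognised", []
    found = []
    for phenotype in matches:
        found.extend(search_associations_by_phenotype_id(phenotype["id"]))
    return ("ok" if found else "no_associations"), found


for query in ("GIST", "hypothetical", "nonexistent"):
    status, results = find_associations(query)
    print(query, status, len(results))
```

Separating the steps addresses the ambiguity raised above: an empty result from step 1 means the string was not recognised, while an empty result from step 2 means there are no associations.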
2) How to represent a Feature that has several types?
In the OHSU dataset, the following feature is tagged with both sequence_alteration and missense_variant (neither of which is a child of the other in the Sequence Ontology):
<http://cancer.sanger.ac.uk/cosmic/mutation/overview?id=965> a OBO:SO_0001059,
           OBO:SO_0001583,
           <http://www.w3.org/2002/07/owl#NamedIndividual> ;

It is not valid to store multiple ids in a featureType, e.g.:
"featureType": {
  "id": [
    "http://purl.obolibrary.org/obo/SO_0001583",
    "http://purl.obolibrary.org/obo/SO_0001059",
    "http://www.w3.org/2002/07/owl#NamedIndividual"
  ],
  "sourceName": null,
  "sourceVersion": null,
  "term": null
}

The Ensembl server does not have multiple types: "We have not hit this problem - we are only using functional_variant part of SO in the variant annotation records. It makes sense to just use sequence_feature and it's child terms for the feature types. I guess this isn't a great solution if you don't have enough information for a full annotation record as the missense information would be lost."
Conclusion: Some Features have multiple types, and it is unclear how to store them. Perhaps featureType can be an array of OntologyTerms instead of just one OntologyTerm? (In that case, it is unclear how a search using only one OntologyTerm would match against a featureType of multiple OntologyTerms)

Recommendation: change Feature.featureType in sequenceAnnotations.avdl to:
array<OntologyTerm> featureType;
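
One possible answer to the open question of how a single-term search would match an array-valued featureType is "match if any element has the queried id". The sketch below assumes that rule. It is not an agreed semantics, and `feature_matches_type` is a hypothetical helper. The feature uses the two Sequence Ontology ids from the COSMIC example above.

```python
def feature_matches_type(feature, query_term_id):
    """True if any OntologyTerm in the featureType array has the queried id.

    Assumed rule: any-match on id. Ontology hierarchy is not considered.
    """
    return any(term["id"] == query_term_id for term in feature["featureType"])


def ontology_term(term_id):
    """Build a minimal OntologyTerm dict carrying only the id."""
    return {"id": term_id, "sourceName": None, "sourceVersion": None, "term": None}


cosmic_feature = {
    "id": "http://cancer.sanger.ac.uk/cosmic/mutation/overview?id=965",
    "featureType": [
        ontology_term("http://purl.obolibrary.org/obo/SO_0001583"),
        ontology_term("http://purl.obolibrary.org/obo/SO_0001059"),
    ],
}

print(feature_matches_type(cosmic_feature, "http://purl.obolibrary.org/obo/SO_0001059"))
print(feature_matches_type(cosmic_feature, "http://purl.obolibrary.org/obo/SO_0001060"))
```

An any-match rule keeps single-term queries working unchanged. Whether a query for a parent term should also match child terms is a separate decision that this sketch leaves out.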

3) No way for the client to look up which sets of external identifiers a server supports
As Sarah Hunt mentioned, "There's no way for the client to look up which sets of external identifiers a server supports". This speaks to a general issue that each server needs to be defined in terms of which ontology sources and external identifier sources are being used by that particular server. This information needs to get to the user somehow.
4) Cases of minimal information for filling out Feature objects
As a pragmatic point, the implementation of GA4GH should not require additional curation by the maintainers of evidence databases. These databases are driven by physicians' notes, research papers, and publication NLP processes. Importantly, they are maintained and developed separately from any particular omics representation of a given cohort. :underline:`Therefore they often have minimal information about the feature, often just a name or description.` Two alternatives present themselves: allow the G2P query to return a flexible representation of a feature, or provide a service that returns a fully formed feature given inputs of name/description/term/identifier.
For the OHSU server, both KIT and PDGFRA appear to be involved in GIST, but it is not clear how they are related via the ttl file.
It is also not clear how the following record relates to the response below::
<http://ohsu.edu/cgd/055b872c> a OBO:SO_0001059 ;
    rdfs:label "PDGFRA  wild type no mutation" ;
    OBO:RO_0002200 CGD:da20b3e8 .

Response::
{
  "attributes": {
    "vals": {
      "http://www.w3.org/2000/01/rdf-schema#label": "KIT  wild type no mutation"
    }
  },
  "childIds": [],
  "end": null,
  "featureSetId": null,
  "featureType": {
    "id": "http://purl.obolibrary.org/obo/SO_0001059",
    "sourceName": "OBO",
    "sourceVersion": null,
    "term": null
  },
  "id": "http://ohsu.edu/cgd/27d2169c",
  "parentId": null,
  "referenceName": null,
  "start": 0,
  "strand": null
}

5) Evidence representation
Ensembl server::
{
  "evidenceType": {
    "sourceName": "IAO",
    "id": "http://purl.obolibrary.org/obo/IAO_0000311",
    "sourceVersion": null,
    "term": "publication"
  },
  "description": "PMID:23128233"
}

The PMID (the external identifier) is currently stored in the description field. The evidenceType id refers to publication, which is also given as the term.
OHSU server::
"evidence": [
  {
    "description": "sensitivity",
    "evidenceType": {
      "id": "http://ohsu.edu/cgd/30ebfd1a",
      "sourceName": "CGD",
      "sourceVersion": "cgd-2015-12-01-17-34",
      "term": "http://purl.obolibrary.org/obo/ECO_0000033"
    }
  }
]

Here the evidenceType term is a URL for traceable author statement. The value that would be considered an external identifier is stored in the id field of evidenceType: "http://ohsu.edu/cgd/30ebfd1a" (in contrast to the Ensembl server, which uses the description field). The evidence description is "sensitivity".
"Here we attempted to denote the fact that we have evidence of sensitivity between the P/G/Env and a traceable author statement (ECO_0000033).  It looks like we both ran into an issue of where to store the pubmed link.    We do not currently have evidence  of p-term or odds.   It does seem that both these and machine learning evidence need a way to express that."
Conclusion: Both servers ran into the issue of where to store the external identifier links (e.g. PMID). Note that an identifiers field has been added to Evidence (as well as to PhenotypeInstance) by people working on the schema, and this will hopefully be accepted into the regular g2p schema. It is also unclear where to store things like p-values.
Recommendation: A minimal change would be to add a key/value array of values to Evidence.
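
As a rough illustration of that recommendation, the sketch below attaches a key/value structure to an Evidence record, with each key mapping to an array of string values. The field name `info`, the keys `pmid` and `p_value`, and the helper `make_evidence` are assumptions made here and are not part of the current schema. The p-value is a made-up placeholder. The PMID and the evidenceType id come from the Ensembl example above.

```python
def make_evidence(evidence_type_id, description, info=None):
    """Build an Evidence dict with an assumed 'info' map of key -> list of strings."""
    return {
        "evidenceType": {
            "id": evidence_type_id,
            "sourceName": None,
            "sourceVersion": None,
            "term": None,
        },
        "description": description,
        # Values are stringified so every entry has the same type.
        "info": {
            key: [str(value) for value in values]
            for key, values in (info or {}).items()
        },
    }


evidence = make_evidence(
    "http://purl.obolibrary.org/obo/IAO_0000311",
    "publication",
    info={
        "pmid": ["23128233"],   # external identifier moved out of description
        "p_value": [0.05],      # placeholder value, for illustration only
    },
)

print(evidence["info"])
```

With a structure like this, neither the description field nor the evidenceType id has to double as storage for the external identifier, and statistical values have a provisional home until something more specific is defined.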
6) How would we model machine learning and statistical data?
Sarah Hunt: "Has anyone started thinking about modeling statistical and machine learning data? As I said before, I'd be fine to stick it all in a basic key-value structure initially as that is cleaner than what I'm doing currently and can be superceded when we have something more specific."