@jsstevenson
Last active December 18, 2023 22:23

notes

Each kind of response is slightly different, but this tries to make them more consistent in a few ways:

  • No more gene vs normalized gene object. Everything is a GA4GH core Gene. This means no more associated_with vs xref, so one less kind of MatchType.
  • The outermost level includes the query, any additional parameters passed to the API endpoint (I think this is good practice to include), and service information.
  • The outermost level also includes a match key that points to what the individual Python QueryHandler methods would return. IMO it makes more sense to move this stuff into the REST API response because these are things that you don't otherwise typically include in Python-to-Python methods (e.g. another class doesn't need to know what version of Gene Normalizer is running, it's literally sharing the environment).
  • match objects include source metadata and warnings. In some of the responses, we have previously included source metadata closer to the actual source matches, but I think it might be simpler to just keep it in the same place no matter what kind of response is being returned.
  • Warnings are a bit more standardized. We should define (enumerate) legitimate warning types as needed (descriptions can vary based on specifics)
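
As a rough sketch of the split described above, the REST layer could wrap whatever a QueryHandler method returns in the shared envelope. `wrap_response` and the `SERVICE_*` constants are hypothetical names used only for illustration, not part of Gene Normalizer; the keys mirror the example responses in this gist.

```python
from datetime import datetime
from typing import Any, Dict, Optional

# Hypothetical constants; a real service would read these from package metadata.
SERVICE_NAME = "gene-normalizer"
SERVICE_VERSION = "0.3.0-dev1"
SERVICE_URL = "https://github.com/cancervariants/gene-normalization"


def wrap_response(
    query: str,
    match: Dict[str, Any],
    additional_params: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Wrap a QueryHandler-style result in the REST response envelope."""
    response: Dict[str, Any] = {"query": query}
    # Only echo back parameters the caller actually supplied.
    if additional_params:
        response["additional_params"] = additional_params
    response["match"] = match
    # Service details live here, not in the Python-to-Python return value.
    response["service_meta_"] = {
        "name": SERVICE_NAME,
        "version": SERVICE_VERSION,
        "response_datetime": str(datetime.now()),
        "url": SERVICE_URL,
    }
    return response
```

With this shape, the Python methods stay free of service details, and only the API layer knows about versions and timestamps.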

The awkward part is that each type of search has a different relationship between the returned objects and the match_type. In search, a match_type is assigned to each source, and we return every gene with that match_type under that source*. In normalize, the match_type is associated with a single normalized gene. In normalize_unmerged, we return a bunch of genes under a bunch of sources, but the match_type corresponds to the match for the normalized gene that groups those genes together. This makes it hard to pick a good, consistent place to hold the match_type, since its semantics differ slightly across search types. I'd also like to avoid putting it directly into the individual Gene objects; I would much prefer to stick to the VRS/GA4GH models.

* We have an issue somewhere to return all matches and match types for every source. I think we should do this eventually, and should plan accordingly now even if we don't implement it
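
To make the difference concrete, here is a small sketch of how a client might pull match types out of each response shape, assuming the key layout used in the example responses in this gist. `collect_match_types` and the `kind` argument are hypothetical and exist only for illustration.

```python
from typing import Any, Dict

SOURCES = ("hgnc", "ensembl", "ncbi")


def collect_match_types(kind: str, response: Dict[str, Any]) -> Dict[str, int]:
    """Return match types keyed by what they describe in each response kind."""
    match = response["match"]
    if kind == "search":
        # One match_type per source, covering every gene listed under it.
        found: Dict[str, int] = {}
        for source in SOURCES:
            source_match = match.get(f"{source}_matches")
            if source_match is not None:
                found[source] = source_match["match_type"]
        return found
    if kind in ("normalize", "normalize_unmerged"):
        # One match_type for the normalized concept as a whole.
        return {"normalized": match["match_type"]}
    raise ValueError(f"unknown response kind: {kind}")
```

The caller has to know which kind of response it is holding before it can interpret match_type, which is the inconsistency described above.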

{
"query": "ACHE",
"match": {
"match_type": 100,
"normalized_id": "hgnc:108",
"warnings": [
{
"type": "type of warning here",
"description": "describe it"
}
],
"gene": {
"type": "Gene",
"id": "normalize.gene.hgnc:108",
"label": "ACHE",
"mappings": [
{
"coding": { "code": "ENSG00000087085", "system": "ensembl" },
"relation": "relatedMatch"
},
{
"coding": { "code": "43", "system": "ncbigene" },
"relation": "relatedMatch"
},
{
"coding": { "code": "OTTHUMG00000157033", "system": "vega" },
"relation": "relatedMatch"
},
{
"coding": { "code": "uc003uxi.4", "system": "ucsc" },
"relation": "relatedMatch"
},
{
"coding": { "code": "CCDS5710", "system": "ccds" },
"relation": "relatedMatch"
},
{
"coding": { "code": "CCDS64736", "system": "ccds" },
"relation": "relatedMatch"
},
{
"coding": { "code": "CCDS5709", "system": "ccds" },
"relation": "relatedMatch"
},
{
"coding": { "code": "P22303", "system": "uniprot" },
"relation": "relatedMatch"
},
{
"coding": { "code": "1380483", "system": "pubmed" },
"relation": "relatedMatch"
},
{
"coding": { "code": "100740", "system": "omim" },
"relation": "relatedMatch"
},
{
"coding": { "code": "S09.979", "system": "merops" },
"relation": "relatedMatch"
},
{
"coding": { "code": "2465", "system": "iuphar" },
"relation": "relatedMatch"
},
{
"coding": { "code": "NM_015831", "system": "refseq" },
"relation": "relatedMatch"
}
],
"aliases": ["3.1.1.7", "YT", "N-ACHE", "ARACHE", "ACEE"],
"extensions": [
{
"name": "previous_symbols",
"value": ["ACEE", "YT"],
"type": "Extension"
},
{
"name": "approved_name",
"value": "acetylcholinesterase (Cartwright blood group)",
"type": "Extension"
},
{ "name": "symbol_status", "value": "approved", "type": "Extension" },
{
"name": "ncbi_locations",
"value": [
{
"id": "ga4gh:SL.U7vPSlX8eyCKdFSiROIsc9om0Y7pCm2g",
"type": "SequenceLocation",
"sequenceReference": {
"type": "SequenceReference",
"refgetAccession": "SQ.F-LrLMe1SRpfUZHkQmvkVKFEGaoDeHul"
},
"start": 100889993,
"end": 100896994
}
],
"type": "Extension"
},
{
"name": "ensembl_locations",
"value": [
{
"id": "ga4gh:SL.dnydHb2Bnv5pwXjI4MpJmrZUADf5QLe1",
"type": "SequenceLocation",
"sequenceReference": {
"type": "SequenceReference",
"refgetAccession": "SQ.F-LrLMe1SRpfUZHkQmvkVKFEGaoDeHul"
},
"start": 100889993,
"end": 100896974
}
],
"type": "Extension"
},
{
"name": "ncbi_gene_type",
"type": "Extension",
"value": "protein-coding"
},
{
"name": "hgnc_locus_type",
"type": "Extension",
"value": "gene with protein product"
},
{
"name": "ensembl_biotype",
"type": "Extension",
"value": "protein_coding"
},
{ "name": "strand", "type": "Extension", "value": "-" }
]
},
"source_meta": {
"hgnc": {
"data_license": "custom",
"data_license_url": "https://www.ncbi.nlm.nih.gov/home/about/policies/",
"version": "20221021",
"data_url": "ftp://ftp.ncbi.nlm.nih.gov",
"rdp_url": "https://reusabledata.org/ncbi-gene.html",
"data_license_attributes": {
"non_commercial": false,
"attribution": false,
"share_alike": false
},
"genome_assemblies": ["GRCh38.p14"]
},
"ensembl": {
"data_license": "custom",
"data_license_url": "https://www.ncbi.nlm.nih.gov/home/about/policies/",
"version": "20221021",
"data_url": "ftp://ftp.ncbi.nlm.nih.gov",
"rdp_url": "https://reusabledata.org/ncbi-gene.html",
"data_license_attributes": {
"non_commercial": false,
"attribution": false,
"share_alike": false
},
"genome_assemblies": ["GRCh38.p14"]
},
"ncbi": {
"data_license": "custom",
"data_license_url": "https://www.ncbi.nlm.nih.gov/home/about/policies/",
"version": "20221021",
"data_url": "ftp://ftp.ncbi.nlm.nih.gov",
"rdp_url": "https://reusabledata.org/ncbi-gene.html",
"data_license_attributes": {
"non_commercial": false,
"attribution": false,
"share_alike": false
},
"genome_assemblies": ["GRCh38.p14"]
}
}
},
"service_meta_": {
"name": "gene-normalizer",
"version": "0.3.0-dev1",
"response_datetime": "2023-09-29 14:53:07.329897",
"url": "https://github.com/cancervariants/gene-normalization"
}
}
{
"query": "ACHE",
"match": {
"match_type": 100,
"normalized_id": "hgnc:108",
"warnings": [
{
"type": "type of warning here",
"description": "describe it"
}
],
"source_genes": {
"hgnc_genes": [
{
"type": "Gene",
"id": "hgnc:108",
"label": "ACHE",
"mappings": [
{
"coding": { "code": "ENSG00000087085", "system": "ensembl" },
"relation": "relatedMatch"
},
{
"coding": { "code": "43", "system": "ncbigene" },
"relation": "relatedMatch"
}
],
"aliases": ["3.1.1.7", "YT", "N-ACHE", "ARACHE", "ACEE"],
"extensions": [
{
"name": "previous_symbols",
"value": ["ACEE", "YT"],
"type": "Extension"
},
{
"name": "symbol_status",
"value": "approved",
"type": "Extension"
},
{
"name": "hgnc_locus_type",
"type": "Extension",
"value": "gene with protein product"
},
{ "name": "strand", "type": "Extension", "value": "-" }
]
}
],
"ensembl_genes": [
{
"type": "Gene",
"id": "ncbi.gene:43",
"label": "ACHE",
"mappings": [
{
"coding": { "code": "ENSG00000087085", "system": "ensembl" },
"relation": "relatedMatch"
},
{
"coding": { "code": "43", "system": "ncbigene" },
"relation": "relatedMatch"
},
{
"coding": { "code": "OTTHUMG00000157033", "system": "vega" },
"relation": "relatedMatch"
},
{
"coding": { "code": "uc003uxi.4", "system": "ucsc" },
"relation": "relatedMatch"
},
{
"coding": { "code": "CCDS5710", "system": "ccds" },
"relation": "relatedMatch"
}
],
"aliases": ["3.1.1.7", "YT", "N-ACHE", "ARACHE", "ACEE"],
"extensions": [
{
"name": "ncbi_locations",
"value": [
{
"id": "ga4gh:SL.U7vPSlX8eyCKdFSiROIsc9om0Y7pCm2g",
"type": "SequenceLocation",
"sequenceReference": {
"type": "SequenceReference",
"refgetAccession": "SQ.F-LrLMe1SRpfUZHkQmvkVKFEGaoDeHul"
},
"start": 100889993,
"end": 100896994
}
],
"type": "Extension"
},
{
"name": "ncbi_gene_type",
"type": "Extension",
"value": "protein-coding"
},
{ "name": "strand", "type": "Extension", "value": "-" }
]
}
],
"ncbi_genes": [
{
"type": "Gene",
"id": "ncbi.gene:43",
"label": "ACHE",
"mappings": [
{
"coding": { "code": "ENSG00000087085", "system": "ensembl" },
"relation": "relatedMatch"
},
{
"coding": { "code": "43", "system": "ncbigene" },
"relation": "relatedMatch"
},
{
"coding": { "code": "OTTHUMG00000157033", "system": "vega" },
"relation": "relatedMatch"
},
{
"coding": { "code": "uc003uxi.4", "system": "ucsc" },
"relation": "relatedMatch"
},
{
"coding": { "code": "CCDS5710", "system": "ccds" },
"relation": "relatedMatch"
}
],
"aliases": ["3.1.1.7", "YT", "N-ACHE", "ARACHE", "ACEE"],
"extensions": [
{
"name": "ncbi_locations",
"value": [
{
"id": "ga4gh:SL.U7vPSlX8eyCKdFSiROIsc9om0Y7pCm2g",
"type": "SequenceLocation",
"sequenceReference": {
"type": "SequenceReference",
"refgetAccession": "SQ.F-LrLMe1SRpfUZHkQmvkVKFEGaoDeHul"
},
"start": 100889993,
"end": 100896994
}
],
"type": "Extension"
},
{
"name": "ncbi_gene_type",
"type": "Extension",
"value": "protein-coding"
},
{ "name": "strand", "type": "Extension", "value": "-" }
]
}
]
},
"source_meta": {
"hgnc": {
"data_license": "custom",
"data_license_url": "https://www.ncbi.nlm.nih.gov/home/about/policies/",
"version": "20221021",
"data_url": "ftp://ftp.ncbi.nlm.nih.gov",
"rdp_url": "https://reusabledata.org/ncbi-gene.html",
"data_license_attributes": {
"non_commercial": false,
"attribution": false,
"share_alike": false
},
"genome_assemblies": ["GRCh38.p14"]
},
"ensembl": {
"data_license": "custom",
"data_license_url": "https://www.ncbi.nlm.nih.gov/home/about/policies/",
"version": "20221021",
"data_url": "ftp://ftp.ncbi.nlm.nih.gov",
"rdp_url": "https://reusabledata.org/ncbi-gene.html",
"data_license_attributes": {
"non_commercial": false,
"attribution": false,
"share_alike": false
},
"genome_assemblies": ["GRCh38.p14"]
},
"ncbi": {
"data_license": "custom",
"data_license_url": "https://www.ncbi.nlm.nih.gov/home/about/policies/",
"version": "20221021",
"data_url": "ftp://ftp.ncbi.nlm.nih.gov",
"rdp_url": "https://reusabledata.org/ncbi-gene.html",
"data_license_attributes": {
"non_commercial": false,
"attribution": false,
"share_alike": false
},
"genome_assemblies": ["GRCh38.p14"]
}
}
},
"service_meta_": {
"name": "gene-normalizer",
"version": "0.3.0-dev1",
"response_datetime": "2023-09-29 14:53:07.329897",
"url": "https://github.com/cancervariants/gene-normalization"
}
}
# selected:
from enum import IntEnum, StrEnum  # StrEnum requires Python 3.11+
from typing import Dict, List, Optional

from pydantic import BaseModel, StrictStr

# Gene, CURIE, SourceMeta, and ServiceMeta are not defined here; they are
# assumed to come from the existing GA4GH core / Gene Normalizer schemas.


class MatchType(IntEnum):
    """Define integer constraints for use in Match Type attributes."""
    CONCEPT_ID = 100
    SYMBOL = 100
    PREV_SYMBOL = 80
    ALIAS = 60
    XREF = 60
    # ASSOCIATED_WITH = 60
    FUZZY_MATCH = 20
    NO_MATCH = 0

class WarningType(StrEnum):
    """Define possible warning types."""
    MULTIPLE_NORMALIZED_CONCEPTS = "multiple_normalized_concepts_found"
    NBSP = "non_breaking_space_characters"

class Warning(BaseModel):
    """Define warning structure."""
    type: WarningType
    description: StrictStr

class _Service(BaseModel):
    query: StrictStr
    additional_params: Optional[Dict] = None
    service_meta_: ServiceMeta

class MatchSourceMeta(BaseModel):
    hgnc: Optional[SourceMeta]
    ensembl: Optional[SourceMeta]
    ncbi: Optional[SourceMeta]

class _Match(BaseModel):
    warnings: Optional[List[Warning]] = None
    source_meta: MatchSourceMeta

class SourceSearchMatch(BaseModel):
    match_type: MatchType
    genes: List[Gene] = []

class SearchMatch(_Match):
    hgnc_matches: Optional[SourceSearchMatch] = None
    ensembl_matches: Optional[SourceSearchMatch] = None
    ncbi_matches: Optional[SourceSearchMatch] = None

class SearchService(_Service):
    # Note: the example search response in this gist uses the key "match".
    matches: SearchMatch

class NormalizeMatch(_Match):
    match_type: MatchType
    normalized_id: Optional[CURIE]
    gene: Optional[Gene]

class NormalizeService(_Service):
    match: NormalizeMatch

class NormalizeUnmergedMatches(BaseModel):
    hgnc_genes: Optional[List[Gene]] = None
    ensembl_genes: Optional[List[Gene]] = None
    ncbi_genes: Optional[List[Gene]] = None

class NormalizedUnmergedMatch(_Match):
    match_type: MatchType
    normalized_id: CURIE
    source_genes: NormalizeUnmergedMatches

class NormalizeUnmergedService(_Service):
    match: NormalizedUnmergedMatch
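
One thing worth keeping in mind about the MatchType values above: Python enums treat members that share a value as aliases of the first member defined with that value. The standalone snippet below repeats the enum (docstring dropped) to show the consequence; it is an illustration of standard `enum` behavior, not a proposed change.

```python
from enum import IntEnum


class MatchType(IntEnum):
    CONCEPT_ID = 100
    SYMBOL = 100  # same value as CONCEPT_ID, so this becomes an alias
    PREV_SYMBOL = 80
    ALIAS = 60
    XREF = 60  # same value as ALIAS, so this becomes an alias
    FUZZY_MATCH = 20
    NO_MATCH = 0


# Aliases resolve to the first member defined with that value.
assert MatchType.SYMBOL is MatchType.CONCEPT_ID
assert MatchType(60).name == "ALIAS"

# Iteration skips aliases, so only five distinct members are listed.
member_names = [member.name for member in MatchType]
print(member_names)
```

So a serialized match_type of 100 cannot say whether the query matched a concept ID or a symbol, and 60 cannot distinguish an alias from an xref.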
{
"query": "ACHE",
"additional_params": {
"sources": "ncbi,hgnc"
},
"match": {
"warnings": [
{
"type": "type of the warning",
"description": "description of the warning"
}
],
"hgnc_matches": {
"match_type": 100,
"genes": [
{
"type": "Gene",
"id": "hgnc:108",
"label": "ACHE",
"mappings": [
{
"coding": { "code": "ENSG00000087085", "system": "ensembl" },
"relation": "relatedMatch"
},
{
"coding": { "code": "43", "system": "ncbigene" },
"relation": "relatedMatch"
}
],
"aliases": ["3.1.1.7", "YT", "N-ACHE", "ARACHE", "ACEE"],
"extensions": [
{
"name": "previous_symbols",
"value": ["ACEE", "YT"],
"type": "Extension"
},
{
"name": "symbol_status",
"value": "approved",
"type": "Extension"
},
{
"name": "hgnc_locus_type",
"type": "Extension",
"value": "gene with protein product"
},
{ "name": "strand", "type": "Extension", "value": "-" }
]
}
]
},
"ncbi_matches": {
"match_type": 100,
"genes": [
{
"type": "Gene",
"id": "ncbi.gene:43",
"label": "ACHE",
"mappings": [
{
"coding": { "code": "ENSG00000087085", "system": "ensembl" },
"relation": "relatedMatch"
},
{
"coding": { "code": "43", "system": "ncbigene" },
"relation": "relatedMatch"
},
{
"coding": { "code": "OTTHUMG00000157033", "system": "vega" },
"relation": "relatedMatch"
},
{
"coding": { "code": "uc003uxi.4", "system": "ucsc" },
"relation": "relatedMatch"
},
{
"coding": { "code": "CCDS5710", "system": "ccds" },
"relation": "relatedMatch"
}
],
"aliases": ["3.1.1.7", "YT", "N-ACHE", "ARACHE", "ACEE"],
"extensions": [
{
"name": "ncbi_locations",
"value": [
{
"id": "ga4gh:SL.U7vPSlX8eyCKdFSiROIsc9om0Y7pCm2g",
"type": "SequenceLocation",
"sequenceReference": {
"type": "SequenceReference",
"refgetAccession": "SQ.F-LrLMe1SRpfUZHkQmvkVKFEGaoDeHul"
},
"start": 100889993,
"end": 100896994
}
],
"type": "Extension"
},
{
"name": "ncbi_gene_type",
"type": "Extension",
"value": "protein-coding"
},
{ "name": "strand", "type": "Extension", "value": "-" }
]
}
]
},
"source_meta": {
"hgnc": {
"data_license": "custom",
"data_license_url": "https://www.genenames.org/about/",
"version": "20221019",
"data_url": "ftp://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/json/hgnc_complete_set.json",
"rdp_url": null,
"data_license_attributes": {
"non_commercial": false,
"attribution": false,
"share_alike": false
},
"genome_assemblies": []
},
"ncbi": {
"data_license": "custom",
"data_license_url": "https://www.ncbi.nlm.nih.gov/home/about/policies/",
"version": "20221021",
"data_url": "ftp://ftp.ncbi.nlm.nih.gov",
"rdp_url": "https://reusabledata.org/ncbi-gene.html",
"data_license_attributes": {
"non_commercial": false,
"attribution": false,
"share_alike": false
},
"genome_assemblies": ["GRCh38.p14"]
}
}
},
"service_meta_": {
"name": "gene-normalizer",
"version": "0.3.0-dev1",
"response_datetime": "2023-09-29 14:53:07.329897",
"url": "https://github.com/cancervariants/gene-normalization"
}
}
@korikuzma

  • I agree with this. For the match_type, maybe we could create an object for it containing the score + what it describes?
  • For mappings, we should rethink the system. In GK-Pilot data, Alex had used URLs (e.g. https://ncit.nci.nih.gov/ncitbrowser/ConceptReport.jsp?dictionary=NCI_Thesaurus&code=). Do we want to do this?
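
For the first point, a minimal sketch of what such an object might look like, using only the standard library. `MatchKind`, `SCORES`, and `describe_match` are hypothetical names, and the scores are copied from the MatchType values in the gist.

```python
from enum import Enum
from typing import Dict, Union


class MatchKind(str, Enum):
    """Hypothetical labels for what the query matched on."""
    CONCEPT_ID = "concept_id"
    SYMBOL = "symbol"
    PREV_SYMBOL = "prev_symbol"
    ALIAS = "alias"
    XREF = "xref"
    FUZZY_MATCH = "fuzzy_match"
    NO_MATCH = "no_match"


# Scores taken from the MatchType enum in the gist.
SCORES: Dict[MatchKind, int] = {
    MatchKind.CONCEPT_ID: 100,
    MatchKind.SYMBOL: 100,
    MatchKind.PREV_SYMBOL: 80,
    MatchKind.ALIAS: 60,
    MatchKind.XREF: 60,
    MatchKind.FUZZY_MATCH: 20,
    MatchKind.NO_MATCH: 0,
}


def describe_match(kind: MatchKind) -> Dict[str, Union[str, int]]:
    """Build a match_type object carrying both the label and its score."""
    return {"kind": kind.value, "score": SCORES[kind]}


print(describe_match(MatchKind.SYMBOL))
```

Keeping the label separate from the score means two kinds can share a score without becoming indistinguishable in the response.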
