@mkroon1
Created May 30, 2016 10:27
proposal Mutalyzer API

RNA sequence reference to identify transcript variant

Problem

Mutalyzer uses transcript variants to disambiguate descriptions on (non-)coding DNA and protein level [2]. Currently, gene symbols with an optional transcript order number or protein isoform (as provided in the reference sequence annotation) are used to identify transcript variants [1].

LOVD uses the Mutalyzer API for many features. Per transcript it stores the gene symbol, transcript accession number and transcript order number. Transcript variant annotations are unstable and may change over time, causing LOVD to use an invalid transcript number.

Example variant on coding DNA in LOVD:

UD_132118785483(NPHP4_v001):c.2902G>A

The transcript variant in this example (NPHP4_v001) is outdated; we assume it used to be correct. The Mutalyzer.nl Namechecker complains when given this variant as input: "G not found at position 120212, found T instead."

Changing the transcript variant to NPHP4_v005 (which is annotated with a transcript sequence) makes the error go away.

Proposed solution

Have Mutalyzer recognize an RNA sequence reference as transcript variant. To continue the example above, the client (LOVD) can call the Mutalyzer Namechecker API with the following variant description:

UD_132118785483(NM_015102.3):c.2902G>A

The response would be:

# Genomic description
UD_132118785483:g.122458G>A

# Alternative chromosomal position
NC_000001.10:g.5935076C>T

# Affected transcripts
UD_132118785483(NPHP4_v005):c.2902G>A
...

# Affected proteins
UD_132118785483(NPHP4_i005):p.(Ala968Thr)
...

Note that the descriptions in the response use the old-style transcript variant notation.

Implementation details

Tasks:

  1. Extend grammar to include transcript accession number as transcript variant. I.e.:
    # RefSeqAcc  -> (GI | AccNo | UD | LRG) (`(' GeneSymbol `)')?
    GenBankRef = (GI ^ AccNo ^ UD) + Optional(GeneSymbol)
    RefSeqAcc = GenBankRef ^ LRG

Will change into something like:

    # RefSeqAcc  -> (GenBankRef | LRG)
    # GenBankRef -> (GI | AccNo | UD) (GeneSymbol | AccNo)?
    GenBankRef = (GI ^ AccNo ^ UD) + Optional(GeneSymbol ^ AccNo)
    RefSeqAcc = GenBankRef ^ LRG
  2. Adapt GenRecord code to retrieve a transcript by its accession number. E.g. change GenRecord.Gene.findLocus() to accept transcriptID besides name.

  3. Write a test querying Mutalyzer with a variant description similar to the one in the problem statement above, checking for a valid response for the requested transcript variant.
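
The lookup described in the GenRecord task could be approached along these lines. This is a minimal sketch with hypothetical stand-in classes: Locus, Gene, find_locus and its transcript_id parameter are illustrative and are not Mutalyzer's actual GenRecord code. It only shows a lookup that accepts either the transcript order number or the transcript accession number, using the NPHP4 example from the problem statement.

```python
class Locus:
    """Hypothetical stand-in for one annotated transcript (not Mutalyzer's class)."""

    def __init__(self, name, transcript_id):
        self.name = name                    # transcript order number, e.g. "005"
        self.transcript_id = transcript_id  # accession, e.g. "NM_015102.3", or None


class Gene:
    """Hypothetical stand-in for a gene and its annotated transcripts."""

    def __init__(self, name, transcripts):
        self.name = name
        self.transcripts = transcripts

    def find_locus(self, name=None, transcript_id=None):
        """Return the first transcript matching the order number or the accession."""
        for locus in self.transcripts:
            if name is not None and locus.name == name:
                return locus
            if transcript_id is not None and locus.transcript_id == transcript_id:
                return locus
        return None  # no matching transcript


# Assumed annotation: v001 has no transcript sequence, v005 is NM_015102.3.
gene = Gene("NPHP4", [Locus("001", None), Locus("005", "NM_015102.3")])
found = gene.find_locus(transcript_id="NM_015102.3")
print(found.name)  # prints 005
```

Because the client passes the accession number, the result no longer depends on which order number the annotation currently assigns to the transcript.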

Notes:

  • The Position converter does not need to be adapted since it does not accept a transcript variant anyway. (@martijnvermaat: is this correct?)
  • The Python example in task 1 above probably won't work, as the variable AccNo appears twice in the same rule. I will fix it when I have a better understanding of the pyparsing module.
  • @jfjlaros mentioned in an email that the version number is not needed when specifying a transcript variant by accession number. (@martijnvermaat: can you explain why and how?)
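
To make the intended grammar change concrete without depending on pyparsing, here is a rough sketch using Python's standard re module instead. The token shapes (ACC, the gene symbol pattern) and the group names are simplified assumptions, much narrower than the real Mutalyzer grammar. The accession pattern is used twice under distinct group names, so the outer reference and the transcript in parentheses stay apart in this sketch; whether pyparsing needs similar treatment is left open.

```python
import re

# Assumed, simplified shape covering references such as UD_132118785483
# and NM_015102.3; the version suffix is optional.
ACC = r"[A-Z]{2}_\d+(?:\.\d+)?"

PATTERN = re.compile(
    rf"^(?P<reference>{ACC})"
    # Optional parenthesized part: either a transcript accession (new style)
    # or a gene symbol with an optional transcript order number (old style).
    rf"(?:\((?:(?P<transcript_acc>{ACC})|(?P<gene>[A-Za-z0-9-]+)(?:_v(?P<variant>\d+))?)\))?"
    r":(?P<description>.+)$"
)


def parse_reference(text):
    """Split a variant description into its reference parts (sketch only)."""
    match = PATTERN.match(text)
    if match is None:
        raise ValueError("unrecognized variant description: %r" % text)
    return match.groupdict()


new = parse_reference("UD_132118785483(NM_015102.3):c.2902G>A")
old = parse_reference("UD_132118785483(NPHP4_v005):c.2902G>A")
print(new["transcript_acc"])        # prints NM_015102.3
print(old["gene"], old["variant"])  # prints NPHP4 005
```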