mkroon1/proposal_mutalyzer_API.md

## proposal_mutalyzer_API.md

      
RNA sequence reference to identify transcript variant

Problem

Mutalyzer uses transcript variants to disambiguate variant descriptions
at the (non-)coding DNA and protein levels [2]. Currently, a transcript
variant is identified by a gene symbol with an optional transcript order
number or protein isoform number, as provided in the reference sequence
annotation [1].

LOVD uses the Mutalyzer API for many features. Per transcript, it stores
the gene symbol, the transcript accession number and the transcript order
number. Transcript variant annotations are unstable and may change over
time, which can leave LOVD with an invalid transcript order number.

Example variant on coding DNA in LOVD:

    UD_132118785483(NPHP4_v001):c.2902G>A

The transcript variant in this example (NPHP4_v001) is outdated (we
assume it used to be correct). The Mutalyzer.nl Namechecker reports an
error when given this variant as input: "G not found at position 120212,
found T instead."

Changing the transcript variant to NPHP4_v005 (which is annotated
with a transcript sequence) makes the error go away.

Proposed solution

Have Mutalyzer recognize an RNA sequence reference (a transcript
accession number) as a transcript variant identifier. To continue the
example above, the client (LOVD) can call the Mutalyzer Namechecker API
with the following variant description:

    UD_132118785483(NM_015102.3):c.2902G>A

The response would be:

    # Genomic description
    UD_132118785483:g.122458G>A

    # Alternative chromosomal position
    NC_000001.10:g.5935076C>T

    # Affected transcripts
    UD_132118785483(NPHP4_v005):c.2902G>A
    ...

    # Affected proteins
    UD_132118785483(NPHP4_i005):p.(Ala968Thr)
    ...

Note that the descriptions in the response still use the old-style
transcript variant notation (NPHP4_v005 and NPHP4_i005).
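On the client side, the change amounts to building the description from the stored transcript accession number instead of the gene symbol and transcript order number. The sketch below is only an illustration in Python: the function names and the `record` dictionary are hypothetical and do not reflect LOVD's actual code or schema.

```python
def legacy_description(reference, gene_symbol, transcript_number, variant):
    """Current notation: gene symbol plus transcript order number."""
    return "{}({}_v{:03d}):{}".format(
        reference, gene_symbol, transcript_number, variant)


def accession_description(reference, transcript_accession, variant):
    """Proposed notation: the transcript accession number selects the transcript."""
    return "{}({}):{}".format(reference, transcript_accession, variant)


# Hypothetical record holding the per-transcript fields mentioned above.
record = {
    "reference": "UD_132118785483",
    "gene_symbol": "NPHP4",
    "transcript_accession": "NM_015102.3",
    "transcript_number": 1,  # may be outdated
}

print(legacy_description(record["reference"], record["gene_symbol"],
                         record["transcript_number"], "c.2902G>A"))
print(accession_description(record["reference"],
                            record["transcript_accession"], "c.2902G>A"))
```

The second form no longer depends on the transcript order number, which is the field that goes stale when annotations change.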

Implementation details

Tasks:

1.  Extend the grammar to accept a transcript accession number as
    transcript variant. I.e., the current rules:

        # RefSeqAcc  -> (GI | AccNo | UD | LRG) (`(' GeneSymbol `)')?
        GenBankRef = (GI ^ AccNo ^ UD) + Optional(GeneSymbol)
        RefSeqAcc = GenBankRef ^ LRG

    would change into something like:

        # RefSeqAcc  -> (GenBankRef | LRG)
        # GenBankRef -> (GI | AccNo | UD) (GeneSymbol | AccNo)?
        GenBankRef = (GI ^ AccNo ^ UD) + Optional(GeneSymbol ^ AccNo)
        RefSeqAcc = GenBankRef ^ LRG

2.  Adapt the GenRecord code to retrieve a transcript by its accession
    number, e.g. change GenRecord.Gene.findLocus() to accept a
    transcriptID besides a name.

3.  Write a test that queries Mutalyzer with a variant description
    similar to the one in the problem statement above and checks for a
    valid response for the requested transcript variant.
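Task 1 could be prototyped along the following lines. This is a self-contained pyparsing sketch, not Mutalyzer's grammar: the rules `accession`, `gene`, `selector` and `description` are simplified stand-ins, and the results names (`RefSeq`, `Transcript`, `Gene`, ...) are made up for the example. It assumes that the concern about AccNo appearing twice comes from both occurrences sharing one results name, and sidesteps it by building a separate accession rule per role.

```python
from pyparsing import (Combine, Group, Optional, Suppress, Word,
                       alphanums, alphas, nums, printables)


def accession(results_name):
    """Build a fresh accession rule stored under results_name.

    Simplified stand-in for the AccNo/UD rules: letters, an underscore,
    digits and an optional version (e.g. NM_015102.3).
    """
    number = Combine(Word(alphas) + '_' + Word(nums))('Acc')
    version = Optional(Suppress('.') + Word(nums)('Version'))
    return Group(number + version)(results_name)


# Simplified gene symbol selector, e.g. NPHP4_v001.
gene = Group(Word(alphanums)('GeneSymbol')
             + Optional(Suppress('_v') + Word(nums)('TransVar')))('Gene')

# Either selector may appear between the parentheses. Or (^) keeps the
# longest match, so NM_015102.3 is read as an accession, not as gene "NM".
selector = Suppress('(') + (gene ^ accession('Transcript')) + Suppress(')')
description = (accession('RefSeq') + Optional(selector)
               + Suppress(':') + Word(printables)('Variant'))


def parse(text):
    """Parse a complete variant description with the sketch grammar."""
    return description.parseString(text, parseAll=True)


new_style = parse('UD_132118785483(NM_015102.3):c.2902G>A')
print(new_style.Transcript.Acc, new_style.Transcript.Version)

old_style = parse('UD_132118785483(NPHP4_v001):c.2902G>A')
print(old_style.Gene.GeneSymbol, old_style.Gene.TransVar)
```

Note that the selector rule carries its own parentheses here, so the accession alternative is only accepted inside them, as in the example description UD_132118785483(NM_015102.3):c.2902G>A.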


Notes:

- The Position converter does not need to be adapted since it does
  not accept a transcript variant anyway. (@martijnvermaat: is this
  correct?)
- The Python example in task 1 above probably won't work as written,
  because the variable AccNo appears twice in the same rule. I will fix
  it once I have a better understanding of the pyparsing module.
- @jfjlaros mentioned in an email that the version number is not
  needed when specifying a transcript variant by accession number.
  (@martijnvermaat: can you explain why and how?)
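To make task 2 and the remark about the version number more concrete, here is a minimal sketch of a lookup that accepts either a transcript order number or an accession number. The classes `Locus` and `Gene`, the method `find_locus` and the helper `same_accession` are hypothetical stand-ins, not Mutalyzer's GenRecord code. The version handling is one possible reading of the remark, namely that a version is only enforced when the request includes one.

```python
class Locus(object):
    """Minimal stand-in for one annotated transcript."""

    def __init__(self, name, transcript_id=None):
        self.name = name                    # transcript order number, e.g. "005"
        self.transcript_id = transcript_id  # e.g. "NM_015102.3", or None


def same_accession(annotated, requested):
    """Compare accessions; the version must match only if one is requested."""
    if annotated is None:
        return False
    if '.' in requested:
        return annotated == requested
    return annotated.split('.')[0] == requested


class Gene(object):
    """Minimal stand-in for a gene with its annotated transcripts."""

    def __init__(self, name, loci):
        self.name = name
        self.loci = loci

    def find_locus(self, name=None, transcript_id=None):
        """Return the first locus matching the order number or the accession."""
        for locus in self.loci:
            if name is not None and locus.name == name:
                return locus
            if transcript_id is not None and same_accession(
                    locus.transcript_id, transcript_id):
                return locus
        return None


# Example data: only NPHP4_v005 carries a transcript accession here.
nphp4 = Gene("NPHP4", [Locus("001"), Locus("005", "NM_015102.3")])

print(nphp4.find_locus(transcript_id="NM_015102.3").name)
print(nphp4.find_locus(transcript_id="NM_015102").name)
print(nphp4.find_locus(transcript_id="NM_015102.2"))
```

Whether a request for a version that differs from the annotated one should fail, as it does in this sketch, or fall back to the annotated version is a design decision the proposal leaves open.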