@cedrickchee
Last active May 17, 2018 00:13
Some possible NLP applications in genomics

Originally forked from Philipp Bayer's gist. All credit goes to him.

This gist converts the original text to Markdown for better readability.

Problems and Ideas:

1. Gene function prediction - given a predicted protein or gene sequence, what is the function?

The classic approach is to use something like BLAST to compare with known sequences, but this has many drawbacks. For starters, in plants the databases lean very heavily towards Arabidopsis thaliana rather than more commonly grown plants such as maize or wheat.

People get around this by looking for protein domains (using Hidden Markov Models), but that doesn't go very far either: you have to describe the domains first, and many are very generic.

Can we classify protein/gene sequences using RNNs/CNNs? Here's an example where someone tried.
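One way to make a sequence look like text for an RNN/CNN is to cut it into overlapping k-mers and treat each k-mer as a "word". The sketch below is only an illustration of that preprocessing step, not the method used in the linked example: the function names, the choice of k=3, and the two short protein strings are all made up for the demonstration.

```python
from collections import Counter


def kmer_tokens(seq, k=3, stride=1):
    """Split a sequence into overlapping k-mers ("words")."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(sequences, k=3):
    """Map every k-mer seen in the training sequences to an integer id.

    Id 0 is reserved for k-mers that were never seen (unknown tokens).
    """
    counts = Counter(tok for s in sequences for tok in kmer_tokens(s, k))
    # Most frequent first, ties broken alphabetically, so ids are deterministic.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {tok: i + 1 for i, tok in enumerate(ordered)}


def encode(seq, vocab, k=3):
    """Turn a sequence into the list of integer ids a model would consume."""
    return [vocab.get(tok, 0) for tok in kmer_tokens(seq, k)]


# Hypothetical toy "training set" of two protein fragments.
proteins = ["MKTAYIAKQR", "MKTLLVAGQR"]
vocab = build_vocab(proteins, k=3)
ids = encode("MKTAYQ", vocab, k=3)
```

The resulting integer lists can be padded to a common length and fed to an embedding layer, exactly as one would do with word ids in a text classifier.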

There are also graph databases that link genes with similar genes, the literature, and protein domains (I'm a bit involved in KNetMiner). Summarising that graph would also be useful.

2. Functional region prediction - given a genome assembly, can we find genes, can we find functional elements?

The majority of a plant genome assembly is retrotransposons/repeats (up to 80%). That's not useful for us; we want to know where the genes are. Currently this is solved by training Hidden Markov Models on known genes and then comparing the predictions with alignments of expressed genes.
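To make the HMM idea concrete, here is a toy two-state model decoded with the Viterbi algorithm. It is a sketch only: real gene finders use far richer state sets (exons, introns, splice sites and so on), and every probability below is invented for illustration rather than trained on known genes.

```python
import math

# Hypothetical toy model: "gene" regions are GC-rich, "intergenic" ones AT-rich.
STATES = ("intergenic", "gene")
START = {"intergenic": 0.9, "gene": 0.1}
TRANS = {
    "intergenic": {"intergenic": 0.9, "gene": 0.1},
    "gene": {"intergenic": 0.1, "gene": 0.9},
}
EMIT = {
    "intergenic": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "gene": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def viterbi(seq):
    """Return the most likely state for each base of a non-empty sequence."""
    # score[s] = best log-probability of any path that ends in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best previous state to arrive in s from.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


labels = viterbi("ATTAGCGCGCCGATTA")
```

With these made-up parameters the GC-rich stretch in the middle is labelled `gene` and the AT-rich flanks `intergenic`.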

The problems are that you won't see rarely expressed genes, the whole process takes a long time, and it's easy to miss genes or to misassemble them (splitting a longer gene into 'sub' genes, etc.). I am not aware of any classifier that takes a genome assembly and finds genes, but there are some that find other, smaller regulatory elements.

  • DeepSEA is a good example of finding regulatory regions.
  • I played around a bit with dna2vec here
  • IMHO classifying regions into genes, regulatory elements, and pseudogenes all together would be amazing.
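Classifiers of this kind usually need the assembly cut into fixed-length windows and turned into numbers first. The sketch below shows one common way to do that, one-hot encoding over sliding windows. The window size, step, and the short `contig` string are arbitrary placeholders, and this is not the actual preprocessing code of any of the tools named above.

```python
BASES = "ACGT"


def one_hot(seq):
    """Return a len(seq) x 4 matrix; unknown bases such as N become all zeros."""
    index = {b: i for i, b in enumerate(BASES)}
    rows = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


def windows(seq, size, step):
    """Yield (start, subsequence) for every full-length window."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


# Hypothetical 16 bp contig; real windows would be hundreds of bases or more.
contig = "ACGTNNACGTTTGACA"
batch = [(start, one_hot(w)) for start, w in windows(contig, size=8, step=4)]
```

Each window then gets one label (gene, regulatory element, pseudogene, ...) and the stack of matrices goes into a CNN much like a batch of small images.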

3. Population genetics

4. There was a lot of hype recently around DeepVariant

It takes pictures of genomic read alignments against a reference and calls genomic variants from them, but IMHO it didn't really improve on the accuracy we had using regular text-based comparisons.
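To show the difference between the two views, here is a small sketch that builds an image-like pileup matrix from reads and then does a plain count-based check on the same data. It is a simplified illustration, not DeepVariant's real encoding: the reads are assumed to be gap-free and already aligned, and the 0/1/2 cell values and the `min_fraction` threshold are invented for the example.

```python
def pileup_matrix(reference, reads):
    """Build one row per read and one column per reference position.

    reads is a list of (start, sequence) pairs, already aligned without gaps.
    Cell values: 0 = no coverage, 1 = matches reference, 2 = mismatch.
    """
    matrix = []
    for start, seq in reads:
        row = [0] * len(reference)
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                row[pos] = 1 if base == reference[pos] else 2
        matrix.append(row)
    return matrix


def candidate_sites(matrix, min_fraction=0.5):
    """Columns where at least min_fraction of the covering reads mismatch."""
    sites = []
    n_cols = len(matrix[0]) if matrix else 0
    for col in range(n_cols):
        column = [row[col] for row in matrix if row[col] != 0]
        if column and column.count(2) / len(column) >= min_fraction:
            sites.append(col)
    return sites


# Hypothetical toy data: two of three reads carry a T at reference position 4.
reference = "ACGTACGTAC"
reads = [(0, "ACGTTCGT"), (2, "GTTCGTAC"), (3, "TACGTAC")]
image = pileup_matrix(reference, reads)
sites = candidate_sites(image)
```

The `image` matrix is the kind of input an image classifier would see, while `candidate_sites` stands in for the regular counting-based comparison.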

These are the areas I work with; there is so much more out there now!
