
@jph00
Created May 16, 2018 17:55

(from @philippbayer)

1. Gene function prediction - given a predicted protein or gene sequence, what is the function?

The classic approach is to use something like BLAST to compare against known sequences, but this has many drawbacks. For starters, in plants the databases lean very heavily towards Arabidopsis thaliana rather than more widely grown plants such as maize or wheat.

People get around this by looking for protein domains (using Hidden Markov Models), but that doesn't go very far either: you have to describe the domains first, and many are very generic. Can we classify protein/gene sequences using RNNs/CNNs? Here's an example where someone tried.
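To make the CNN idea a bit more concrete, here is a minimal sketch of what the first layer of such a network does: one-hot encode the sequence and slide a filter along it. The names `one_hot`, `conv1d` and `tata_filter` are made up for illustration, and the filter is written by hand here, whereas a real model would learn its weights from labelled sequences.

```python
# Toy sketch: one-hot encode a DNA sequence and slide a single
# hand-written filter over it, as the first layer of a CNN would.
ALPHABET = "ACGT"


def one_hot(seq):
    """Return one 4-element row per base (unknown bases become all zeros)."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]


def conv1d(encoded, kernel):
    """Valid-mode 1D cross-correlation along the sequence axis."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        window = encoded[start:start + width]
        scores.append(sum(w * x
                          for kernel_row, window_row in zip(kernel, window)
                          for w, x in zip(kernel_row, window_row)))
    return scores


# Hypothetical filter that responds to the motif "TATA"; in a real
# network these weights would be learned, not written by hand.
tata_filter = one_hot("TATA")

scores = conv1d(one_hot("GGTATAAC"), tata_filter)
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, best)
```

The score at each position is simply the number of bases matching the motif, so the highest score marks where "TATA" sits in the sequence. A classifier would stack many such filters and feed their outputs into further layers.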

There are also graph databases which link genes with similar genes, the literature, and protein domains (I'm a bit involved in KNetMiner). Summarising that graph would also be useful.

2. Functional region prediction - given a genome assembly, can we find genes, can we find functional elements?

The majority of a plant genome assembly is retrotransposons/repeats (up to 80%). That's not useful for us; we want to know where the genes are. Currently this is solved by training Hidden Markov Models on known genes and then comparing the predictions with alignments of expressed genes.
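As a rough illustration of the HMM approach, here is a toy two-state model decoded with the Viterbi algorithm. Everything about the model is an assumption for the sake of the example: the states ("repeat" and "gene"), the probabilities, and the idea that genes are simply GC-rich. Real gene finders use far richer state structures trained on known genes.

```python
import math

# Toy parameters, invented for illustration only.
STATES = ("repeat", "gene")
START = {"repeat": 0.8, "gene": 0.2}
TRANS = {
    "repeat": {"repeat": 0.9, "gene": 0.1},
    "gene": {"repeat": 0.1, "gene": 0.9},
}
EMIT = {
    "repeat": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "gene": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}


def viterbi(seq):
    """Return the most likely state label for each base in seq."""
    if not seq:
        return []
    # score[s] = best log-probability of any path ending in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


labels = viterbi("ATATATGCGCGCGCATATAT")
print("".join("G" if s == "gene" else "r" for s in labels))
```

With these made-up numbers the GC-rich stretch in the middle is labelled as "gene" and the AT-rich flanks as "repeat", because switching state twice is cheaper than emitting eight unlikely bases.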

The problems are that you won't see rarely expressed genes, all of it takes a long time, and it's easy to miss genes or to misassemble them (splitting a longer gene into 'sub' genes, etc.). I am not aware of any classifier that takes a genome assembly and finds genes, but there are some which find other, smaller regulatory elements.

  • DeepSEA is a good example of a tool for finding regulatory regions.
  • I played around a bit with dna2vec here
  • IMHO classifying regions into genes, regulatory elements, pseudogenes all together would be amazing.
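
Embedding approaches in the spirit of dna2vec treat short subsequences (k-mers) as the "words" of a sequence. The sketch below only shows that tokenisation step. It does not use the dna2vec package itself, and the helper name `kmers` is made up for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Split a sequence into overlapping k-mers (a sliding window of width k)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


tokens = kmers("ATGCGATG", 3)
counts = Counter(tokens)
print(tokens)
print(counts.most_common(1))
```

The resulting token list could then be passed to any word-embedding model, or the counts used directly as a simple feature vector for a region classifier.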

3. Population genetics

4. There was a lot of hype recently around DeepVariant

It takes pictures of genomic read alignments against a reference and calls genomic variants from there, but IMHO it didn't really improve on the accuracy we already had using regular text-based comparisons.
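
For contrast, a text-based caller at its very simplest just counts the bases stacked up over each reference position. The sketch below is a deliberately naive illustration of that counting idea, not how any production caller works. It assumes reads are given as hypothetical `(start, bases)` pairs with no indels, and the thresholds `min_depth` and `min_fraction` are arbitrary.

```python
from collections import Counter


def pileup(reference, reads):
    """Count the read bases observed at each reference position."""
    columns = [Counter() for _ in reference]
    for start, bases in reads:
        for offset, base in enumerate(bases):
            pos = start + offset
            if 0 <= pos < len(reference):
                columns[pos][base] += 1
    return columns


def call_variants(reference, reads, min_depth=2, min_fraction=0.8):
    """Report positions where most reads disagree with the reference."""
    calls = []
    for pos, counts in enumerate(pileup(reference, reads)):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything
        base, n = counts.most_common(1)[0]
        if base != reference[pos] and n / depth >= min_fraction:
            calls.append((pos, reference[pos], base))
    return calls


reference = "ACGTACGT"
reads = [(0, "ACGAAC"), (2, "GAACGT"), (3, "AACG")]
calls = call_variants(reference, reads)
print(calls)
```

Here all three reads show an A where the reference has a T at position 3 (0-based), so that is the only call. An image-based approach feeds a picture of the same pileup to a neural network instead of applying fixed thresholds like these.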

These are the areas I work with; there is so much more out there now!
