We're looking for someone comfortable with both basic concepts in bioinformatics and the Python programming language (or at least the willingness to learn). One useful way for us to get a sense for your skill in these areas is in the form of a working code sample. Here's what should be a fairly straightforward exercise:
- Perform a web BLAST query against nr using one or more DNA sequences read from a file in FASTA format (you may use any third-party libraries installable from PyPI).
- If you include dependencies outside of the standard library, please include instructions for installation to a virtualenv.
- The script should use the argparse module to read options from the command line.
- The output should be in the form of one or more tables written to an SQLite database summarizing the results of the BLAST query. Include the fields that you think are most informative.
- Construct the database so that it is easy to identify the matching records for each query sequence by (at least) description, percent ID, and E-value.
- Please use Python 3.5+.
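
To make the command-line and database requirements above concrete, here is one possible shape for them, offered as a minimal sketch rather than a reference solution. The option names (`--database`, `--max-hits`), the table names `queries` and `hits`, their columns, and the helper `best_hits` are all illustrative assumptions, not part of the exercise. The sketch uses only the standard library and does not run BLAST itself.

```python
import argparse
import sqlite3


def parse_args(argv=None):
    """Read options from the command line (option names are illustrative)."""
    parser = argparse.ArgumentParser(
        description="BLAST sequences from a FASTA file and store the hits in SQLite.")
    parser.add_argument("fasta", help="input file of DNA sequences in FASTA format")
    parser.add_argument("-o", "--database", default="blast_results.db",
                        help="output SQLite file (default: %(default)s)")
    parser.add_argument("--max-hits", type=int, default=10,
                        help="hits to keep per query (default: %(default)s)")
    return parser.parse_args(argv)


# Hypothetical schema: one row per query sequence, one row per BLAST hit.
SCHEMA = """
CREATE TABLE IF NOT EXISTS queries (
    query_id    TEXT PRIMARY KEY,
    description TEXT,
    length      INTEGER
);
CREATE TABLE IF NOT EXISTS hits (
    query_id    TEXT NOT NULL REFERENCES queries (query_id),
    accession   TEXT,
    description TEXT,
    percent_id  REAL,
    evalue      REAL,
    bit_score   REAL
);
CREATE INDEX IF NOT EXISTS hits_by_query ON hits (query_id, evalue);
"""


def create_database(path):
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def best_hits(conn, query_id, min_percent_id=0.0, max_evalue=10.0):
    """Return (description, percent_id, evalue) rows for one query sequence."""
    sql = """
        SELECT description, percent_id, evalue
        FROM hits
        WHERE query_id = ? AND percent_id >= ? AND evalue <= ?
        ORDER BY evalue, percent_id DESC
    """
    return conn.execute(sql, (query_id, min_percent_id, max_evalue)).fetchall()
```

Keeping hits in their own table, keyed by `query_id`, is what makes it easy to pull the matching records for a single query sequence and filter them by percent ID or E-value; a single wide table would also satisfy the brief.
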
We are looking for a (hopefully self-documenting) Python script - that's it. In general, simpler is better.
Here are some input sequences to try out:
>NC_009085_A1S_r15 NC_009085.1 Acinetobacter baumannii ATCC 17978 chromosome, complete genome.
ATTGAACGCTGGCGGCAGGCTTAACACATGCAAGTCGAGCGGGGGAAGGTAGCTTGCTAC
TGGACCTAGCGGCGGACGGGTGAGTAATGCTTAGGAATCTGCCTATTAGTGGGGGACAAC
ATCTCGAAAGGGATGCTAATACCGCATACGTCCTACGGGAGAAAGCAGGGGATCTTCGGA
CCTTGCGCTAATAGATGAGCCTAAGTCGGATTAGCTAGTTGGTGGGGTAAAGGCCTACCA
AGGCGACGATCTGTAGCGGGTCTGAGAGGATGATCCGCCACACTGGGACTGAGACACGGC
CCAGA
>NC_003909_BCE_5738 NC_003909.8 Bacillus cereus ATCC 10987, complete genome.
GATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGAATGGATTAAGAGCTTGCT
CTTATGAAGTTAGCGGCGGACGGGTGAGTAACACGTGGGTAACCTGCCCATAAGACTGGG
ATAACTCCGGGAAACCGGGGCTAATACCGGATAACATTTTGAACCGCATGGTTCGAAATT
GAAAGGCGGCTTCGGCTGTCACTTATGGATGGACCCGCGTCGCATTAGCTAGTTGGTGAG
GTAACGGCTCACCAAGGCAACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACACTGG
GACTGAGACACGGCCCAGA
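
For reference, records like these can be read with a few lines of standard-library Python. The following is only a sketch under simple assumptions (each record starts with a `>` header line, and the identifier is the first whitespace-delimited token of that header); the names `read_fasta` and `_record` are hypothetical, and a third-party FASTA parser from PyPI would be an equally reasonable choice.

```python
def read_fasta(handle):
    """Yield (identifier, description, sequence) for each FASTA record.

    `handle` can be any iterable of lines, such as an open text file.
    """
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        yield _record(header, chunks)


def _record(header, chunks):
    # Split the header into the identifier and the free-text description.
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description, "".join(chunks)
```

Called as `list(read_fasta(open(path)))`, this would return one tuple per sequence, with the wrapped sequence lines joined into a single string.
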
This isn't meant to consume a lot of your time - a minimal solution is fine; we really just want to be able to gauge your ability to write clean, idiomatic Python and your familiarity (or ability to become familiar) with Python as a bioinformatics tool.