nhoffman/codesample.rst Secret

## codesample.rst

      
Python code sample
==================

We're looking for someone comfortable with both basic concepts in
bioinformatics and the Python programming language (or at least
willing to learn). One useful way for us to get a sense of your
skill in these areas is a working code sample. Here's what should
be a fairly straightforward exercise:

- Perform a web BLAST query against nr using one or more DNA sequences
  read from a file in FASTA format (you can use any third-party
  libraries installable from PyPI that you like).
- If you include dependencies outside of the standard library, please
  include instructions for installing them into a virtualenv.
- The script should use the argparse module to read options from
  the command line.
- The output should be in the form of one or more tables written to an
  sqlite database summarizing the results of the BLAST query. Include
  the fields that you think are most informative.
- Construct the database so that it is easy to identify the matching
  records for each query sequence by (at least) description, percent
  ID, and E-value.
- Please use Python 3.5+.
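
As a rough illustration of the command-line and database requirements
above, here is a minimal sketch using only the standard library. It is
not a reference solution: the option names (``--outfile``,
``--max-hits``), the table and column names, and the helper functions
are all assumptions made for the example, and the BLAST query itself is
left out entirely. Rows are passed in as plain tuples that a real script
would build from the parsed BLAST results.

```python
import argparse
import sqlite3


def build_parser():
    """Command-line interface; the option names are illustrative only."""
    parser = argparse.ArgumentParser(
        description="BLAST sequences from a FASTA file and store hits in sqlite.")
    parser.add_argument("fasta", help="input file in FASTA format")
    parser.add_argument("-o", "--outfile", default="blast_results.db",
                        help="sqlite database to write (default: %(default)s)")
    parser.add_argument("--max-hits", type=int, default=10,
                        help="hits to keep per query (default: %(default)s)")
    return parser


# Hypothetical schema: one row per query sequence, one row per hit.
SCHEMA = """
CREATE TABLE IF NOT EXISTS queries (
    query_id    TEXT PRIMARY KEY,
    description TEXT,
    length      INTEGER
);
CREATE TABLE IF NOT EXISTS hits (
    query_id    TEXT NOT NULL REFERENCES queries(query_id),
    accession   TEXT,
    description TEXT,
    pct_id      REAL,
    evalue      REAL,
    bitscore    REAL
);
CREATE INDEX IF NOT EXISTS hits_by_query ON hits (query_id, evalue);
"""


def save_results(con, queries, hits):
    """Create the tables if needed, then insert query and hit tuples."""
    con.executescript(SCHEMA)
    con.executemany("INSERT OR REPLACE INTO queries VALUES (?, ?, ?)", queries)
    con.executemany("INSERT INTO hits VALUES (?, ?, ?, ?, ?, ?)", hits)
    con.commit()


def best_hits(con, query_id):
    """Return (description, pct_id, evalue) rows for one query, best first."""
    return con.execute(
        "SELECT description, pct_id, evalue FROM hits "
        "WHERE query_id = ? ORDER BY evalue, pct_id DESC", (query_id,)
    ).fetchall()
```

Keeping queries and hits in separate tables joined on ``query_id`` is
one way to make "matching records for each query sequence" a single
``SELECT``; a single denormalized table would also satisfy the brief.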

We are looking for a (hopefully self-documenting) Python script, and
that's it. In general, simpler is better.

Here are some input sequences to try out:
>NC_009085_A1S_r15 NC_009085.1 Acinetobacter baumannii ATCC 17978 chromosome, complete genome.
ATTGAACGCTGGCGGCAGGCTTAACACATGCAAGTCGAGCGGGGGAAGGTAGCTTGCTAC
TGGACCTAGCGGCGGACGGGTGAGTAATGCTTAGGAATCTGCCTATTAGTGGGGGACAAC
ATCTCGAAAGGGATGCTAATACCGCATACGTCCTACGGGAGAAAGCAGGGGATCTTCGGA
CCTTGCGCTAATAGATGAGCCTAAGTCGGATTAGCTAGTTGGTGGGGTAAAGGCCTACCA
AGGCGACGATCTGTAGCGGGTCTGAGAGGATGATCCGCCACACTGGGACTGAGACACGGC
CCAGA
>NC_003909_BCE_5738 NC_003909.8 Bacillus cereus ATCC 10987, complete genome.
GATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGAATGGATTAAGAGCTTGCT
CTTATGAAGTTAGCGGCGGACGGGTGAGTAACACGTGGGTAACCTGCCCATAAGACTGGG
ATAACTCCGGGAAACCGGGGCTAATACCGGATAACATTTTGAACCGCATGGTTCGAAATT
GAAAGGCGGCTTCGGCTGTCACTTATGGATGGACCCGCGTCGCATTAGCTAGTTGGTGAG
GTAACGGCTCACCAAGGCAACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACACTGG
GACTGAGACACGGCCCAGA
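
Input in this format can be read with a few lines of standard-library
Python. The sketch below assumes the simple layout shown above (a
``>`` header line followed by one or more sequence lines); the function
names ``read_fasta`` and ``_record`` are made up for the example. A
third-party parser such as Biopython's ``SeqIO`` would be a reasonable
alternative.

```python
def read_fasta(lines):
    """Yield (seq_id, description, sequence) from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)


def _record(header, chunks):
    """Split the header at the first space and join the sequence lines."""
    seq_id, _, description = header.partition(" ")
    return seq_id, description, "".join(chunks)
```

An open file object can be passed directly, for example
``with open(path) as handle: records = list(read_fasta(handle))``.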

This isn't meant to consume a lot of your time; a minimal solution is
fine. We really just want to be able to gauge your ability to write
clean, idiomatic Python and your familiarity (or ability to
become familiar) with Python as a bioinformatics tool.