@nhoffman
Last active May 4, 2021 04:08
Python code sample

We're looking for someone comfortable with both basic bioinformatics concepts and the Python programming language (or at least willing to learn). A working code sample is a useful way for us to get a sense of your skill in these areas. Here's what should be a fairly straightforward exercise:

  • Perform a web BLAST query against nr using one or more DNA sequences read from a file in FASTA format (you can use any third-party libraries installable from PyPI that you like).
  • If you include dependencies outside of the standard library, please include instructions for installation to a virtualenv.
  • The script should use the argparse module to read options from the command line.
  • The output should be one or more tables, written to an SQLite database, that summarize the results of the BLAST query. Include the fields that you think are most informative.
  • Construct the database so that it is easy to identify the matching records for each query sequence by (at least) description, percent ID, and E-value.
  • Please use Python 3.5+.
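
To make the command-line and database requirements above concrete, here is a minimal sketch using only the standard library. It is one possible layout, not a reference solution: the option names (`--outfile`, `--max-hits`), the table and column names, and the helper functions are all assumptions made for illustration. Percent identity is taken here to be identical positions divided by alignment length, which is a common convention; the BLAST request itself is left out.

```python
import argparse
import sqlite3


def build_parser():
    """Command-line interface; the option names are illustrative only."""
    parser = argparse.ArgumentParser(
        description="BLAST sequences from a FASTA file and store hits in SQLite.")
    parser.add_argument("fasta", help="input file of DNA sequences in FASTA format")
    parser.add_argument("-o", "--outfile", default="blast_results.db",
                        help="SQLite database to write [%(default)s]")
    parser.add_argument("--max-hits", type=int, default=10,
                        help="maximum hits to keep per query [%(default)s]")
    return parser


# Hypothetical schema: one row per query sequence, one row per hit.
SCHEMA = """
CREATE TABLE IF NOT EXISTS queries (
    query_id    TEXT PRIMARY KEY,
    description TEXT,
    length      INTEGER
);
CREATE TABLE IF NOT EXISTS hits (
    query_id    TEXT NOT NULL REFERENCES queries(query_id),
    accession   TEXT,
    description TEXT,
    pct_id      REAL,
    evalue      REAL,
    bit_score   REAL
);
CREATE INDEX IF NOT EXISTS hits_by_query ON hits (query_id, evalue);
"""


def create_db(path):
    """Open (or create) the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def percent_identity(identities, align_length):
    """Percent ID as identical positions over alignment length."""
    return 100.0 * identities / align_length


def best_hits(conn, query_id, limit=5):
    """Return (description, pct_id, evalue) rows, lowest E-value first."""
    cur = conn.execute(
        "SELECT description, pct_id, evalue FROM hits "
        "WHERE query_id = ? ORDER BY evalue ASC, pct_id DESC LIMIT ?",
        (query_id, limit))
    return cur.fetchall()
```

Keeping queries and hits in separate tables joined on `query_id` makes it straightforward to pull the matching records for one query by description, percent ID, and E-value with a single `SELECT`.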

We are looking for a (hopefully self-documenting) Python script - that's it. In general, simpler is better.

Here are some input sequences to try out:

>NC_009085_A1S_r15 NC_009085.1 Acinetobacter baumannii ATCC 17978 chromosome, complete genome.
ATTGAACGCTGGCGGCAGGCTTAACACATGCAAGTCGAGCGGGGGAAGGTAGCTTGCTAC
TGGACCTAGCGGCGGACGGGTGAGTAATGCTTAGGAATCTGCCTATTAGTGGGGGACAAC
ATCTCGAAAGGGATGCTAATACCGCATACGTCCTACGGGAGAAAGCAGGGGATCTTCGGA
CCTTGCGCTAATAGATGAGCCTAAGTCGGATTAGCTAGTTGGTGGGGTAAAGGCCTACCA
AGGCGACGATCTGTAGCGGGTCTGAGAGGATGATCCGCCACACTGGGACTGAGACACGGC
CCAGA
>NC_003909_BCE_5738 NC_003909.8 Bacillus cereus ATCC 10987, complete genome.
GATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGAATGGATTAAGAGCTTGCT
CTTATGAAGTTAGCGGCGGACGGGTGAGTAACACGTGGGTAACCTGCCCATAAGACTGGG
ATAACTCCGGGAAACCGGGGCTAATACCGGATAACATTTTGAACCGCATGGTTCGAAATT
GAAAGGCGGCTTCGGCTGTCACTTATGGATGGACCCGCGTCGCATTAGCTAGTTGGTGAG
GTAACGGCTCACCAAGGCAACGATGCGTAGCCGACCTGAGAGGGTGATCGGCCACACTGG
GACTGAGACACGGCCCAGA
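
Reading records like the two above could look something like the following standard-library sketch. The function names `read_fasta` and `_record` are hypothetical, and the split of each header into an ID and a description at the first space is an assumption about how the header should be interpreted. A third-party FASTA parser from PyPI would be an equally reasonable choice.

```python
def read_fasta(lines):
    """Yield (seq_id, description, sequence) from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may wrap over several lines
    if header is not None:
        yield _record(header, chunks)


def _record(header, chunks):
    """Split the header at the first space and join the wrapped sequence."""
    seq_id, _, description = header.partition(" ")
    return seq_id, description, "".join(chunks)
```

Because the function accepts any iterable of lines, it can be given an open file handle directly, for example `for seq_id, description, seq in read_fasta(open_file): ...`.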

This isn't meant to consume a lot of your time - a minimal solution is fine; we really just want to be able to gauge your ability to write clean, idiomatic Python and your familiarity (or ability to become familiar) with Python as a bioinformatics tool.
