Skip to content

Instantly share code, notes, and snippets.

@ravila4
Created April 29, 2018 15:52
Show Gist options
  • Save ravila4/0eb85e7299e604d9844f19d62dad49c2 to your computer and use it in GitHub Desktop.
Save ravila4/0eb85e7299e604d9844f19d62dad49c2 to your computer and use it in GitHub Desktop.
Introduction to Biopython
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"http://biopython.org/assets/images/biopython_logo_s.png\" style=\"height:100px; float:right;\">\n",
"\n",
"\n",
"# Introduction to Biopython"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Biopython Libraries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Biopython is a collection of many libraries. Some of the most common are:\n",
"\n",
"- **Seq** and **SeqRecord** objects\n",
"- **Bio.SeqIO** - sequence input/output\n",
"- **Bio.AlignIO** - alignment input/output\n",
"- **Bio.PopGen** - population genetics\n",
"- **Bio.PDB** - structural bioinformatics\n",
"\n",
"We commonly import only the specific libraries that we are going to use."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Importing the Seq object from the Seq library\n",
"from Bio.Seq import Seq"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sequence Objects"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sequence objects are just python strings with an optional attribute called ***alphabet***. The alphabet is an optional parameter that specifies whether the sequence is DNA, Protein, or something else."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Seq('ATGCTCGCG', Alphabet())"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from Bio.Seq import Seq\n",
"\n",
"# Creating a sequence object\n",
"my_seq = Seq(\"ATGCTCGCG\")\n",
"\n",
"# By default sequence objects are created with a generic alphabet\n",
"my_seq"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Seq('ATGCTCGCG', DNAAlphabet())"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# You can also specify the type of alphabet to use\n",
"# Available alphabets from Bio.Alphabet: generic_dna, generic_protein, generic_rna...\n",
"from Bio.Alphabet import generic_protein, generic_dna\n",
"my_seq = Seq(\"ATGCTCGCG\", generic_dna)\n",
"my_seq"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since sequences are very much like strings, many of the functions that work on strings also work on Sequences. \n",
"For example, find() and count(). You can also index and slice a sequence the same way you would a string."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ATGCTCGCG\n",
"Length: 9\n",
"1\n",
"2\n"
]
}
],
"source": [
"# Print a sequence and its length\n",
"print(my_seq)\n",
"print(\"Length:\", len(my_seq))\n",
"\n",
"#Counting the number of occurrences of AA\n",
"seq_1 = Seq(\"GCAAAT\")\n",
"print(seq_1.count(\"AA\"))\n",
"print(seq_1.count_overlap(\"AA\"))"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"50.0\n"
]
}
],
"source": [
"#### Find the GC content of the following sequence:\n",
"seq_2 = Seq(\"atgcgcaatctacagcccaacatgtcacgttgaggctagctt\", generic_dna)\n",
"\n",
"print((seq_2.count(\"g\") + seq_2.count(\"c\")) / len(seq_2) * 100)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sequences also have special functions to perform common sequence operations, like complement(), translate(), and transcribe()."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tacgcgttagatgtcgggttgtacagtgcaactccgatcgaa\n"
]
},
{
"data": {
"text/plain": [
"Seq('aagctagcctcaacgtgacatgttgggctgtagattgcgcat', DNAAlphabet())"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Finding the complement of seq_2\n",
"print(seq_2.complement())\n",
"\n",
"#### Find the reverse complement of seq_2\n",
"seq_2.reverse_complement()\n",
"\n",
"#### Transcribe seq_2 into RNA\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Seq('MRNLQPNMSR*G*L', HasStopCodon(ExtendedIUPACProtein(), '*'))"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Translating seq_2\n",
"seq_2.translate()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the sequence above contains several stop codons. If you want to end translation at the first stop codon, as it happens in nature, you can set the ***to_stop*** parameter to **True**."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Seq('MRNLQPNMSR', ExtendedIUPACProtein())"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"seq_2.translate(to_stop=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Aditionally, there is a parameter called ***table***, which you can use to specify a particular translation table, e.g. mitochondrial, or bacterial (the default table is called Standard). Biopython uses the translation tables from NCBI: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Seq('MRNLQPNMSRWG', ExtendedIUPACProtein())"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"seq_2.translate(table='Vertebrate Mitochondrial', to_stop=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ***cds*** flag can be used to specify whether the given sequence represents a complete coding sequence. Using this flag will generate an error if the first codon is not a valid start codon, or if the last codon is not a valid stop codon for the sequence type."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"VKKMQSIVLALSLVLVAPM*\n",
"MKKMQSIVLALSLVLVAPM\n"
]
}
],
"source": [
"# In bacteria, the codon \"GTG\" can be a start codon.\n",
"# In this case, it codes for methionine, instead of valine.\n",
"gene = Seq(\"GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGTAA\", generic_dna)\n",
"print(gene.translate(table=\"Bacterial\"))\n",
"print(gene.translate(table=\"Bacterial\", cds=True))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## File Parsing and SeqRecord Objects"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A SeqRecod object is a **Seq** object, plus an identifier. Optionally, you can have other attributes, like description, and annotation and features.\n",
"\n",
"SeqRecord attributes:\n",
"\n",
"- **.seq** - The sequence itself, typically a Seq object.\n",
"- **.id** - The primary ID used to identify the sequence (e.g. accession number).\n",
"- **.name** - A 'common' name/id for the sequence.\n",
"- **.description** - A human readable description or expressive name.\n",
"- **.letter_annotations** - a dictionary of per-letter-annotations (e.g. quality scores).\n",
"- **.annotations** - A dictionary of additional information about the sequence.\n",
"- **.features** - A list of SeqFeature objects with more structured information about the features of a sequence.\n",
"- **.dbxrefs** - A list of database cross-references as strings.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You can create your own SeqRecord object from scratch if you need to:\n",
"from Bio.SeqRecord import SeqRecord\n",
"my_seqrecord = SeqRecord(my_seq, id=\"my sequence id\")\n",
"print(my_seqrecord)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The most common way of creating SeqRecords, however, is by parsing a file. All file parsers in the **Bio.SeqIO** module return these objects."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### FASTA Files"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ID: FJ178195.1\n",
"Name: FJ178195.1\n",
"Description: FJ178195.1 Aeromonas salmonicida subsp. salmonicida A449 beta galactosidase (lacZ) pseudogene mRNA, partial sequence\n",
"Number of features: 0\n",
"Seq('GTGCTGCAGGGGGAATGCCCACTCACACTCGCGCCTCGGGCCAGTGTGGCGCTG...GAG', SingleLetterAlphabet())\n"
]
}
],
"source": [
"# Importing the SeqIO module for parsing and writing files\n",
"from Bio import SeqIO\n",
"\n",
"# Parsing a FASTA file with a single entry:\n",
"record_1 = SeqIO.read('betagal.fasta', 'fasta')\n",
"print(record_1)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"BAA03533.1 HSP60 [Staphylococcus aureus]\n",
"MVKQLKFSEDARQAMLRGVDQLANAVKVTIGPKGRNVVLDKEFTAPLITNDGVTIAKEIELEDPYENMGAKLVQEVANKTNEIAGDGTTTATVLAQAMIQEGLKNVTSGANPVGLRQGIDKAVKVAVEALHENSQKVENKNEIAQVGAISAADEEIGRYISEATEKVGNDGVITIITIEESNRLNTELELGMQFDRGYQSPYMVTDSDKMVAELERPYILVTDKKISSFQDILPLLEQVVQSNRPILIVADEVEGDALTNIVLNRMRGTFTAVAVKAPGFGDRRKAMLEDLAILTGAQVITDDLGLDLKDASIDMLGTASKVEVTKDNTTVVDGDGDENSIDARVSQLKSQIEETESDFDREKLQERLAKLAGGVAVIKVGAASETELKERKLRIEDALNSTRAAVEEGIVAGGGTALVNVYQKVSENEAEGDIETGVNIVLKALTAPVRQIAENAGLEGSVIVERLKNAEPGVGFNGATNEWVNMLRRGIVDPTKVTRSALQHAASVAAMFLTTEAVVASIPEKNNDQPNMGGMPGMM\n",
"BAE54309.1 HSP60 [Lawsonia intracellularis]\n",
"MASKEILFDAKAREKLSRGVDKLANAVKVTLGPKGRNVVIEKSFGSPVITKDGVSVAKEIELEDKFENMGAQMVKEVASKTSDIAGDGTTTATVLAQAIYREGVKLVAAGRNPMAIKRGIDKAVVAVTKELSDITKPTRDQKEIAQVGTISANSDTTIGNIIAEAMAKVGKEGVITVEEAKGLETTLDVVEGMKFDRGYLSPYFVTNPEKMVCELDNPYILCNEKKITSMKDMLPILEQVAKVNRPLLIIAEDVEGEALATLVVNKLRGALQVVAVKAPGFGERRKAMLEDIAILTGGGAIFEDRGIKLENVSLSSLGTAKRVVIDKENTTIVDGAGKSEDIKARVKQIRAQIEETSSDYDREKLQERLAKLVGGVAVIHVGAATETEMKEKKDRVEDALNATRAAVEEGIVPGGGTAFVRSIKVLDDIKPADDDELAGLNIIRRSLEEPLRQIAANAGYEGSIVVEKVREAKDGFGFNAASGEYEDLIKAGVIDPKKVTRIALQNAASVASLLLTTECAIAEKPEPKKDMPMPGGGMGGMGGMDGMY\n",
"KZV09502.1 HSP60 [Saccharomyces cerevisiae]\n",
"MLRSSVVRSRATLRPLLRRAYSSHKELKFGVEGRASLLKGVETLAEAVAATLGPKGRNVLIEQPFGPPKITKDGVTVAKSIVLKDKFENMGAKLLQEVASKTNEAAGDGTTSATVLGRAIFTESVKNVAAGCNPMDLRRGSQVAVEKVIEFLSANKKEITTSEEIAQVATISANGDSHVGKLLASAMEKVGKEGVITIREGRTLEDELEVTEGMRFDRGFISPYFITDPKSSKVEFEKPLLLLSEKKISSIQDILPALEISNQSRRPLLIIAEDVDGEALAACILNKLRGQVKVCAVKAPGFGDNRKNTIGDIAVLTGGTVFTEELDLKPEQCTIENLGSCDSITVTKEDTVILNGSGPKEAIQERIEQIKGSIDITTTNSYEKEKLQERLAKLSGGVAVIRVGGASEVEVGEKKDRYDDALNATRAAVEEGILPGGGTALVKASRVLDEVVVDNFDQKLGVDIIRKAITRPAKQIIENAGEEGSVIIGKLIDEYGDDFAKGYDASKSEYTDMLATGIIDPFKVVRSGLVDASGVASLLATTEVAIVDAPEPPAAAGAGGMPGGMPGMPGMM\n",
"AAC32614.3 Hsp60 [Cryptosporidium parvum]\n",
"MLLRSGINLYKSVEGSIGLRSAAIRFGMRYISSGKELSFGGKARKEMLKGANDLADAVGVTLGPRGRNVVIEQGFGEAPKITKDGVTVAKAIQFGKGSVNLGAQLLKNVAISTNEEAGDGTTTATVLARAIFKSGCEKVDAGLNPMDLLRGIKLGVEHVVNELDLLSQPVKSHDDILNVATISANGDSIVGSLIAQAYSKVGRHGTINIEEGNTTQSELEIVEGLKLDKGYISPYFITNQKYQKVELENPYILISQGKISSLKSILPILEFCISSRSPLLIIAEEIEGEALTALILNKLQLNLKVCAVKAPGFGDHRKQILEDISVSVGAKIIQEEFSNAKLDQMNSNQIQEFLGKCKSISVSKDETIITQGQGSPKDVKDTISLLKSQIEENQKLTDYDKEKLRERLARLTGRVALIKIGGYSDTEISELKDRFIDALNATKCAIEQGIVPGGGSALLWASRNLGKLYSQSPPPGKTLTPSQSSSNESNPIRNYDMAMGVKIVQDACKVPCHLISSNAGFDGSVIVGELVKVFSKGSKHFGFDAQTGQFVDMIESGILDPTKVVKSGLRDAASIASLMTTTQVSVFEPSNQSEKNNSSGSNSSESSSSFGSLPGDFY\n",
"AAC68350.1 HSP60 [Chlamydia trachomatis D/UW-3/CX]\n",
"MPHDNNEMHRNTIHQLFTGLDKAYQIVKGFYGPAYSSSSKDFFKGRGYHILSRIELSDPFERIGVYFARSLAKRIHKRHADGVISSVILLRAFLKASIPFIDQGLSPRLLASALASQKEAVCAYLHSHSFLLKDASKVLGLIRSHLPDPLIGEAFAEAVAYTGHEGAVALSQRSGSTLHLVKGIQTQKGYRVPSFFPHDSFHENPIVAPKIFVTDQKIHCLFPFLPLLKKFSEEQTPLIIFCKEIAPDPLATCIANRIAGLLDVLVVTIPDTTLLEDIALLTGTTVFSSPPFSNKPPIELPLLGSCTWAELSRDHTLLVCENLVPEVVKLKVRQLDHAIHNAEDETSRKLLKKRKHRLENSIAIIPVKQDTTPLHELALKTLNSTQESGFVLGGGAALLYATQSLSSSPEHSQEEQAAVQILQTACRTLLEQLVNSVYMDGKLVADKLCSLGTPSLGFNVVSQQIEDMISAGIITPLNVVLDIFSCSLHTAVDLLLASFTTPPTPAAKEKKT\n"
]
}
],
"source": [
"# Parsing a FASTA file with multiple entries:\n",
"for record in SeqIO.parse('hsp60.fasta', 'fasta'):\n",
" #print(record)\n",
" # Print the description for each FASTA entry\n",
" print(record.description)\n",
" print(record.seq)\n",
"\n",
" ### Print the length of each sequence in the FASTA file\n",
"\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### GenBank Files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"GenBank files work the same way, but have many more features and annotations. The features attribute is itself an object called **SeqFeature**, and has it's own attributes."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Equus caballus\n",
"MGLSDGEWQQVLNVWGKVEADIAGHGQEVLIRLFTGHPETLEKFDKFKHLKTEAEMKASEDLKKHGTVVLTALGGILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISDAIIHVLHSKHPGDFGADAQGAMTKALELFRNDIAAKYKELGFQG*\n",
"Rattus norvegicus\n",
"MGLSDGEWQMVLNIWGKVEGDLAGHGQEVLISLFKAHPETLEKFDKFKNLKSEEEMKSSEDLKKHGCTVLTALGTILKKKGQHAAEIQPLAQSHATKHKIPVKYLEFISEVIIQVLKKRYSGDFGADAQGAMSKALELFRNDIAAKYKELGFQG*\n",
"Homo sapiens\n",
"MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG*\n"
]
}
],
"source": [
"### For each record in the GenBank file \"myoglobin.gbk\", print the source organism\n",
"### and translate the coding sequence.\n",
"### Note that the sequence in the genbank file may contain exons, or other non-coding regions.\n",
"\n",
"for r in SeqIO.parse(\"myoglobin.gbk\", \"genbank\"):\n",
" print(r.annotations['organism'])\n",
" for f in r.features:\n",
" if f.type == \"CDS\":\n",
" print(f.location.extract(r.seq).translate())\n",
"\n",
"\n",
"\n",
"\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Writing Files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***SeqIO.write()*** can be used to write SeqRecords to a file, and is also useful for converting between file formats."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Converting a GenBank file to FASTA\n",
"records = list(SeqIO.parse('myoglobin.gbk', 'genbank'))\n",
"\n",
"SeqIO.write(records, \"myoglobin.fasta\", \"fasta\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# BLAST"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function **qblast()** from the NCBIWWW module is used for running BLAST from the web. It has three required parameters: the program name, the database to search against, and a sequence to search."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"from Bio.Blast import NCBIWWW\n",
"\n",
"# Running blastx over the internet\n",
"query = SeqIO.read(\"betagal.fasta\", \"fasta\").seq\n",
"results = NCBIWWW.qblast('blastx', 'nr', query, )\n",
"\n",
"# The results can only be read once, so we save it to a file:\n",
"with open(\"blast_results.xml\", \"w\") as output:\n",
" output.write(results.read())\n",
" results.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Running BLAST locally (requres a local database) is more useful for BLASTing a large number of sequences."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#from Bio.Blast.Applications import NcbiblastpCommandline\n",
"\n",
"#my_blast = NcbiblastpCommandline(query=\"myoglobin.fasta\", db=\"nr\", evalue=0.001, outfmt=5, out=\"local_blast_results.xml\")\n",
"#my_blast()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Parsing BLAST Output"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"BLAST output should be saved in XML format. The BLAST output contains one record for every query sequence. A BLAST record is a complex object\n",
" which contains many alignments (matches), each of which contains multiple regions (hsps). The hsps have the e-value, query name, and subject (match target) names. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"from Bio.Blast import NCBIXML\n",
"\n",
"results = open(\"blast_results.xml\", 'r')\n",
"blast_records = list(NCBIXML.parse(results))\n",
"\n",
"# Threshold for BLAST results\n",
"E_VALUE_THRESH = 1e-20\n",
"\n",
"for record in blast_records:\n",
" if record.alignments:\n",
" print(\"QUERY: %s\" % record.query[:60])\n",
" for align in record.alignments:\n",
" for hsp in align.hsps:\n",
" if hsp.expect <= E_VALUE_THRESH:\n",
" print(\"MATCH: %s \" % align.title[:60])\n",
" print(\"\\t e-value:\", hsp.expect)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## More Resources"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Biopython Wiki: http://biopython.org/wiki\n",
"- Collection of Jupyter notebooks on Biopython: https://github.com/tiagoantao/biopython-notebook/tree/master/notebooks"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment