@hossainlab
Created January 20, 2020 15:32
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# DNA Translation ⇒ DNA to Protein\n",
"**Translation Theory : DNA ⇒ RNA ⇒ Protein**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tasks\n",
"1. Manually download the DNA and protein sequences from NCBI (the protein sequence is used to check our solution).\n",
"2. Import the DNA data into Python.\n",
"3. Create an algorithm for DNA translation.\n",
"4. Check whether the translation matches the downloaded protein sequence."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task-1: Manually Download Sequence File\n",
"- Go to NCBI and select the Nucleotide database.\n",
"- Enter the accession code, **NM_207618.2**.\n",
"- Click on **FASTA**.\n",
"- Copy the sequence (without the FASTA header line) and paste it into a text editor.\n",
"- Save both sequences (DNA and protein) as text files (.txt) in a folder called **data**."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task-2: Import DNA Sequence Data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"inputfile = \"../data/dna.txt\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Read the DNA sequence file; the with-block closes the file automatically\n",
"with open(inputfile, \"r\") as f:\n",
"    seq = f.read()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'GGTCAGAAAAAGCCCTCTCCATGTCTACTCACGATACATCCCTGAAAACCACTGAGGAAGTGGCTTTTCA\\nGATCATCTTGCTTTGCCAGTTTGGGGTTGGGACTTTTGCCAATGTATTTCTCTTTGTCTATAATTTCTCT\\nCCAATCTCGACTGGTTCTAAACAGAGGCCCAGACAAGTGATTTTAAGACACATGGCTGTGGCCAATGCCT\\nTAACTCTCTTCCTCACTATATTTCCAAACAACATGATGACTTTTGCTCCAATTATTCCTCAAACTGACCT\\nCAAATGTAAATTAGAATTCTTCACTCGCCTCGTGGCAAGAAGCACAAACTTGTGTTCAACTTGTGTTCTG\\nAGTATCCATCAGTTTGTCACACTTGTTCCTGTTAATTCAGGTAAAGGAATACTCAGAGCAAGTGTCACAA\\nACATGGCAAGTTATTCTTGTTACAGTTGTTGGTTCTTCAGTGTCTTAAATAACATCTACATTCCAATTAA\\nGGTCACTGGTCCACAGTTAACAGACAATAACAATAACTCTAAAAGCAAGTTGTTCTGTTCCACTTCTGAT\\nTTCAGTGTAGGCATTGTCTTCTTGAGGTTTGCCCATGATGCCACATTCATGAGCATCATGGTCTGGACCA\\nGTGTCTCCATGGTACTTCTCCTCCATAGACATTGTCAGAGAATGCAGTACATATTCACTCTCAATCAGGA\\nCCCCAGGGGCCAAGCAGAGACCACAGCAACCCATACTATCCTGATGCTGGTAGTCACATTTGTTGGCTTT\\nTATCTTCTAAGTCTTATTTGTATCATCTTTTACACCTATTTTATATATTCTCATCATTCCCTGAGGCATT\\nGCAATGACATTTTGGTTTCGGGTTTCCCTACAATTTCTCCTTTACTGTTGACCTTCAGAGACCCTAAGGG\\nTCCTTGTTCTGTGTTCTTCAACTGTTGAAAGCCAGAGTCACTAAAAATGCCAAACACAGAAGACAGCTTT\\nGCTAATACCATTAAATACTTTATTCCATAAATATGTTTTTAAAAGCTTGTATGAACAAGGTATGGTGCTC\\nACTGCTATACTTATAAAAGAGTAAGGTTATAATCACTTGTTGATATGAAAAGATTTCTGGTTGGAATCTG\\nATTGAAACAGTGAGTTATTCACCACCCTCCATTCTCT\\n\\n'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# To see the DNA sequence \n",
"seq"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1175"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check length of DNA seq.\n",
"len(seq)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> The sequence length is not correct! The sequence on NCBI is 1157 characters long. The 18 extra characters are special characters such as **\\n** (newline) and **\\r** (carriage return), which are not part of the sequence. So we have to clean the sequence before translating it."
]
},
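{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the cleaning idea on a toy string (the names `raw` and `cleaned` are hypothetical and not part of the original notebook). `str.split()` with no arguments splits on any run of whitespace, so joining the pieces removes newlines, carriage returns, tabs and spaces in one step. This is one possible alternative to chained `replace()` calls, shown under the assumption that the file contains only sequence letters and whitespace."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy input that mimics a sequence file with line breaks (assumed data, not the NCBI sequence)\n",
"raw = \"GGTCAG\\nAAAAAG\\r\\n\\n\"\n",
"# split() with no arguments splits on any run of whitespace; join glues the pieces back together\n",
"cleaned = \"\".join(raw.split())\n",
"# Compare the length before and after cleaning\n",
"print(len(raw), len(cleaned))\n",
"print(cleaned)"
]
},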
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GGTCAGAAAAAGCCCTCTCCATGTCTACTCACGATACATCCCTGAAAACCACTGAGGAAGTGGCTTTTCA\n",
"GATCATCTTGCTTTGCCAGTTTGGGGTTGGGACTTTTGCCAATGTATTTCTCTTTGTCTATAATTTCTCT\n",
"CCAATCTCGACTGGTTCTAAACAGAGGCCCAGACAAGTGATTTTAAGACACATGGCTGTGGCCAATGCCT\n",
"TAACTCTCTTCCTCACTATATTTCCAAACAACATGATGACTTTTGCTCCAATTATTCCTCAAACTGACCT\n",
"CAAATGTAAATTAGAATTCTTCACTCGCCTCGTGGCAAGAAGCACAAACTTGTGTTCAACTTGTGTTCTG\n",
"AGTATCCATCAGTTTGTCACACTTGTTCCTGTTAATTCAGGTAAAGGAATACTCAGAGCAAGTGTCACAA\n",
"ACATGGCAAGTTATTCTTGTTACAGTTGTTGGTTCTTCAGTGTCTTAAATAACATCTACATTCCAATTAA\n",
"GGTCACTGGTCCACAGTTAACAGACAATAACAATAACTCTAAAAGCAAGTTGTTCTGTTCCACTTCTGAT\n",
"TTCAGTGTAGGCATTGTCTTCTTGAGGTTTGCCCATGATGCCACATTCATGAGCATCATGGTCTGGACCA\n",
"GTGTCTCCATGGTACTTCTCCTCCATAGACATTGTCAGAGAATGCAGTACATATTCACTCTCAATCAGGA\n",
"CCCCAGGGGCCAAGCAGAGACCACAGCAACCCATACTATCCTGATGCTGGTAGTCACATTTGTTGGCTTT\n",
"TATCTTCTAAGTCTTATTTGTATCATCTTTTACACCTATTTTATATATTCTCATCATTCCCTGAGGCATT\n",
"GCAATGACATTTTGGTTTCGGGTTTCCCTACAATTTCTCCTTTACTGTTGACCTTCAGAGACCCTAAGGG\n",
"TCCTTGTTCTGTGTTCTTCAACTGTTGAAAGCCAGAGTCACTAAAAATGCCAAACACAGAAGACAGCTTT\n",
"GCTAATACCATTAAATACTTTATTCCATAAATATGTTTTTAAAAGCTTGTATGAACAAGGTATGGTGCTC\n",
"ACTGCTATACTTATAAAAGAGTAAGGTTATAATCACTTGTTGATATGAAAAGATTTCTGGTTGGAATCTG\n",
"ATTGAAACAGTGAGTTATTCACCACCCTCCATTCTCT\n",
"\n",
"\n"
]
}
],
"source": [
"# Print the DNA sequence \n",
"print(seq)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Replace newlines with an empty string to remove the extra characters from the sequence\n",
"seq = seq.replace(\"\\n\", \"\")"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'GGTCAGAAAAAGCCCTCTCCATGTCTACTCACGATACATCCCTGAAAACCACTGAGGAAGTGGCTTTTCAGATCATCTTGCTTTGCCAGTTTGGGGTTGGGACTTTTGCCAATGTATTTCTCTTTGTCTATAATTTCTCTCCAATCTCGACTGGTTCTAAACAGAGGCCCAGACAAGTGATTTTAAGACACATGGCTGTGGCCAATGCCTTAACTCTCTTCCTCACTATATTTCCAAACAACATGATGACTTTTGCTCCAATTATTCCTCAAACTGACCTCAAATGTAAATTAGAATTCTTCACTCGCCTCGTGGCAAGAAGCACAAACTTGTGTTCAACTTGTGTTCTGAGTATCCATCAGTTTGTCACACTTGTTCCTGTTAATTCAGGTAAAGGAATACTCAGAGCAAGTGTCACAAACATGGCAAGTTATTCTTGTTACAGTTGTTGGTTCTTCAGTGTCTTAAATAACATCTACATTCCAATTAAGGTCACTGGTCCACAGTTAACAGACAATAACAATAACTCTAAAAGCAAGTTGTTCTGTTCCACTTCTGATTTCAGTGTAGGCATTGTCTTCTTGAGGTTTGCCCATGATGCCACATTCATGAGCATCATGGTCTGGACCAGTGTCTCCATGGTACTTCTCCTCCATAGACATTGTCAGAGAATGCAGTACATATTCACTCTCAATCAGGACCCCAGGGGCCAAGCAGAGACCACAGCAACCCATACTATCCTGATGCTGGTAGTCACATTTGTTGGCTTTTATCTTCTAAGTCTTATTTGTATCATCTTTTACACCTATTTTATATATTCTCATCATTCCCTGAGGCATTGCAATGACATTTTGGTTTCGGGTTTCCCTACAATTTCTCCTTTACTGTTGACCTTCAGAGACCCTAAGGGTCCTTGTTCTGTGTTCTTCAACTGTTGAAAGCCAGAGTCACTAAAAATGCCAAACACAGAAGACAGCTTTGCTAATACCATTAAATACTTTATTCCATAAATATGTTTTTAAAAGCTTGTATGAACAAGGTATGGTGCTCACTGCTATACTTATAAAAGAGTAAGGTTATAATCACTTGTTGATATGAAAAGATTTCTGGTTGGAATCTGATTGAAACAGTGAGTTATTCACCACCCTCCATTCTCT'"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"seq"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1157"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check sequence length again! \n",
"len(seq)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's correct! The length now matches the one reported by NCBI."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GGTCAGAAAAAGCCCTCTCCATGTCTACTCACGATACATCCCTGAAAACCACTGAGGAAGTGGCTTTTCAGATCATCTTGCTTTGCCAGTTTGGGGTTGGGACTTTTGCCAATGTATTTCTCTTTGTCTATAATTTCTCTCCAATCTCGACTGGTTCTAAACAGAGGCCCAGACAAGTGATTTTAAGACACATGGCTGTGGCCAATGCCTTAACTCTCTTCCTCACTATATTTCCAAACAACATGATGACTTTTGCTCCAATTATTCCTCAAACTGACCTCAAATGTAAATTAGAATTCTTCACTCGCCTCGTGGCAAGAAGCACAAACTTGTGTTCAACTTGTGTTCTGAGTATCCATCAGTTTGTCACACTTGTTCCTGTTAATTCAGGTAAAGGAATACTCAGAGCAAGTGTCACAAACATGGCAAGTTATTCTTGTTACAGTTGTTGGTTCTTCAGTGTCTTAAATAACATCTACATTCCAATTAAGGTCACTGGTCCACAGTTAACAGACAATAACAATAACTCTAAAAGCAAGTTGTTCTGTTCCACTTCTGATTTCAGTGTAGGCATTGTCTTCTTGAGGTTTGCCCATGATGCCACATTCATGAGCATCATGGTCTGGACCAGTGTCTCCATGGTACTTCTCCTCCATAGACATTGTCAGAGAATGCAGTACATATTCACTCTCAATCAGGACCCCAGGGGCCAAGCAGAGACCACAGCAACCCATACTATCCTGATGCTGGTAGTCACATTTGTTGGCTTTTATCTTCTAAGTCTTATTTGTATCATCTTTTACACCTATTTTATATATTCTCATCATTCCCTGAGGCATTGCAATGACATTTTGGTTTCGGGTTTCCCTACAATTTCTCCTTTACTGTTGACCTTCAGAGACCCTAAGGGTCCTTGTTCTGTGTTCTTCAACTGTTGAAAGCCAGAGTCACTAAAAATGCCAAACACAGAAGACAGCTTTGCTAATACCATTAAATACTTTATTCCATAAATATGTTTTTAAAAGCTTGTATGAACAAGGTATGGTGCTCACTGCTATACTTATAAAAGAGTAAGGTTATAATCACTTGTTGATATGAAAAGATTTCTGGTTGGAATCTGATTGAAACAGTGAGTTATTCACCACCCTCCATTCTCT\n"
]
}
],
"source": [
"# To see sequence again\n",
"print(seq)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"# Replace carriage returns with an empty string to remove any remaining extra characters\n",
"seq = seq.replace(\"\\r\", \"\")"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1157"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(seq)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GGTCAGAAAAAGCCCTCTCCATGTCTACTCACGATACATCCCTGAAAACCACTGAGGAAGTGGCTTTTCAGATCATCTTGCTTTGCCAGTTTGGGGTTGGGACTTTTGCCAATGTATTTCTCTTTGTCTATAATTTCTCTCCAATCTCGACTGGTTCTAAACAGAGGCCCAGACAAGTGATTTTAAGACACATGGCTGTGGCCAATGCCTTAACTCTCTTCCTCACTATATTTCCAAACAACATGATGACTTTTGCTCCAATTATTCCTCAAACTGACCTCAAATGTAAATTAGAATTCTTCACTCGCCTCGTGGCAAGAAGCACAAACTTGTGTTCAACTTGTGTTCTGAGTATCCATCAGTTTGTCACACTTGTTCCTGTTAATTCAGGTAAAGGAATACTCAGAGCAAGTGTCACAAACATGGCAAGTTATTCTTGTTACAGTTGTTGGTTCTTCAGTGTCTTAAATAACATCTACATTCCAATTAAGGTCACTGGTCCACAGTTAACAGACAATAACAATAACTCTAAAAGCAAGTTGTTCTGTTCCACTTCTGATTTCAGTGTAGGCATTGTCTTCTTGAGGTTTGCCCATGATGCCACATTCATGAGCATCATGGTCTGGACCAGTGTCTCCATGGTACTTCTCCTCCATAGACATTGTCAGAGAATGCAGTACATATTCACTCTCAATCAGGACCCCAGGGGCCAAGCAGAGACCACAGCAACCCATACTATCCTGATGCTGGTAGTCACATTTGTTGGCTTTTATCTTCTAAGTCTTATTTGTATCATCTTTTACACCTATTTTATATATTCTCATCATTCCCTGAGGCATTGCAATGACATTTTGGTTTCGGGTTTCCCTACAATTTCTCCTTTACTGTTGACCTTCAGAGACCCTAAGGGTCCTTGTTCTGTGTTCTTCAACTGTTGAAAGCCAGAGTCACTAAAAATGCCAAACACAGAAGACAGCTTTGCTAATACCATTAAATACTTTATTCCATAAATATGTTTTTAAAAGCTTGTATGAACAAGGTATGGTGCTCACTGCTATACTTATAAAAGAGTAAGGTTATAATCACTTGTTGATATGAAAAGATTTCTGGTTGGAATCTGATTGAAACAGTGAGTTATTCACCACCCTCCATTCTCT\n"
]
}
],
"source": [
"print(seq)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# Codon table for translation\n",
"table = { \n",
" 'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M', \n",
" 'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T', \n",
" 'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', \n",
" 'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R', \n",
" 'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L', \n",
" 'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P', \n",
" 'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q', \n",
" 'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R', \n",
" 'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', \n",
" 'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A', \n",
" 'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E', \n",
" 'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G', \n",
" 'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S', \n",
" 'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L', \n",
" 'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_', \n",
" 'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W', \n",
" }"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# To see the table \n",
"table"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'CAA' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-14-0fb9e6e3dc2d>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# Lookup\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mtable\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mCAA\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mNameError\u001b[0m: name 'CAA' is not defined"
]
}
],
"source": [
"# Look up / extract information from the codon table\n",
"table[CAA]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The error occurs because \"CAA\" is a string, but we tried to use it without quotes. Python therefore treats CAA as a variable name, finds no such variable, and raises a NameError."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Q'"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Look up \"CAA\"\n",
"table[\"CAA\"]"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'P'"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"table[\"CCT\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task-3: Create an Algorithm\n",
"- Step-1: Check the length of sequence is divisible by 3.\n",
"- Step-2: Look each 3 letter strings in table and store result.\n",
"- Step-3: Continue lookups untill reading end of sequence."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"protein = \"\"\n",
"if len(seq) % 3 == 0:\n",
"    for i in range(0, len(seq), 3):\n",
"        codon = seq[i:i+3]\n",
"        protein += table[codon]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a function to translate DNA sequence into protein."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"def translate(seq):\n",
"    \"\"\"\n",
"    Translate a DNA sequence (a string) into a protein sequence using the codon table below. Returns an empty string if the length of the sequence is not divisible by 3.\n",
"    \"\"\"\n",
"\n",
"    table = {\n",
"        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',\n",
"        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',\n",
"        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',\n",
"        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',\n",
"        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',\n",
"        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',\n",
"        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',\n",
"        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',\n",
"        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',\n",
"        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',\n",
"        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',\n",
"        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',\n",
"        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',\n",
"        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',\n",
"        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',\n",
"        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',\n",
"    }\n",
"\n",
"    protein = \"\"\n",
"    if len(seq) % 3 == 0:\n",
"        for i in range(0, len(seq), 3):\n",
"            codon = seq[i:i+3]\n",
"            protein += table[codon]\n",
"    return protein\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'I'"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Test translate function\n",
"translate(\"ATA\")"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'S'"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"translate(\"TCA\")"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"def translate(seq):\n",
"    \"\"\"\n",
"    Translate a DNA sequence (a string) into a protein sequence using the codon table below. Returns an empty string if the length of the sequence is not divisible by 3.\n",
"    \"\"\"\n",
"\n",
"    table = {\n",
"        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',\n",
"        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',\n",
"        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',\n",
"        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',\n",
"        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',\n",
"        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',\n",
"        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',\n",
"        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',\n",
"        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',\n",
"        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',\n",
"        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',\n",
"        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',\n",
"        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',\n",
"        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',\n",
"        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',\n",
"        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',\n",
"    }\n",
"\n",
"    protein = \"\"\n",
"    if len(seq) % 3 == 0:\n",
"        for i in range(0, len(seq), 3):\n",
"            codon = seq[i:i+3]\n",
"            protein += table[codon]\n",
"    return protein\n"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Help on function translate in module __main__:\n",
"\n",
"translate(seq)\n",
"    Translate a DNA sequence (a string) into a protein sequence using the codon table below. Returns an empty string if the length of the sequence is not divisible by 3.\n",
"\n"
]
}
],
"source": [
"help(translate)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'A'"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"translate(\"GCC\")"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'CCTGAAAACC'"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Slicing sequence \n",
"seq[40:50]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a function to read sequence file "
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"def read_seq(inputfile):\n",
"    \"\"\"Reads and returns the input sequences with special characters removed\"\"\"\n",
"    with open(inputfile, \"r\") as f:\n",
"        seq = f.read()\n",
"    seq = seq.replace(\"\\n\", \"\")\n",
"    seq = seq.replace(\"\\r\", \"\")\n",
"\n",
"    return seq"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"# Read DNA sequence by using read_seq() function\n",
"dna = read_seq(\"../data/dna.txt\")"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"''"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Call translate function\n",
"translate(dna)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" >Python gives us an empty string because **the sequence length is not divisible by 3**. On NCBI the translated region starts at position 21 and ends at position 938. Python indexing starts at 0 and a slice excludes its end index, so in Python the translation should start at index 20 and end at 938, i.e. `dna[20:938]`."
]
},
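{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch of the coordinate conversion described above, under the assumption that NCBI reports the coding region with 1-based, inclusive positions (21 to 938 here). The helper `cds_slice` is a hypothetical name introduced only for illustration; it is not part of the original notebook or of any library."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical helper: convert 1-based inclusive coordinates to a Python slice\n",
"def cds_slice(start, end):\n",
"    # Python slices are 0-based and exclude the end index,\n",
"    # so only the start needs to be shifted by one\n",
"    return slice(start - 1, end)\n",
"\n",
"cds = cds_slice(21, 938)\n",
"# The coding region must contain a whole number of codons,\n",
"# so the last value printed should be 0\n",
"print(cds.start, cds.stop, (cds.stop - cds.start) % 3)"
]
},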
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# The empty string is caused by the length, which is not divisible by 3\n",
"len(seq) % 3 "
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC_'"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Translate the region from index 20 to 938\n",
"translate(dna[20:938])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task-4: Comparison"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"# Read protein sequence\n",
"prt = read_seq(\"../data/protein.txt\")\n"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC'"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Not identical to the translation above because of the stop codon (the trailing '_')\n",
"prt"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC'"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"translate(dna[20:935])"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'MSTHDTSLKTTEEVAFQIILLCQFGVGTFANVFLFVYNFSPISTGSKQRPRQVILRHMAVANALTLFLTIFPNNMMTFAPIIPQTDLKCKLEFFTRLVARSTNLCSTCVLSIHQFVTLVPVNSGKGILRASVTNMASYSCYSCWFFSVLNNIYIPIKVTGPQLTDNNNNSKSKLFCSTSDFSVGIVFLRFAHDATFMSIMVWTSVSMVLLLHRHCQRMQYIFTLNQDPRGQAETTATHTILMLVVTFVGFYLLSLICIIFYTYFIYSHHSLRHCNDILVSGFPTISPLLLTFRDPKGPCSVFFNC'"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prt "
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Comparison\n",
"prt == translate(dna[20:935])"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Exclude the last character (the stop codon '_')\n",
"prt == translate(dna[20:938])[:-1]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autoclose": false,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}