Last active
April 3, 2023 15:04
(WIP) Solution for Rosalind Bioinformatics Stronghold course
{ | |
"cells": [ | |
{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Rosalind Bioinformatics Stronghold" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"DNA_ALPHABET = 'ACGT'\n", | |
"RNA_ALPHABET = 'ACGU'\n", | |
"\n", | |
"COMPLEMENT = {\n", | |
"    'A': 'T',\n", | |
"    'C': 'G',\n", | |
"    'G': 'C',\n", | |
"    'T': 'A'\n", | |
"}\n", | |
"\n", | |
"def is_DNA(nt):\n", | |
"    # True if the single character nt is a DNA nucleotide symbol.\n", | |
"    return len(nt) == 1 and nt in DNA_ALPHABET\n", | |
"\n", | |
"def is_RNA(nt):\n", | |
"    # True if the single character nt is an RNA nucleotide symbol.\n", | |
"    return len(nt) == 1 and nt in RNA_ALPHABET" | |
] | |
}, | |
{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## DNA\n", | |
"\n", | |
"### Counting DNA Nucleotides\n", | |
"\n", | |
"**Problem**\n", | |
"\n", | |
"A `string` is simply an ordered collection of symbols selected from some `alphabet` and formed into a word; the `length` of a string is the number of symbols that it contains.\n", | |
"\n", | |
"An example of a length 21 `DNA string` (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is \"ATGCTTCAGAAAGGTCTTACG.\"\n", | |
"\n", | |
"<span style=\"color:green\">Given:</span> A DNA string **s** of length at most 1000 nt.\n", | |
"\n", | |
"<span style=\"color:green\">Return:</span> Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in **s**.\n", | |
"\n", | |
"**Sample Dataset**\n", | |
"\n", | |
"```\n", | |
"AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC\n", | |
"```\n", | |
"\n", | |
"**Sample Output**\n", | |
"\n", | |
"```\n", | |
"20 12 17 21\n", | |
"```" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"'CTAGTCCGAGCGTAGCCCTTTCCCCAAGTTTATCTGCACGGCCCGAGACCCAAGCCGGTTCTTTTCAATGCCTAGGTTCTTCTTCGGCTTCGCCCGCTATGACCGCATTATTACTTACCATCCCAGCACCGGCGCCAGCGCCTGACGACTTAGATCTATTGTGTTACTGGAGTCAATAAAGTCACTTGCACGCAATTAACAGAAATTATCACGAGCGTCTAGGGCTCCACGCACAATCACGCTATCCTCCAATGCGCCATTTTGACCCGTCGTGAAGGCATTAGAACATTGGTATAGTTGCTTTCGCGACTATCCAACCGCTAGGGTCTACTCATGACTAGTGTAGACGCAGCTAGTGGAGTAGCTATTGGAATTTCCACTCACAGCGTTGCCGGTCTCACACCTGATACGCGGTTGGTCCCGCTTGAGCGAGCCGTCCTGACGGGTAGATGCGACCCCACTTAACGTTTCACCAGGAAATGGCGTCAGTCGTAAGCACTAGCTACGCTTAGGATTCTTCAGTGCGCGGGGCCGCATCCAAGTGCGGGGACTTCGAATGCTGCTTCAGAGTAATTCGGTACATTCCAAGAAGCAGGGCGGCTCACACACTCTGTACTCCGTCTAGGTGGCCGCGCGCACCGCCGAGCCTTGTGCTATTTCATGCGAGAGAAAACAATTTCTTCGGACAGTTGTTTAATCCAGCCAATTTGATATTAACAGAGCCTACTATGACGGAAACTCGTGCCATAATACCCAACTGGGGTTCATTTCTGGGGACTCCGCTGCGAGGTGCGTTTCGCGGTGATAGTGACCTGACGGCTCGCAAGTAGCTGTAACAATACCCGACGTCGGT'" | |
] | |
}, | |
"execution_count": 2, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
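], | |
"source": [ | |
"with open('rosalind_dna_1_dataset.txt', 'r') as file:\n", | |
"    # strip() drops the trailing newline without cutting off a nucleotide\n", | |
"    # when the file does not end with one.\n", | |
"    s = file.readline().strip()\n", | |
"s" | |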
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"assert isinstance(s, str), f'Error: type(s) = {type(s).__name__} must be str!'\n", | |
"assert all(map(is_DNA, s)), (\n", | |
"    'Error: Your string is not a valid DNA string! '\n", | |
"    f'It must be composed using the {DNA_ALPHABET} alphabet!'\n", | |
")\n", | |
"assert len(s) <= 1000, f'Error: len(s) = {len(s)} must be at most 1000!'" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"192 239 209 213\n" | |
] | |
} | |
], | |
"source": [ | |
"result = [s.count(nt) for nt in DNA_ALPHABET]\n", | |
"print(*result)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"with open('rosalind_dna_1_output.txt', 'w') as file:\n", | |
"    file.write(' '.join(map(str, result)))" | |
] | |
}, | |
{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## RNA\n", | |
"\n", | |
"### Transcribing DNA into RNA \n", | |
"\n", | |
"**Problem**\n", | |
"\n", | |
"An `RNA string` is a `string` formed from the `alphabet` containing 'A', 'C', 'G', and 'U'.\n", | |
"\n", | |
"Given a `DNA string` **t** corresponding to a coding strand, its transcribed `RNA string` **u** is formed by replacing all occurrences of 'T' in **t** with 'U' in **u**.\n", | |
"\n", | |
"<span style=\"color:green\">Given:</span> A `DNA string` **t** having `length` at most 1000 `nt`.\n", | |
"\n", | |
"<span style=\"color:green\">Return:</span> The transcribed RNA string of **t**.\n", | |
"\n", | |
"**Sample Dataset**\n", | |
"\n", | |
"```\n", | |
"GATGGAACTTGACTACGTAAATT\n", | |
"```\n", | |
"\n", | |
"**Sample Output**\n", | |
"\n", | |
"```\n", | |
"GAUGGAACUUGACUACGUAAAUU\n", | |
"```" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"'ACTGTTCTCACTGGAAACACACGCGAGGATTTGTGGCGTAACCGTGCCTCGGTCGCATACAAGCGACTCGACCTCCGAATTGTCATATAGCCACGTGCCACCTCTCCTAAGCAAAGGCGTAGATAGAGAATTGGCTCCATGTTATGACAACTTAACTACGTATTGGATCCCAATTGGCGTTTTGAGGGCCTATGGGATAGAGCATGCTGCATCATAGTCATCGAAACCGTTAAGCCATGGGCCCTTAAGAACAGAAAATAGTGGCCCTGGAACGGCCCAACATTAGAGAAGTCGCCCTTGACCCGCACCGAATCGGCTGCCAGCAAAATGGGCATCTACTATAGATTGAAGACCAGTCTTTGCTAGCTACATCCGAGCCGTACGCTACTAATAGCCTCACGATTTGCGCCCGTTTATAATCAGCCTGCCCGACGGTTGATGACGTCCAATTTCCTCGCCTAATAGCACTCTCAGGGAGTAATTGGATCGGCTACGGCAACGTCGATATTATGTAGGTTATCCACGTGAACTTCGCGTCGCAGTACCGACAGCGTGGTTTTACGGGAGGTTCATTCGCTTCTTTTCTGTTTCATCGGCGTTCCGATCGGACTTCAGTACAAAACTGACCTCGGTGACAAAACCGCCAATCTGAGGGGGAAATCACGAATTCAATTGTGTGAGCACTCTGTCGGCTCACACTATTGATTTTTCTTCTAGAATGTAGATACTCCTTAACTCACTCACACGCGGTGGAGGGCCTAGCAACCGTGCGTCTAACCACGATTTCTCACGACACGAGGGTGGTCATGCCACGCATAGTAACTAGCTACATTCATCGTTCTATGTTGGCAGCAGGTTAGGGACTTCCCCGATGCGTTAGCTACAAGAGCGAGAGGTTTTTCACCGGAACGAAGGAAACTTCTAATCACGTACT'" | |
] | |
}, | |
"execution_count": 6, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
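], | |
"source": [ | |
"with open('rosalind_rna_1_dataset.txt', 'r') as file:\n", | |
"    # strip() drops the trailing newline without cutting off a nucleotide\n", | |
"    # when the file does not end with one.\n", | |
"    t = file.readline().strip()\n", | |
"t" | |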
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"assert isinstance(t, str), f'Error: type(t) = {type(t).__name__} must be str!'\n", | |
"assert all(map(is_DNA, t)), (\n", | |
"    'Error: Your string is not a valid DNA string! '\n", | |
"    f'It must be composed using the {DNA_ALPHABET} alphabet!'\n", | |
")\n", | |
"assert len(t) <= 1000, f'Error: len(t) = {len(t)} must be at most 1000!'" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"'ACUGUUCUCACUGGAAACACACGCGAGGAUUUGUGGCGUAACCGUGCCUCGGUCGCAUACAAGCGACUCGACCUCCGAAUUGUCAUAUAGCCACGUGCCACCUCUCCUAAGCAAAGGCGUAGAUAGAGAAUUGGCUCCAUGUUAUGACAACUUAACUACGUAUUGGAUCCCAAUUGGCGUUUUGAGGGCCUAUGGGAUAGAGCAUGCUGCAUCAUAGUCAUCGAAACCGUUAAGCCAUGGGCCCUUAAGAACAGAAAAUAGUGGCCCUGGAACGGCCCAACAUUAGAGAAGUCGCCCUUGACCCGCACCGAAUCGGCUGCCAGCAAAAUGGGCAUCUACUAUAGAUUGAAGACCAGUCUUUGCUAGCUACAUCCGAGCCGUACGCUACUAAUAGCCUCACGAUUUGCGCCCGUUUAUAAUCAGCCUGCCCGACGGUUGAUGACGUCCAAUUUCCUCGCCUAAUAGCACUCUCAGGGAGUAAUUGGAUCGGCUACGGCAACGUCGAUAUUAUGUAGGUUAUCCACGUGAACUUCGCGUCGCAGUACCGACAGCGUGGUUUUACGGGAGGUUCAUUCGCUUCUUUUCUGUUUCAUCGGCGUUCCGAUCGGACUUCAGUACAAAACUGACCUCGGUGACAAAACCGCCAAUCUGAGGGGGAAAUCACGAAUUCAAUUGUGUGAGCACUCUGUCGGCUCACACUAUUGAUUUUUCUUCUAGAAUGUAGAUACUCCUUAACUCACUCACACGCGGUGGAGGGCCUAGCAACCGUGCGUCUAACCACGAUUUCUCACGACACGAGGGUGGUCAUGCCACGCAUAGUAACUAGCUACAUUCAUCGUUCUAUGUUGGCAGCAGGUUAGGGACUUCCCCGAUGCGUUAGCUACAAGAGCGAGAGGUUUUUCACCGGAACGAAGGAAACUUCUAAUCACGUACU'" | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"u = t.replace('T', 'U')\n", | |
"u" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"with open('rosalind_rna_1_output.txt', 'w') as file:\n", | |
"    file.write(u)" | |
] | |
}, | |
{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## REVC\n", | |
"\n", | |
"### Complementing a Strand of DNA\n", | |
"\n", | |
"**Problem**\n", | |
"\n", | |
"In `DNA strings`, `symbols` 'A' and 'T' are complements of each other, as are 'C' and 'G'.\n", | |
"\n", | |
"The `reverse complement` of a `DNA string` **s** is the string **s<sup>c</sup>** formed by reversing the symbols of **s**, then taking the complement of each symbol (e.g., the reverse complement of \"GTCA\" is \"TGAC\").\n", | |
"\n", | |
"<span style=\"color:green\">Given:</span> A DNA string **s** of length at most 1000 `bp`.\n", | |
"\n", | |
"<span style=\"color:green\">Return:</span> The reverse complement **s<sup>c</sup>** of **s**.\n", | |
"\n", | |
"**Sample Dataset**\n", | |
"\n", | |
"```\n", | |
"AAAACCCGGT\n", | |
"```\n", | |
"\n", | |
"**Sample Output**\n", | |
"\n", | |
"```\n", | |
"ACCGGGTTTT\n", | |
"```" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"'CGCCCCGCAGGTTAACTCTCTTGACTTGGGCAGGGAGGTCTCAAACTTATGGTGCGCGGATTATGTTCAGGGTGATAATTCCGACCCTGGCGGTGCGGCGATATGTAGACCGCAAACTCGCGTCGGTGGGAACAAAGACCCCTTGCTTTTTCTCTAGAGCTTCTACGAGGAGTTTAATACTCCGAGCACCCTGTGTAATTGCGTTCCCGTCCCGTGCCGGGTATACCAGTGCGTATTTGCTTTACGTCCTACAATGATGCTATGTTTACATCATTAACCAGCTGACTGCTTTATGGTGCGAACTTATTCGCGGCCGATAGAATCCAGTAGCTAGCCGTCGAGTTATTATCAAAAGATACACCAGTTTTAAGTTATCATCGTAGTCAAACCCTTGTCCCGGCCCTCTTAAACCCATCGCCGGAAGTCAGGATCCTTACGCAGTAATGGCTAACGTACCTGGGAACTTTGCTTTATGGCATAGGCCATACTGGTCTTACGAGAGGGGAACCGGCTTTTCAATGCTGCCTCCGCTAATGTTTATCGATATTAATCCAGCTTGTAGTCCAGATGTGGAATAAATTCACGCCCCCCCCCTTCGATACGTTCTCTTAAGCTACAAGCGAACTGACAACCCTATGCGAGGAGCCTTGCATTCTACTGATTCTGTACTGCTCATGAATTCGTCGGGGCTGGCGGTAAGTTCTCGGAACCATACCGTTACATACCTACGACTTTGCAAAGGGGAATTAATAGGCGCTTGTTACTCTTAGCTTCGCGCTCGTCACATGATATGAACTTCCGCAACGCGGACCCATTGGCATTGCGTGCGCTGATGTAAGTCGGAATCCAAAGTATAGGGCCCTATACTGCGGTTACTGCAATGCTGTAGGCTGTTTACAGTGGTTCTTACGGAACAGCACGCCCGCAAACTTTTCTACTGTGATATTTTCATGTAACGGAAACAGCTCGACTGAAAATGTGCTTCAC'" | |
] | |
}, | |
"execution_count": 10, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
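], | |
"source": [ | |
"with open('rosalind_revc_1_dataset.txt', 'r') as file:\n", | |
"    # strip() drops the trailing newline without cutting off a nucleotide\n", | |
"    # when the file does not end with one.\n", | |
"    s = file.readline().strip()\n", | |
"s" | |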
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"assert isinstance(s, str), f'Error: type(s) = {type(s).__name__} must be str!'\n", | |
"assert all(map(is_DNA, s)), (\n", | |
"    'Error: Your string is not a valid DNA string! '\n", | |
"    f'It must be composed using the {DNA_ALPHABET} alphabet!'\n", | |
")\n", | |
"assert len(s) <= 1000, f'Error: len(s) = {len(s)} must be at most 1000!'" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"'GTGAAGCACATTTTCAGTCGAGCTGTTTCCGTTACATGAAAATATCACAGTAGAAAAGTTTGCGGGCGTGCTGTTCCGTAAGAACCACTGTAAACAGCCTACAGCATTGCAGTAACCGCAGTATAGGGCCCTATACTTTGGATTCCGACTTACATCAGCGCACGCAATGCCAATGGGTCCGCGTTGCGGAAGTTCATATCATGTGACGAGCGCGAAGCTAAGAGTAACAAGCGCCTATTAATTCCCCTTTGCAAAGTCGTAGGTATGTAACGGTATGGTTCCGAGAACTTACCGCCAGCCCCGACGAATTCATGAGCAGTACAGAATCAGTAGAATGCAAGGCTCCTCGCATAGGGTTGTCAGTTCGCTTGTAGCTTAAGAGAACGTATCGAAGGGGGGGGGCGTGAATTTATTCCACATCTGGACTACAAGCTGGATTAATATCGATAAACATTAGCGGAGGCAGCATTGAAAAGCCGGTTCCCCTCTCGTAAGACCAGTATGGCCTATGCCATAAAGCAAAGTTCCCAGGTACGTTAGCCATTACTGCGTAAGGATCCTGACTTCCGGCGATGGGTTTAAGAGGGCCGGGACAAGGGTTTGACTACGATGATAACTTAAAACTGGTGTATCTTTTGATAATAACTCGACGGCTAGCTACTGGATTCTATCGGCCGCGAATAAGTTCGCACCATAAAGCAGTCAGCTGGTTAATGATGTAAACATAGCATCATTGTAGGACGTAAAGCAAATACGCACTGGTATACCCGGCACGGGACGGGAACGCAATTACACAGGGTGCTCGGAGTATTAAACTCCTCGTAGAAGCTCTAGAGAAAAAGCAAGGGGTCTTTGTTCCCACCGACGCGAGTTTGCGGTCTACATATCGCCGCACCGCCAGGGTCGGAATTATCACCCTGAACATAATCCGCGCACCATAAGTTTGAGACCTCCCTGCCCAAGTCAAGAGAGTTAACCTGCGGGGCG'" | |
] | |
}, | |
"execution_count": 12, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Walk s backwards and complement each symbol along the way.\n", | |
"sc = ''.join(COMPLEMENT[nt] for nt in reversed(s))\n", | |
"sc" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"with open('rosalind_revc_1_output.txt', 'w') as file:\n", | |
"    file.write(sc)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.11.2" | |
}, | |
"orig_nbformat": 4 | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |