Skip to content

Instantly share code, notes, and snippets.

@fjossinet
Last active August 29, 2015 13:56
Show Gist options
  • Save fjossinet/9035788 to your computer and use it in GitHub Desktop.
Save fjossinet/9035788 to your computer and use it in GitHub Desktop.
Create and manipulate secondary structures with PyRNA
{
"metadata": {
"name": "Create and manipulate secondary structures."
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "PyRNA stores and describes an RNA secondary structure using two different data structures:\n\n* a pandas Dataframe\n* a pyrna.features.SecondaryStructure object\n\nThe pandas Dataframe lists all the base pairs (a.k.a interactions) making the secondary structure. If a pandas Dataframe restricts the exploration of a secondary structure to the level of the base pairs, a pyrna.features.SecondaryStructure gives us access to high-level objects like: \n\n* helices,\n* single-strands,\n* junctions.\n\nIn both data structures, the edges of the base pairs are described with three different characters:\n\n * '(' or ')' for Watson-Crick edges,\n * '[' or ']' for Hoogsteen edges,\n * '{' or '}' for Sugar edges.\n \nA base pair can also be in a cis ('c') or trans ('t') orientation."
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": "Creation of secondary structures from scratch"
},
{
"cell_type": "code",
"collapsed": false,
"input": "import json #to have a better output for dict describing pyrna.features.RNA objects (interactions, helices, single-strands,..)\n\nfrom pyrna.parsers import parse_bn\nrna = RNA(name = 'my_rna', sequence = 'GGGGGACAACCCC')\nbn = '(({(.....))))'\nbase_pairs = parse_bn(bn)\nprint base_pairs",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": " orientation edge1 edge2 pos1 pos2\n0 c ( ) 4 10\n1 c { ) 3 11\n2 c ( ) 2 12\n3 c ( ) 1 13\n\n[4 rows x 5 columns]\n"
}
],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now we transform this list of base-pairs into a pyrna.features.SecondaryStructure object."
},
{
"cell_type": "code",
"collapsed": false,
"input": "from pyrna.parsers import base_pairs_to_secondary_structure\nss = base_pairs_to_secondary_structure(rna, base_pairs)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now we can browse over the helices. An helix contains a field to store non-canonical base pairs (meaning base pairs different from c() AU, c() GU or c() GC. "
},
{
"cell_type": "code",
"collapsed": false,
"input": "for helix in ss.helices:\n print json.dumps(helix, indent = 2)\n print \"\\nAny non canonical base pairs in this helix?\\n\"\n for interaction in helix['interactions']:\n print json.dumps(interaction, indent = 2) ",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "{\n \"length\": 4, \n \"interactions\": [\n {\n \"edge1\": \"{\", \n \"edge2\": \")\", \n \"orientation\": \"c\", \n \"location\": [\n [\n 3, \n 3\n ], \n [\n 11, \n 11\n ]\n ]\n }\n ], \n \"name\": \"H1\", \n \"location\": [\n [\n 1, \n 4\n ], \n [\n 10, \n 13\n ]\n ]\n}\n\nAny non canonical base pairs in this helix?\n\n{\n \"edge1\": \"{\", \n \"edge2\": \")\", \n \"orientation\": \"c\", \n \"location\": [\n [\n 3, \n 3\n ], \n [\n 11, \n 11\n ]\n ]\n}\n"
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": "or over the single-strands..."
},
{
"cell_type": "code",
"collapsed": false,
"input": "for single_strand in ss.single_strands:\n print json.dumps(single_strand, indent = 2)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "{\n \"name\": \"SS1\", \n \"location\": [\n 5, \n 9\n ]\n}\n"
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Any high-level object from a pyrna.features.SecondaryStructure is described with a \"location\". A location is a N x 2 matrix where N is the number of blocks of contiguous residues, each block being described with its start and end positions.\n\nAn helix location will be a matrix like [ [1,4] , [10,13] ], meaning that the first strand is between residues 1 and 4, and the second strand is between residues 10 and 13.\n\nA base pair location will be a matrix like [ [3,3] , [11,11] ], meaning a base pair between residues 3 and 11."
},
{
"cell_type": "markdown",
"metadata": {},
"source": "From a pyrna.features.SecondaryStructure object, we can easily go back to a list of base pairs stored in a pandas Dataframe."
},
{
"cell_type": "code",
"collapsed": false,
"input": "from pyrna.parsers import secondary_structure_to_base_pairs\nprint secondary_structure_to_base_pairs(ss)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": " edge1 edge2 orientation pos1 pos2\n0 ( ) c 1 13\n1 ( ) c 2 12\n2 { ) c 3 11\n3 ( ) c 4 10\n\n[4 rows x 5 columns]\n"
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now we can create a new secondary structure containing a single tertiary interaction. A tertiary interaction is defined as a base pair which is not contiguous to another base-pair."
},
{
"cell_type": "code",
"collapsed": false,
"input": "rna = RNA(name = 'my_rna', sequence = 'GGGGGACGCAGTAACCCC')\nbn = '(({(.(.....)..))))'\nbase_pairs = parse_bn(bn)\nprint base_pairs",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": " orientation edge1 edge2 pos1 pos2\n0 c ( ) 6 12\n1 c ( ) 4 15\n2 c { ) 3 16\n3 c ( ) 2 17\n4 c ( ) 1 18\n\n[5 rows x 5 columns]\n"
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The single tertiary interaction is stored in a pyrna.features.SecondaryStructure object."
},
{
"cell_type": "code",
"collapsed": false,
"input": "ss = base_pairs_to_secondary_structure(rna, base_pairs)\nfor tertiary_interaction in ss.tertiary_interactions:\n print json.dumps(tertiary_interaction, indent = 2)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "{\n \"edge1\": \"(\", \n \"edge2\": \")\", \n \"orientation\": \"c\", \n \"location\": [\n [\n 6, \n 6\n ], \n [\n 12, \n 12\n ]\n ]\n}\n"
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": "If we want to go back to a list of base pairs in a pandas Dataframe, we need to precise if we want to keep the tertiary interactions..."
},
{
"cell_type": "code",
"collapsed": false,
"input": "print secondary_structure_to_base_pairs(ss, keep_tertiaries=True)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": " edge1 edge2 orientation pos1 pos2\n0 ( ) c 1 18\n1 ( ) c 2 17\n2 { ) c 3 16\n3 ( ) c 4 15\n4 ( ) c 6 12\n\n[5 rows x 5 columns]\n"
}
],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": "...or not"
},
{
"cell_type": "code",
"collapsed": false,
"input": "print secondary_structure_to_base_pairs(ss)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": " edge1 edge2 orientation pos1 pos2\n0 ( ) c 1 18\n1 ( ) c 2 17\n2 { ) c 3 16\n3 ( ) c 4 15\n\n[4 rows x 5 columns]\n"
}
],
"prompt_number": 9
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": "Creation of secondary structures from files"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "PyRNA is able to parse several file formats describing secondary structures. "
},
{
"cell_type": "code",
"collapsed": false,
"input": "from pyrna.parsers import parse_vienna\n\nh = open('data/ft3100_2D_with_bracket_notation.fasta')\nvienna_content = h.read()\nh.close()\n\nprint vienna_content",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": ">ft3100 FANTOM3\nTAACAATCTGCTGAAAGGTACCGTCGGAGGGAGCTTTGTTGCCAGCGCCA\nGAAACGCCGGTTTAACCAGCGCCGAAGTGAGCGCAGTGATTAAAGCCATG\nCAGTGGCAAATGGATTTCCGCAAACTGAAAAAAGGCGATGAATTTGCGGT\n.......((((.....(((((((.((...((.(((........))).)).\n....)).))))...)))..((((..(((..(((....((((...(((((.\n..))))).....))))..)))..))).......))))........)))).\n"
}
],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": "In a Vienna file, several molecules and bracket notations can be stored. Consequently, the function parse_vienna() returns a list of RNA objects and a list of pandas Dataframes."
},
{
"cell_type": "code",
"collapsed": false,
"input": "all_molecules, all_base_pairs = parse_vienna(vienna_content)\n\nfor base_pairs in all_base_pairs:\n print base_pairs",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": " orientation edge1 edge2 pos1 pos2\n0 c ( ) 35 44\n1 c ( ) 34 45\n2 c ( ) 33 46\n3 c ( ) 31 48\n4 c ( ) 30 49\n5 c ( ) 26 55\n6 c ( ) 25 56\n7 c ( ) 23 58\n8 c ( ) 22 59\n9 c ( ) 21 60\n10 c ( ) 20 61\n11 c ( ) 19 65\n12 c ( ) 18 66\n13 c ( ) 17 67\n14 c ( ) 99 103\n15 c ( ) 98 104\n16 c ( ) 97 105\n17 c ( ) 96 106\n18 c ( ) 95 107\n19 c ( ) 91 113\n20 c ( ) 90 114\n21 c ( ) 89 115\n22 c ( ) 88 116\n23 c ( ) 83 119\n24 c ( ) 82 120\n25 c ( ) 81 121\n26 c ( ) 78 124\n27 c ( ) 77 125\n28 c ( ) 76 126\n29 c ( ) 73 134\n30 c ( ) 72 135\n31 c ( ) 71 136\n32 c ( ) 70 137\n33 c ( ) 11 146\n34 c ( ) 10 147\n35 c ( ) 9 148\n36 c ( ) 8 149\n\n[37 rows x 5 columns]\n"
}
],
"prompt_number": 11
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": "Creation of secondary structures from databases"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "PyRNA allows to recover Rfam entries directly from the database website."
},
{
"cell_type": "code",
"collapsed": false,
"input": "from pyrna.db import Rfam\nrfam = Rfam(use_website = True)",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The function get_entry() returns:\n \n* a list of gapped RNA objects\n* a dictionary to map species labels to name/start-end\n* a pandas Dataframe listing the paired columns in the Rfam alignment (a.k.a a consensus secondary structure)"
},
{
"cell_type": "code",
"collapsed": false,
"input": "gapped_rnas, organism_names_2_nse, consensus_2d = rfam.get_entry(rfam_id='RF00059')\n\nfor gapped_rna in gapped_rnas[:10]:\n print \"sequence for %s: %s\"%(gapped_rna.name, gapped_rna.sequence)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "sequence for M.leprae.1: AAAAAACCACGCGGGAGCACACACCA------------------------------------------------------------------------------------------------------------------------AGUGC-GCUGAGAGGACGGAUCGGGG------------------------------------------------------------------------------CCGUC-GA-CCGUAUGA-ACCUGA--CCGGGUAAUGCCGGCG-UAGGGA-GAUGAAUAAUGACUGAAA\nsequence for M.leprae.2: ACUUCAGUACACGGGAGUCCCCAGGCAGUCCGAAUCU-------------------------------------------------------------------------------------------------------------GCGGG-GCUGAGAGUGGGCACAGCCAGAGUGACUGAACCU----------------------------------------------------------------ACCCUUA--CCCUCCCACACCUGA-UCCGGGUCAUGCCGGCG-AAGGGA-GGUUCAAGAUGCCGUCAG\nsequence for Sinorhizobium_melilo.1: GCUGCUCUAACGGGGUGCCCUGGCCGGCUUUGCGA---------------------------------------------------------------------------------------------------------------CCAUG-GCUGAGAGGCUUC--------------------------------------------------------------------------------------GAGCCAA-CCCGCGGA-ACCUGA-UCCGGCUCAUACCGGCG-GAGGGA--UUAGAAGCGAUCGGACC\nsequence for S.pneumoniae.1: UAAAGACAUUUGGGGUGCUU--------------------------------------------------------------------------------------------------------------------------------UAG-GCUGAGAU----------------------------------------------------------------------------------------------GAUA-CCCAUUGA-ACCUGA-UACAGUUAAGACUGGCG-AAGGGA--AAUGUGAAAUGUUUUUU\nsequence for Lactococcus_lactis_s.1: UAUUUGCACAAUGGGUCUAUUGACAAGUCUG-------------------------------------------------------------------------------------------------------------------UCAGU-AGCGAGAA-----------------------------------------------------------------------------------------------AUA-CCAUCUG--ACCUGA-UCUGGGUAAUGCCAGCG-UAGGAA--UGUGUUAAGACAAUCGU\nsequence for Corynebacterium_glut.1: AUACUAGGCACGGGGUGCCAACCGGAUGGAAAAAUUC-------------------------------------------------------------------------------------------------------------CGGAG-GCUGAGAAA----------------------------------------------------------------------------------------------ACA-CCCGUUGA-ACCUGC-UCUAGCUCGUACUAGCG-AAGGGA--UGGCCUUAACGUGGCUA\nsequence for C.callunae.1: GGCAGUCCCCACGGGCGCCCGAGC---------------------------------------------------------------------------------------------------------------------------ACGG-GCUGAGAUCGCGCUGAUUUUG-----------------------------------------------------------------------------CGCGAGCA-CCGUUUGA-ACCUG--UCCGGUUAGCACCGGCG-AAGGAA-GAGAGGAAUGGUGCAAUG\nsequence for Bacillus_subtilis_su.2: GAUUCAUCCUAGGGGUGCUUUG-----------------------------------------------------------------------------------------------------------------------------CGAA-GCUGAGAGAGACUU------------------------------------------------------------------------------------UGUCUCAA-CCCUUUUG-ACCUGA-UCUGGAUCAUGCCAGCG-GAGGGA--AGCGGUGAAAAGCGGAG\nsequence for Bacillus_subtilis_su.3: GAUGACCACAAGGGGAGCAUU-------------------------------------------------------------------------------------------------------------------------------AAA-GCUGAGAGUGAGCGGUUU--------------------------------------------------------------------------------CGUUCUGA-CCCUUUGA-ACCUG--U-UAGUUAACGCUGGCG-UAGGGA--UGUGGCAAAGUCAAAUG\nsequence for Bacillus_subtilis_su.1: UUUAACCACUAGGGGUGUCCUUCA--------------------------------------------------------------------------------------------------------------------------UAAGG-GCUGAGAUAAAAGUGUGA--------------------------------------------------------------------------------CUUUUAGA-CCCUCAUA-ACUUGA-ACAGGUUCAGACCUGCG-UAGGGA--AGUGGAGCGGUAUUUGU\n"
}
],
"prompt_number": 26
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The function consensus2d_to_base_pairs() allows to easily infer the secondary structure for a given RNA sequence"
},
{
"cell_type": "code",
"collapsed": false,
"input": "from pyrna.parsers import consensus2d_to_base_pairs, to_bn\n\nfor gapped_rna in gapped_rnas[:10]:\n rna = RNA(name = gapped_rna.name, sequence = gapped_rna.sequence.replace('-',''))\n print rna.name\n print rna.sequence\n print to_bn(consensus2d_to_base_pairs(gapped_rna, consensus_2d), len(rna))",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "M.leprae.1\nAAAAAACCACGCGGGAGCACACACCAAGUGCGCUGAGAGGACGGAUCGGGGCCGUCGACCGUAUGAACCUGACCGGGUAAUGCCGGCGUAGGGAGAUGAAUAAUGACUGAAA\n.....(((((((((..(((((.......)))))......(((((.......)))))..)))).....((((.((((......))))..))))...)))))............"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\nM.leprae.2\nACUUCAGUACACGGGAGUCCCCAGGCAGUCCGAAUCUGCGGGGCUGAGAGUGGGCACAGCCAGAGUGACUGAACCUACCCUUACCCUCCCACACCUGAUCCGGGUCAUGCCGGCGAAGGGAGGUUCAAGAUGCCGUCAG\n.....(((((((((..(((((..................))))).....((((((.....................)))))).))))......((((..((((......))))..))))...)))))............"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\nSinorhizobium_melilo.1\nGCUGCUCUAACGGGGUGCCCUGGCCGGCUUUGCGACCAUGGCUGAGAGGCUUCGAGCCAACCCGCGGAACCUGAUCCGGCUCAUACCGGCGGAGGGAUUAGAAGCGAUCGGACC\n.....(((((((((..(((((................))))).....(((((.)))))..)))).....((((..((((......))))..))))..)))))............"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\nS.pneumoniae.1\nUAAAGACAUUUGGGGUGCUUUAGGCUGAGAUGAUACCCAUUGAACCUGAUACAGUUAAGACUGGCGAAGGGAAAUGUGAAAUGUUUUUU\n.....(((((((((..((((.)))).....(.)..)))).....((((..((((......))))..))))..)))))............"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\nLactococcus_lactis_s.1\nUAUUUGCACAAUGGGUCUAUUGACAAGUCUGUCAGUAGCGAGAAAUACCAUCUGACCUGAUCUGGGUAAUGCCAGCGUAGGAAUGUGUUAAGACAAUCGU\n.....(((((((((..(((((............))))).....()..))))....((((..((((......))))..))))..)))))............"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\nCorynebacterium_glut.1\nAUACUAGGCACGGGGUGCCAACCGGAUGGAAAAAUUCCGGAGGCUGAGAAAACACCCGUUGAACCUGCUCUAGCUCGUACUAGCGAAGGGAUGGCCUUAACGUGGCUA\n.....(((((((((..(((((..................))))).....(.)..)))).....((((..((((......))))..))))..)))))............"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\nC.callunae.1\nGGCAGUCCCCACGGGCGCCCGAGCACGGGCUGAGAUCGCGCUGAUUUUGCGCGAGCACCGUUUGAACCUGUCCGGUUAGCACCGGCGAAGGAAGAGAGGAAUGGUGCAAUG\n.....(((((((((..(((((....))))).....((((((........))))))..)))).....((((.((((......))))..))))...)))))............"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\nBacillus_subtilis_su.2\nGAUUCAUCCUAGGGGUGCUUUGCGAAGCUGAGAGAGACUUUGUCUCAACCCUUUUGACCUGAUCUGGAUCAUGCCAGCGGAGGGAAGCGGUGAAAAGCGGAG\n.....(((((((((..(((((..))))).....((((((.))))))..)))).....((((..((((......))))..))))..)))))............"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\nBacillus_subtilis_su.3\nGAUGACCACAAGGGGAGCAUUAAAGCUGAGAGUGAGCGGUUUCGUUCUGACCCUUUGAACCUGUUAGUUAACGCUGGCGUAGGGAUGUGGCAAAGUCAAAUG\n.....(((((((((..((((())))).....((((((.....))))))..)))).....((((.(((......)))...))))..)))))............"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\nBacillus_subtilis_su.1\nUUUAACCACUAGGGGUGUCCUUCAUAAGGGCUGAGAUAAAAGUGUGACUUUUAGACCCUCAUAACUUGAACAGGUUCAGACCUGCGUAGGGAAGUGGAGCGGUAUUUGU\n.....(((((((((..(((((.....))))).....((((((.....))))))..)))).....((((..((((......))))..))))..)))))............"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n"
}
],
"prompt_number": 14
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": "Creation of secondary structures from algorithms"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "To run computations on my own server, we need an api_key. This is a security to avoid flood from bots. This api_key will be used to instantiate classes from the module pyrna.computations. Without this key, the server will return an unauthorized error."
},
{
"cell_type": "code",
"collapsed": false,
"input": "from pyrna.computations import get_api_key\nrest_server = \"arn-ibmc.in2p3.fr\"\napi_key = get_api_key(rest_server)\nprint api_key",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "DY6JE44YOY\n"
}
],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now we can now launch some computations. First, we can predict a secondary structure from an RNA molecule. The secondary structure will be described in a pandas Dataframe."
},
{
"cell_type": "code",
"collapsed": false,
"input": "from pyrna.computations import Rnafold\nfrom pyrna.features import RNA\nrna = RNA(name = 'my_rna', sequence = 'GGGGTAGGGACGGTAGGGGGACGCAGTGCAGTAACGTACCCGGTAGGGGGTAGGGGGACGCAGTAACCCCGGGGACGCAGTAACCCCACGCAGTAACCCC')\nrnafold = Rnafold(rest_server = rest_server, api_key = api_key) #if we do Rnafold(), the algorithm will be launched locally\nbase_pairs = rnafold.fold(rna)\nprint base_pairs",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": " orientation edge1 edge2 pos1 pos2\n0 c ( ) 41 48\n1 c ( ) 40 49\n2 c ( ) 39 50\n3 c ( ) 38 51\n4 c ( ) 37 52\n5 c ( ) 56 67\n6 c ( ) 55 68\n7 c ( ) 54 69\n8 c ( ) 53 70\n9 c ( ) 29 77\n10 c ( ) 28 78\n11 c ( ) 27 79\n12 c ( ) 24 80\n13 c ( ) 23 81\n14 c ( ) 20 84\n15 c ( ) 19 85\n16 c ( ) 18 86\n17 c ( ) 17 87\n18 c ( ) 14 90\n19 c ( ) 13 91\n20 c ( ) 11 93\n21 c ( ) 10 94\n22 c ( ) 5 96\n23 c ( ) 4 97\n24 c ( ) 3 98\n25 c ( ) 2 99\n26 c ( ) 1 100\n\n[27 rows x 5 columns]\n"
}
],
"prompt_number": 16
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Contrafold is also available for computations on my own server."
},
{
"cell_type": "code",
"collapsed": false,
"input": "from pyrna.computations import Contrafold\ncontrafold = Contrafold(rest_server = rest_server, api_key = api_key)\nbase_pairs = contrafold.fold(rna)\nprint base_pairs",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": " orientation edge1 edge2 pos1 pos2\n0 c ( ) 41 48\n1 c ( ) 40 49\n2 c ( ) 39 50\n3 c ( ) 38 51\n4 c ( ) 37 52\n5 c ( ) 70 71\n6 c ( ) 69 72\n7 c ( ) 68 73\n8 c ( ) 67 74\n9 c ( ) 36 76\n10 c ( ) 35 77\n11 c ( ) 26 78\n12 c ( ) 24 80\n13 c ( ) 23 81\n14 c ( ) 19 84\n15 c ( ) 18 85\n16 c ( ) 17 86\n17 c ( ) 16 87\n18 c ( ) 12 89\n19 c ( ) 11 90\n20 c ( ) 10 94\n21 c ( ) 5 96\n22 c ( ) 4 97\n23 c ( ) 3 98\n24 c ( ) 2 99\n25 c ( ) 1 100\n\n[26 rows x 5 columns]\n"
}
],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We can also compute the 2D coordinates for a plot."
},
{
"cell_type": "code",
"collapsed": true,
"input": "from pyrna.computations import Rnaplot\nrnaplot = Rnaplot(api_key = api_key)\nplot = rnaplot.plot(secondary_structure = base_pairs, rna = rna)\nprint plot",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": " x y\n1 65.468 -7.471\n2 65.468 7.529\n3 65.468 22.529\n4 65.468 37.529\n5 65.468 52.529\n6 58.013 57.898\n7 53.897 66.267\n8 54.204 75.728\n9 58.994 84.054\n10 67.254 89.223\n11 67.954 104.207\n12 67.954 119.207\n13 59.962 125.915\n14 57.013 136.155\n15 60.348 146.509\n16 69.069 153.376\n17 70.002 168.347\n18 70.934 183.318\n19 71.867 198.289\n20 62.010 206.746\n21 58.830 219.541\n22 63.724 231.993\n23 75.092 239.412\n24 76.490 254.347\n25 69.972 267.739\n26 78.962 280.759\n27 67.122 271.549\n28 52.140 270.819\n29 39.461 278.834\n30 33.693 292.680\n31 36.933 307.326\n32 48.004 317.448\n33 62.881 319.366\n34 76.157 312.384\n35 83.006 299.040\n36 88.039 313.170\n37 87.806 328.168\n38 82.336 342.135\n39 76.866 356.103\n40 71.396 370.070\n41 65.927 384.037\n42 51.347 390.137\n43 45.429 404.791\n44 51.683 419.306\n45 66.399 425.069\n46 80.846 418.662\n47 86.453 403.885\n48 79.894 389.507\n49 85.364 375.539\n50 90.833 361.572\n51 96.303 347.605\n52 101.773 333.638\n53 103.669 348.518\n54 105.565 363.397\n55 101.785 377.913\n56 104.727 392.622\n57 113.801 404.566\n58 127.182 411.345\n59 142.180 411.596\n60 155.780 405.269\n ... ...\n\n[100 rows x 2 columns]\n"
}
],
"prompt_number": 18
},
{
"cell_type": "markdown",
"metadata": {},
"source": "We can also compute a secondary structure by annotating a tertiary one. First, we need to import a 3D structure from the Protein Databank."
},
{
"cell_type": "code",
"collapsed": false,
"input": "from pyrna.db import PDB\nfrom pyrna.parsers import parse_pdb\npdb = PDB()\ntertiary_structures = parse_pdb(pdb.get_entry('1HR2'))",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 19
},
{
"cell_type": "markdown",
"metadata": {},
"source": "With PyRNA, a pyrna.features.TertiaryStructure object is made with a single molecular chain. Consequently, the function parse_pdb returns a list of such objects. We can iterate over these pyrna.features.TertiaryStructure objects to annotate them with the algorithm Rnaview (Rnaview is available for computations from my own server)."
},
{
"cell_type": "code",
"collapsed": false,
"input": "secondary_structures = []\nfor ts in tertiary_structures:\n #the function annotate() from Rnaview returns a pyrna.features.SecondaryStructure object and its 3D counterpart as a pyrna.features.TertiaryStructure object\n secondary_structure, tertiary_structure = Rnaview(rest_server = rest_server, api_key = api_key).annotate(ts)\n secondary_structures.append(secondary_structure)\n #the function secondary_structure_to_base_pairs() transform a pyrna.features.SecondaryStructure object into a list of base pairs stored in a pandas Dataframe\n print \"Molecular chain %s\"%secondary_structure.rna.name\n print secondary_structure_to_base_pairs(secondary_structure, keep_tertiaries = True)",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Molecular chain A\n edge1 edge2 orientation pos1 pos2\n0 ( ) c 4 112\n1 ( ) c 5 111\n2 ( ) c 6 110\n3 ( ) c 7 109\n4 ( ) c 8 108\n5 ( ) c 9 107\n6 ( ) c 10 106\n7 { ] t 11 105\n8 [ } t 12 104\n9 ( ) c 14 103\n10 ( ) c 15 102\n11 ( ) c 16 101\n12 ( ) c 17 100\n13 ( ) c 18 99\n14 ( ) c 19 98\n15 ( ) c 24 94\n16 ( ) c 25 93\n17 ( ) c 26 92\n18 ( ) c 27 91\n19 ( ) c 30 89\n20 ( ) c 31 88\n21 ( ) c 32 87\n22 ( ) c 34 80\n23 ( ) c 35 79\n24 ( ) c 36 78\n25 [ } t 37 62\n26 [ } t 38 61\n27 ( ) c 39 60\n28 ( ) c 40 59\n29 ( ) c 41 58\n30 ( ) c 42 57\n31 ( ) c 43 56\n32 ( ) c 44 55\n33 ( ) c 45 54\n34 ( ) c 46 53\n35 ( ) c 47 52\n36 ( ) c 63 73\n37 ( ) c 64 72\n38 ! ! ! 65 71\n39 ( ) c 117 150\n40 ( ) c 118 149\n41 ( ) c 119 148\n42 ( ) c 120 147\n43 ( ) c 124 144\n44 ( ) c 125 143\n45 ( ) c 126 142\n46 ( ) c 127 141\n47 ( ) c 128 140\n48 ( ) c 129 139\n49 ( ) c 130 138\n50 ( ) c 131 137\n51 ( ) c 132 136\n52 [ ) c 4 156\n53 ( ] t 21 96\n54 ( } t 22 99\n55 ( ] c 33 85\n56 { ] t 48 51\n57 ( ) t 49 145\n58 { } t 51 147\n59 [ ) t 66 86\n ... ... ... ... ...\n\n[83 rows x 5 columns]\nMolecular chain B"
},
{
"output_type": "stream",
"stream": "stdout",
"text": "\n edge1 edge2 orientation pos1 pos2\n0 ( ) c 2 115\n1 ( ) c 3 114\n2 ( ) c 4 113\n3 ( ) c 6 112\n4 ( ) c 7 111\n5 ( ) c 8 110\n6 ( ) c 9 109\n7 ( ) c 10 108\n8 ( ) c 11 107\n9 { ] t 12 106\n10 [ } t 13 105\n11 ( ) c 15 104\n12 ( ) c 16 103\n13 ( ) c 17 102\n14 ( ) c 18 101\n15 ( ) c 19 100\n16 ( ) c 20 99\n17 ( ) c 25 95\n18 ( ) c 26 94\n19 ( ) c 27 93\n20 ( ) c 28 92\n21 ( ) c 30 91\n22 ( ) c 31 90\n23 ( ) c 32 89\n24 ( ) c 33 88\n25 ( ) c 35 81\n26 ( ) c 36 80\n27 ( ) c 37 79\n28 [ } t 38 63\n29 [ } t 39 62\n30 ( ) c 40 61\n31 ( ) c 41 60\n32 ( ) c 42 59\n33 ( ) c 43 58\n34 ( ) c 44 57\n35 ( ) c 45 56\n36 ( ) c 46 55\n37 ( ) c 47 54\n38 ( ) c 48 53\n39 ( ) c 64 74\n40 ( ) c 65 73\n41 ! ! ! 66 72\n42 ( ) c 118 151\n43 ( ) c 119 150\n44 ( ) c 120 149\n45 ( ) c 121 148\n46 ( ) c 125 145\n47 ( ) c 126 144\n48 ( ) c 127 143\n49 ( ) c 128 142\n50 ( ) c 129 141\n51 ( ) c 130 140\n52 ( ) c 131 139\n53 ( ) c 132 138\n54 ( ) c 133 137\n55 ( } c 1 118\n56 [ ) c 2 154\n57 [ ) c 4 156\n58 [ ) c 6 157\n59 [ ) c 7 158\n ... ... ... ... ...\n\n[89 rows x 5 columns]\n"
}
],
"prompt_number": 20
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": "Mining of a Secondary Structure"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now we can search for all base pairs made with at least one Hoogsteen edge. Since the variable base_pairs stores a pandas Dataframe, we're using the operators available from pandas (http://pandas.pydata.org/pandas-docs/stable/indexing.html)"
},
{
"cell_type": "code",
"collapsed": false,
"input": "base_pairs = secondary_structure_to_base_pairs(secondary_structures[0], keep_tertiaries = True)\nprint base_pairs[(base_pairs['edge1'] == '[') | (base_pairs['edge2'] == ']')]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": " edge1 edge2 orientation pos1 pos2\n7 { ] t 11 105\n8 [ } t 12 104\n25 [ } t 37 62\n26 [ } t 38 61\n52 [ ) c 4 156\n53 ( ] t 21 96\n55 ( ] c 33 85\n56 { ] t 48 51\n59 [ ) t 66 86\n60 { ] c 69 70\n64 { ] c 114 115\n65 ( ] t 121 145\n66 { ] c 122 123\n\n[13 rows x 5 columns]\n"
}
],
"prompt_number": 21
},
{
"cell_type": "markdown",
"metadata": {},
"source": "A pyrna.features.SecondaryStructure can find all its junctions (meaning the group of single-strands linking helices). \n\nAccording to the number of single-strands making a junction, we distinguish:\n \n* an apical looop (a single single-strand)\n* an inner loop (two single-strands)\n* an three-way junction (three single-strands)\n* a four-way junction (four single-strands)\n* a n-way junction (n single-strands)"
},
{
"cell_type": "code",
"collapsed": false,
"input": "junctions = []\nfor ss in secondary_structures:\n ss.find_junctions()\n junctions += ss.junctions",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 22
},
{
"cell_type": "markdown",
"metadata": {},
"source": "A junction is described with:\n\n* a list of single-strands making the junction\n* a location: the positions in sequence of the residues \n* a description: a string made with the single-strand sequences joined with single space characters. In the description of a junction, the character '-' means a direct link (phosphodiester bond) between two helices. Consequently, we have no single-strand linking these helices."
},
{
"cell_type": "code",
"collapsed": false,
"input": "print junctions[0]",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "{'single_strands': [{'name': 'SS_2', 'location': [13, 13]}], 'description': 'A -', 'location': [[12, 14], [103, 104]]}\n"
}
],
"prompt_number": 23
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Now we can search for all GNRA apical loops..."
},
{
"cell_type": "code",
"collapsed": false,
"input": "import re\nfor junction in junctions:\n if re.match('G[AUGC][AG]A',junction['description']):\n print \"Apical loop with sequence %s\"%junction['description']",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Apical loop with sequence GAAA\nApical loop with sequence GAAA\n"
}
],
"prompt_number": 24
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Or for all three-way junctions..."
},
{
"cell_type": "code",
"collapsed": false,
"input": "for junction in junctions:\n if len(junction['location']) == 3:\n print \"Three-way junctions with sequence %s\"%junction['description']",
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "Three-way junctions with sequence GUAU - -\nThree-way junctions with sequence GUAU - -\n"
}
],
"prompt_number": 25
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment