fjossinet/gist:9035788

## gistfile1.txt
{
 "metadata": {
  "name": "Create and manipulate secondary structures."
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "PyRNA stores and describes an RNA secondary structure using two different data structures:\n\n* a pandas Dataframe\n* a pyrna.features.SecondaryStructure object\n\nThe pandas Dataframe lists all the base pairs (a.k.a interactions) making the secondary structure. If a pandas Dataframe restricts the exploration of a secondary structure to the level of the base pairs, a pyrna.features.SecondaryStructure gives us access to high-level objects like: \n\n* helices,\n* single-strands,\n* junctions.\n\nIn both data structures, the edges of the base pairs are described with three different characters:\n\n    * '(' or ')' for Watson-Crick edges,\n    * '[' or ']' for Hoogsteen edges,\n    * '{' or '}' for Sugar edges.\n    \nA base pair can also be in a cis ('c') or trans ('t') orientation."
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": "Creation of secondary structures from scratch"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "import json #to have a better output for dict describing pyrna.features.RNA objects (interactions, helices, single-strands,..)\n\nfrom pyrna.parsers import parse_bn\nrna = RNA(name = 'my_rna', sequence = 'GGGGGACAACCCC')\nbn = '(({(.....))))'\nbase_pairs = parse_bn(bn)\nprint base_pairs",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "  orientation edge1 edge2  pos1  pos2\n0           c     (     )     4    10\n1           c     {     )     3    11\n2           c     (     )     2    12\n3           c     (     )     1    13\n\n[4 rows x 5 columns]\n"
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Now we transform this list of base-pairs into a pyrna.features.SecondaryStructure object."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "from pyrna.parsers import base_pairs_to_secondary_structure\nss = base_pairs_to_secondary_structure(rna, base_pairs)",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Now we can browse over the helices. An helix contains a field to store non-canonical base pairs (meaning base pairs different from c() AU, c() GU or c() GC. "
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "for helix in ss.helices:\n    print json.dumps(helix, indent = 2)\n    print \"\\nAny non canonical base pairs in this helix?\\n\"\n    for interaction in helix['interactions']:\n        print json.dumps(interaction, indent = 2)    ",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "{\n  \"length\": 4, \n  \"interactions\": [\n    {\n      \"edge1\": \"{\", \n      \"edge2\": \")\", \n      \"orientation\": \"c\", \n      \"location\": [\n        [\n          3, \n          3\n        ], \n        [\n          11, \n          11\n        ]\n      ]\n    }\n  ], \n  \"name\": \"H1\", \n  \"location\": [\n    [\n      1, \n      4\n    ], \n    [\n      10, \n      13\n    ]\n  ]\n}\n\nAny non canonical base pairs in this helix?\n\n{\n  \"edge1\": \"{\", \n  \"edge2\": \")\", \n  \"orientation\": \"c\", \n  \"location\": [\n    [\n      3, \n      3\n    ], \n    [\n      11, \n      11\n    ]\n  ]\n}\n"
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "or over the single-strands..."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "for single_strand in ss.single_strands:\n    print json.dumps(single_strand, indent = 2)",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "{\n  \"name\": \"SS1\", \n  \"location\": [\n    5, \n    9\n  ]\n}\n"
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Any high-level object from a pyrna.features.SecondaryStructure is described with a \"location\". A location is a N x 2 matrix where N is the number of blocks of contiguous residues, each block being described with its start and end positions.\n\nAn helix location will be a matrix like [ [1,4] , [10,13] ], meaning that the first strand is between residues 1 and 4, and the second strand is between residues 10 and 13.\n\nA base pair location will be a matrix like [ [3,3] , [11,11] ], meaning a base pair between residues 3 and 11."
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "From a pyrna.features.SecondaryStructure object, we can easily go back to a list of base pairs stored in a pandas Dataframe."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "from pyrna.parsers import secondary_structure_to_base_pairs\nprint secondary_structure_to_base_pairs(ss)",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "  edge1 edge2 orientation  pos1  pos2\n0     (     )           c     1    13\n1     (     )           c     2    12\n2     {     )           c     3    11\n3     (     )           c     4    10\n\n[4 rows x 5 columns]\n"
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Now we can create a new secondary structure containing a single tertiary interaction. A tertiary interaction is defined as a base pair which is not contiguous to another base-pair."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "rna = RNA(name = 'my_rna', sequence = 'GGGGGACGCAGTAACCCC')\nbn = '(({(.(.....)..))))'\nbase_pairs = parse_bn(bn)\nprint base_pairs",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "  orientation edge1 edge2  pos1  pos2\n0           c     (     )     6    12\n1           c     (     )     4    15\n2           c     {     )     3    16\n3           c     (     )     2    17\n4           c     (     )     1    18\n\n[5 rows x 5 columns]\n"
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "The single tertiary interaction is stored in a pyrna.features.SecondaryStructure object."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "ss = base_pairs_to_secondary_structure(rna, base_pairs)\nfor tertiary_interaction in ss.tertiary_interactions:\n    print json.dumps(tertiary_interaction, indent = 2)",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "{\n  \"edge1\": \"(\", \n  \"edge2\": \")\", \n  \"orientation\": \"c\", \n  \"location\": [\n    [\n      6, \n      6\n    ], \n    [\n      12, \n      12\n    ]\n  ]\n}\n"
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "If we want to go back to a list of base pairs in a pandas Dataframe, we need to precise if we want to keep the tertiary interactions..."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "print secondary_structure_to_base_pairs(ss, keep_tertiaries=True)",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "  edge1 edge2 orientation  pos1  pos2\n0     (     )           c     1    18\n1     (     )           c     2    17\n2     {     )           c     3    16\n3     (     )           c     4    15\n4     (     )           c     6    12\n\n[5 rows x 5 columns]\n"
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "...or not"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "print secondary_structure_to_base_pairs(ss)",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "  edge1 edge2 orientation  pos1  pos2\n0     (     )           c     1    18\n1     (     )           c     2    17\n2     {     )           c     3    16\n3     (     )           c     4    15\n\n[4 rows x 5 columns]\n"
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": "Creation of secondary structures from files"
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "PyRNA is able to parse several file formats describing secondary structures. "
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "from pyrna.parsers import parse_vienna\n\nh = open('data/ft3100_2D_with_bracket_notation.fasta')\nvienna_content = h.read()\nh.close()\n\nprint vienna_content",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": ">ft3100 FANTOM3\nTAACAATCTGCTGAAAGGTACCGTCGGAGGGAGCTTTGTTGCCAGCGCCA\nGAAACGCCGGTTTAACCAGCGCCGAAGTGAGCGCAGTGATTAAAGCCATG\nCAGTGGCAAATGGATTTCCGCAAACTGAAAAAAGGCGATGAATTTGCGGT\n.......((((.....(((((((.((...((.(((........))).)).\n....)).))))...)))..((((..(((..(((....((((...(((((.\n..))))).....))))..)))..))).......))))........)))).\n"
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "In a Vienna file, several molecules and bracket notations can be stored. Consequently, the function parse_vienna() returns a list of RNA objects and a list of pandas Dataframes."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "all_molecules, all_base_pairs = parse_vienna(vienna_content)\n\nfor base_pairs in all_base_pairs:\n    print base_pairs",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "   orientation edge1 edge2  pos1  pos2\n0            c     (     )    35    44\n1            c     (     )    34    45\n2            c     (     )    33    46\n3            c     (     )    31    48\n4            c     (     )    30    49\n5            c     (     )    26    55\n6            c     (     )    25    56\n7            c     (     )    23    58\n8            c     (     )    22    59\n9            c     (     )    21    60\n10           c     (     )    20    61\n11           c     (     )    19    65\n12           c     (     )    18    66\n13           c     (     )    17    67\n14           c     (     )    99   103\n15           c     (     )    98   104\n16           c     (     )    97   105\n17           c     (     )    96   106\n18           c     (     )    95   107\n19           c     (     )    91   113\n20           c     (     )    90   114\n21           c     (     )    89   115\n22           c     (     )    88   116\n23           c     (     )    83   119\n24           c     (     )    82   120\n25           c     (     )    81   121\n26           c     (     )    78   124\n27           c     (     )    77   125\n28           c     (     )    76   126\n29           c     (     )    73   134\n30           c     (     )    72   135\n31           c     (     )    71   136\n32           c     (     )    70   137\n33           c     (     )    11   146\n34           c     (     )    10   147\n35           c     (     )     9   148\n36           c     (     )     8   149\n\n[37 rows x 5 columns]\n"
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": "Creation of secondary structures from databases"
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "PyRNA allows to recover Rfam entries directly from the database website."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "from pyrna.db import Rfam\nrfam = Rfam(use_website = True)",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "The function get_entry() returns:\n    \n* a list of gapped RNA objects\n* a dictionary to map species labels to name/start-end\n* a pandas Dataframe listing the paired columns in the Rfam alignment (a.k.a a consensus secondary structure)"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "gapped_rnas, organism_names_2_nse, consensus_2d = rfam.get_entry(rfam_id='RF00059')\n\nfor gapped_rna in gapped_rnas[:10]:\n    print \"sequence for %s: %s\"%(gapped_rna.name, gapped_rna.sequence)",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "sequence for M.leprae.1: AAAAAACCACGCGGGAGCACACACCA------------------------------------------------------------------------------------------------------------------------AGUGC-GCUGAGAGGACGGAUCGGGG------------------------------------------------------------------------------CCGUC-GA-CCGUAUGA-ACCUGA--CCGGGUAAUGCCGGCG-UAGGGA-GAUGAAUAAUGACUGAAA\nsequence for M.leprae.2: ACUUCAGUACACGGGAGUCCCCAGGCAGUCCGAAUCU-------------------------------------------------------------------------------------------------------------GCGGG-GCUGAGAGUGGGCACAGCCAGAGUGACUGAACCU----------------------------------------------------------------ACCCUUA--CCCUCCCACACCUGA-UCCGGGUCAUGCCGGCG-AAGGGA-GGUUCAAGAUGCCGUCAG\nsequence for Sinorhizobium_melilo.1: GCUGCUCUAACGGGGUGCCCUGGCCGGCUUUGCGA---------------------------------------------------------------------------------------------------------------CCAUG-GCUGAGAGGCUUC--------------------------------------------------------------------------------------GAGCCAA-CCCGCGGA-ACCUGA-UCCGGCUCAUACCGGCG-GAGGGA--UUAGAAGCGAUCGGACC\nsequence for S.pneumoniae.1: UAAAGACAUUUGGGGUGCUU--------------------------------------------------------------------------------------------------------------------------------UAG-GCUGAGAU----------------------------------------------------------------------------------------------GAUA-CCCAUUGA-ACCUGA-UACAGUUAAGACUGGCG-AAGGGA--AAUGUGAAAUGUUUUUU\nsequence for Lactococcus_lactis_s.1: UAUUUGCACAAUGGGUCUAUUGACAAGUCUG-------------------------------------------------------------------------------------------------------------------UCAGU-AGCGAGAA-----------------------------------------------------------------------------------------------AUA-CCAUCUG--ACCUGA-UCUGGGUAAUGCCAGCG-UAGGAA--UGUGUUAAGACAAUCGU\nsequence for Corynebacterium_glut.1: AUACUAGGCACGGGGUGCCAACCGGAUGGAAAAAUUC-------------------------------------------------------------------------------------------------------------CGGAG-GCUGAGAAA----------------------------------------------------------------------------------------------ACA-CCCGUUGA-ACCUGC-UCUAGCUCGUACUAGCG-AAGGGA--UGGCCUUAACGUGGCUA\nsequence for C.callunae.1: GGCAGUCCCCACGGGCGCCCGAGC---------------------------------------------------------------------------------------------------------------------------ACGG-GCUGAGAUCGCGCUGAUUUUG-----------------------------------------------------------------------------CGCGAGCA-CCGUUUGA-ACCUG--UCCGGUUAGCACCGGCG-AAGGAA-GAGAGGAAUGGUGCAAUG\nsequence for Bacillus_subtilis_su.2: GAUUCAUCCUAGGGGUGCUUUG-----------------------------------------------------------------------------------------------------------------------------CGAA-GCUGAGAGAGACUU------------------------------------------------------------------------------------UGUCUCAA-CCCUUUUG-ACCUGA-UCUGGAUCAUGCCAGCG-GAGGGA--AGCGGUGAAAAGCGGAG\nsequence for Bacillus_subtilis_su.3: GAUGACCACAAGGGGAGCAUU-------------------------------------------------------------------------------------------------------------------------------AAA-GCUGAGAGUGAGCGGUUU--------------------------------------------------------------------------------CGUUCUGA-CCCUUUGA-ACCUG--U-UAGUUAACGCUGGCG-UAGGGA--UGUGGCAAAGUCAAAUG\nsequence for Bacillus_subtilis_su.1: UUUAACCACUAGGGGUGUCCUUCA--------------------------------------------------------------------------------------------------------------------------UAAGG-GCUGAGAUAAAAGUGUGA--------------------------------------------------------------------------------CUUUUAGA-CCCUCAUA-ACUUGA-ACAGGUUCAGACCUGCG-UAGGGA--AGUGGAGCGGUAUUUGU\n"
      }
     ],
     "prompt_number": 26
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "The function consensus2d_to_base_pairs() allows to easily infer the secondary structure for a given RNA sequence"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "from pyrna.parsers import consensus2d_to_base_pairs, to_bn\n\nfor gapped_rna in gapped_rnas[:10]:\n    rna = RNA(name = gapped_rna.name, sequence = gapped_rna.sequence.replace('-',''))\n    print rna.name\n    print rna.sequence\n    print to_bn(consensus2d_to_base_pairs(gapped_rna, consensus_2d), len(rna))",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "M.leprae.1\nAAAAAACCACGCGGGAGCACACACCAAGUGCGCUGAGAGGACGGAUCGGGGCCGUCGACCGUAUGAACCUGACCGGGUAAUGCCGGCGUAGGGAGAUGAAUAAUGACUGAAA\n.....(((((((((..(((((.......)))))......(((((.......)))))..)))).....((((.((((......))))..))))...)))))............"
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nM.leprae.2\nACUUCAGUACACGGGAGUCCCCAGGCAGUCCGAAUCUGCGGGGCUGAGAGUGGGCACAGCCAGAGUGACUGAACCUACCCUUACCCUCCCACACCUGAUCCGGGUCAUGCCGGCGAAGGGAGGUUCAAGAUGCCGUCAG\n.....(((((((((..(((((..................))))).....((((((.....................)))))).))))......((((..((((......))))..))))...)))))............"
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nSinorhizobium_melilo.1\nGCUGCUCUAACGGGGUGCCCUGGCCGGCUUUGCGACCAUGGCUGAGAGGCUUCGAGCCAACCCGCGGAACCUGAUCCGGCUCAUACCGGCGGAGGGAUUAGAAGCGAUCGGACC\n.....(((((((((..(((((................))))).....(((((.)))))..)))).....((((..((((......))))..))))..)))))............"
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nS.pneumoniae.1\nUAAAGACAUUUGGGGUGCUUUAGGCUGAGAUGAUACCCAUUGAACCUGAUACAGUUAAGACUGGCGAAGGGAAAUGUGAAAUGUUUUUU\n.....(((((((((..((((.)))).....(.)..)))).....((((..((((......))))..))))..)))))............"
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nLactococcus_lactis_s.1\nUAUUUGCACAAUGGGUCUAUUGACAAGUCUGUCAGUAGCGAGAAAUACCAUCUGACCUGAUCUGGGUAAUGCCAGCGUAGGAAUGUGUUAAGACAAUCGU\n.....(((((((((..(((((............))))).....()..))))....((((..((((......))))..))))..)))))............"
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nCorynebacterium_glut.1\nAUACUAGGCACGGGGUGCCAACCGGAUGGAAAAAUUCCGGAGGCUGAGAAAACACCCGUUGAACCUGCUCUAGCUCGUACUAGCGAAGGGAUGGCCUUAACGUGGCUA\n.....(((((((((..(((((..................))))).....(.)..)))).....((((..((((......))))..))))..)))))............"
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nC.callunae.1\nGGCAGUCCCCACGGGCGCCCGAGCACGGGCUGAGAUCGCGCUGAUUUUGCGCGAGCACCGUUUGAACCUGUCCGGUUAGCACCGGCGAAGGAAGAGAGGAAUGGUGCAAUG\n.....(((((((((..(((((....))))).....((((((........))))))..)))).....((((.((((......))))..))))...)))))............"
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nBacillus_subtilis_su.2\nGAUUCAUCCUAGGGGUGCUUUGCGAAGCUGAGAGAGACUUUGUCUCAACCCUUUUGACCUGAUCUGGAUCAUGCCAGCGGAGGGAAGCGGUGAAAAGCGGAG\n.....(((((((((..(((((..))))).....((((((.))))))..)))).....((((..((((......))))..))))..)))))............"
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nBacillus_subtilis_su.3\nGAUGACCACAAGGGGAGCAUUAAAGCUGAGAGUGAGCGGUUUCGUUCUGACCCUUUGAACCUGUUAGUUAACGCUGGCGUAGGGAUGUGGCAAAGUCAAAUG\n.....(((((((((..((((())))).....((((((.....))))))..)))).....((((.(((......)))...))))..)))))............"
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\nBacillus_subtilis_su.1\nUUUAACCACUAGGGGUGUCCUUCAUAAGGGCUGAGAUAAAAGUGUGACUUUUAGACCCUCAUAACUUGAACAGGUUCAGACCUGCGUAGGGAAGUGGAGCGGUAUUUGU\n.....(((((((((..(((((.....))))).....((((((.....))))))..)))).....((((..((((......))))..))))..)))))............"
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\n"
      }
     ],
     "prompt_number": 14
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": "Creation of secondary structures from algorithms"
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "To run computations on my own server, we need an api_key. This is a security to avoid flood from bots. This api_key will be used to instantiate classes from the module pyrna.computations. Without this key, the server will return an unauthorized error."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "from pyrna.computations import get_api_key\nrest_server = \"arn-ibmc.in2p3.fr\"\napi_key = get_api_key(rest_server)\nprint api_key",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "DY6JE44YOY\n"
      }
     ],
     "prompt_number": 15
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Now we can now launch some computations. First, we can predict a secondary structure from an RNA molecule. The secondary structure will be described in a pandas Dataframe."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "from pyrna.computations import Rnafold\nfrom pyrna.features import RNA\nrna = RNA(name = 'my_rna', sequence = 'GGGGTAGGGACGGTAGGGGGACGCAGTGCAGTAACGTACCCGGTAGGGGGTAGGGGGACGCAGTAACCCCGGGGACGCAGTAACCCCACGCAGTAACCCC')\nrnafold = Rnafold(rest_server = rest_server, api_key = api_key) #if we do Rnafold(), the algorithm will be launched locally\nbase_pairs = rnafold.fold(rna)\nprint base_pairs",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "   orientation edge1 edge2  pos1  pos2\n0            c     (     )    41    48\n1            c     (     )    40    49\n2            c     (     )    39    50\n3            c     (     )    38    51\n4            c     (     )    37    52\n5            c     (     )    56    67\n6            c     (     )    55    68\n7            c     (     )    54    69\n8            c     (     )    53    70\n9            c     (     )    29    77\n10           c     (     )    28    78\n11           c     (     )    27    79\n12           c     (     )    24    80\n13           c     (     )    23    81\n14           c     (     )    20    84\n15           c     (     )    19    85\n16           c     (     )    18    86\n17           c     (     )    17    87\n18           c     (     )    14    90\n19           c     (     )    13    91\n20           c     (     )    11    93\n21           c     (     )    10    94\n22           c     (     )     5    96\n23           c     (     )     4    97\n24           c     (     )     3    98\n25           c     (     )     2    99\n26           c     (     )     1   100\n\n[27 rows x 5 columns]\n"
      }
     ],
     "prompt_number": 16
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Contrafold is also available for computations on my own server."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "from pyrna.computations import Contrafold\ncontrafold = Contrafold(rest_server = rest_server, api_key = api_key)\nbase_pairs = contrafold.fold(rna)\nprint base_pairs",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "   orientation edge1 edge2  pos1  pos2\n0            c     (     )    41    48\n1            c     (     )    40    49\n2            c     (     )    39    50\n3            c     (     )    38    51\n4            c     (     )    37    52\n5            c     (     )    70    71\n6            c     (     )    69    72\n7            c     (     )    68    73\n8            c     (     )    67    74\n9            c     (     )    36    76\n10           c     (     )    35    77\n11           c     (     )    26    78\n12           c     (     )    24    80\n13           c     (     )    23    81\n14           c     (     )    19    84\n15           c     (     )    18    85\n16           c     (     )    17    86\n17           c     (     )    16    87\n18           c     (     )    12    89\n19           c     (     )    11    90\n20           c     (     )    10    94\n21           c     (     )     5    96\n22           c     (     )     4    97\n23           c     (     )     3    98\n24           c     (     )     2    99\n25           c     (     )     1   100\n\n[26 rows x 5 columns]\n"
      }
     ],
     "prompt_number": 17
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "We can also compute the 2D coordinates for a plot."
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": "from pyrna.computations import Rnaplot\nrnaplot = Rnaplot(api_key = api_key)\nplot = rnaplot.plot(secondary_structure = base_pairs, rna = rna)\nprint plot",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "          x        y\n1    65.468   -7.471\n2    65.468    7.529\n3    65.468   22.529\n4    65.468   37.529\n5    65.468   52.529\n6    58.013   57.898\n7    53.897   66.267\n8    54.204   75.728\n9    58.994   84.054\n10   67.254   89.223\n11   67.954  104.207\n12   67.954  119.207\n13   59.962  125.915\n14   57.013  136.155\n15   60.348  146.509\n16   69.069  153.376\n17   70.002  168.347\n18   70.934  183.318\n19   71.867  198.289\n20   62.010  206.746\n21   58.830  219.541\n22   63.724  231.993\n23   75.092  239.412\n24   76.490  254.347\n25   69.972  267.739\n26   78.962  280.759\n27   67.122  271.549\n28   52.140  270.819\n29   39.461  278.834\n30   33.693  292.680\n31   36.933  307.326\n32   48.004  317.448\n33   62.881  319.366\n34   76.157  312.384\n35   83.006  299.040\n36   88.039  313.170\n37   87.806  328.168\n38   82.336  342.135\n39   76.866  356.103\n40   71.396  370.070\n41   65.927  384.037\n42   51.347  390.137\n43   45.429  404.791\n44   51.683  419.306\n45   66.399  425.069\n46   80.846  418.662\n47   86.453  403.885\n48   79.894  389.507\n49   85.364  375.539\n50   90.833  361.572\n51   96.303  347.605\n52  101.773  333.638\n53  103.669  348.518\n54  105.565  363.397\n55  101.785  377.913\n56  104.727  392.622\n57  113.801  404.566\n58  127.182  411.345\n59  142.180  411.596\n60  155.780  405.269\n        ...      ...\n\n[100 rows x 2 columns]\n"
      }
     ],
     "prompt_number": 18
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "We can also compute a secondary structure by annotating a tertiary one. First, we need to import a 3D structure from the Protein Databank."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "from pyrna.db import PDB\nfrom pyrna.parsers import parse_pdb\npdb = PDB()\ntertiary_structures = parse_pdb(pdb.get_entry('1HR2'))",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 19
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "With PyRNA, a pyrna.features.TertiaryStructure object is made with a single molecular chain. Consequently, the function parse_pdb returns a list of such objects. We can iterate over these pyrna.features.TertiaryStructure objects to annotate them with the algorithm Rnaview (Rnaview is available for computations from my own server)."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "secondary_structures = []\nfor ts in tertiary_structures:\n    #the function annotate() from Rnaview returns a pyrna.features.SecondaryStructure object and its 3D counterpart as a pyrna.features.TertiaryStructure object\n    secondary_structure, tertiary_structure = Rnaview(rest_server = rest_server, api_key = api_key).annotate(ts)\n    secondary_structures.append(secondary_structure)\n    #the function secondary_structure_to_base_pairs() transform a pyrna.features.SecondaryStructure object into a list of base pairs stored in a pandas Dataframe\n    print \"Molecular chain %s\"%secondary_structure.rna.name\n    print secondary_structure_to_base_pairs(secondary_structure, keep_tertiaries = True)",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "Molecular chain A\n   edge1 edge2 orientation  pos1  pos2\n0      (     )           c     4   112\n1      (     )           c     5   111\n2      (     )           c     6   110\n3      (     )           c     7   109\n4      (     )           c     8   108\n5      (     )           c     9   107\n6      (     )           c    10   106\n7      {     ]           t    11   105\n8      [     }           t    12   104\n9      (     )           c    14   103\n10     (     )           c    15   102\n11     (     )           c    16   101\n12     (     )           c    17   100\n13     (     )           c    18    99\n14     (     )           c    19    98\n15     (     )           c    24    94\n16     (     )           c    25    93\n17     (     )           c    26    92\n18     (     )           c    27    91\n19     (     )           c    30    89\n20     (     )           c    31    88\n21     (     )           c    32    87\n22     (     )           c    34    80\n23     (     )           c    35    79\n24     (     )           c    36    78\n25     [     }           t    37    62\n26     [     }           t    38    61\n27     (     )           c    39    60\n28     (     )           c    40    59\n29     (     )           c    41    58\n30     (     )           c    42    57\n31     (     )           c    43    56\n32     (     )           c    44    55\n33     (     )           c    45    54\n34     (     )           c    46    53\n35     (     )           c    47    52\n36     (     )           c    63    73\n37     (     )           c    64    72\n38     !     !           !    65    71\n39     (     )           c   117   150\n40     (     )           c   118   149\n41     (     )           c   119   148\n42     (     )           c   120   147\n43     (     )           c   124   144\n44     (     )           c   125   143\n45     (     )           c   126   142\n46     (     )           c   127   141\n47     (     )           c   128   140\n48     (     )           c   129   139\n49     (     )           c   130   138\n50     (     )           c   131   137\n51     (     )           c   132   136\n52     [     )           c     4   156\n53     (     ]           t    21    96\n54     (     }           t    22    99\n55     (     ]           c    33    85\n56     {     ]           t    48    51\n57     (     )           t    49   145\n58     {     }           t    51   147\n59     [     )           t    66    86\n     ...   ...         ...   ...   ...\n\n[83 rows x 5 columns]\nMolecular chain B"
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "\n   edge1 edge2 orientation  pos1  pos2\n0      (     )           c     2   115\n1      (     )           c     3   114\n2      (     )           c     4   113\n3      (     )           c     6   112\n4      (     )           c     7   111\n5      (     )           c     8   110\n6      (     )           c     9   109\n7      (     )           c    10   108\n8      (     )           c    11   107\n9      {     ]           t    12   106\n10     [     }           t    13   105\n11     (     )           c    15   104\n12     (     )           c    16   103\n13     (     )           c    17   102\n14     (     )           c    18   101\n15     (     )           c    19   100\n16     (     )           c    20    99\n17     (     )           c    25    95\n18     (     )           c    26    94\n19     (     )           c    27    93\n20     (     )           c    28    92\n21     (     )           c    30    91\n22     (     )           c    31    90\n23     (     )           c    32    89\n24     (     )           c    33    88\n25     (     )           c    35    81\n26     (     )           c    36    80\n27     (     )           c    37    79\n28     [     }           t    38    63\n29     [     }           t    39    62\n30     (     )           c    40    61\n31     (     )           c    41    60\n32     (     )           c    42    59\n33     (     )           c    43    58\n34     (     )           c    44    57\n35     (     )           c    45    56\n36     (     )           c    46    55\n37     (     )           c    47    54\n38     (     )           c    48    53\n39     (     )           c    64    74\n40     (     )           c    65    73\n41     !     !           !    66    72\n42     (     )           c   118   151\n43     (     )           c   119   150\n44     (     )           c   120   149\n45     (     )           c   121   148\n46     (     )           c   125   145\n47     (     )           c   126   144\n48     (     )           c   127   143\n49     (     )           c   128   142\n50     (     )           c   129   141\n51     (     )           c   130   140\n52     (     )           c   131   139\n53     (     )           c   132   138\n54     (     )           c   133   137\n55     (     }           c     1   118\n56     [     )           c     2   154\n57     [     )           c     4   156\n58     [     )           c     6   157\n59     [     )           c     7   158\n     ...   ...         ...   ...   ...\n\n[89 rows x 5 columns]\n"
      }
     ],
     "prompt_number": 20
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": "Mining of a Secondary Structure"
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Now we can search for all base pairs made with at least one Hoogsteen edge. Since the variable base_pairs stores a pandas Dataframe, we're using the operators available from pandas (http://pandas.pydata.org/pandas-docs/stable/indexing.html)"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "base_pairs = secondary_structure_to_base_pairs(secondary_structures[0], keep_tertiaries = True)\nprint base_pairs[(base_pairs['edge1'] == '[') | (base_pairs['edge2'] == ']')]",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "   edge1 edge2 orientation  pos1  pos2\n7      {     ]           t    11   105\n8      [     }           t    12   104\n25     [     }           t    37    62\n26     [     }           t    38    61\n52     [     )           c     4   156\n53     (     ]           t    21    96\n55     (     ]           c    33    85\n56     {     ]           t    48    51\n59     [     )           t    66    86\n60     {     ]           c    69    70\n64     {     ]           c   114   115\n65     (     ]           t   121   145\n66     {     ]           c   122   123\n\n[13 rows x 5 columns]\n"
      }
     ],
     "prompt_number": 21
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "A pyrna.features.SecondaryStructure can find all its junctions (meaning the group of single-strands linking helices). \n\nAccording to the number of single-strands making a junction, we distinguish:\n    \n* an apical looop (a single single-strand)\n* an inner loop (two single-strands)\n* an three-way junction (three single-strands)\n* a four-way junction (four single-strands)\n* a n-way junction (n single-strands)"
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "junctions = []\nfor ss in secondary_structures:\n    ss.find_junctions()\n    junctions += ss.junctions",
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 22
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "A junction is described with:\n\n* a list of single-strands making the junction\n* a location: the positions in sequence of the residues \n* a description: a string made with the single-strand sequences joined with single space characters. In the description of a junction, the character '-' means a direct link (phosphodiester bond) between two helices. Consequently, we have no single-strand linking these helices."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "print junctions[0]",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "{'single_strands': [{'name': 'SS_2', 'location': [13, 13]}], 'description': 'A -', 'location': [[12, 14], [103, 104]]}\n"
      }
     ],
     "prompt_number": 23
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Now we can search for all GNRA apical loops..."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "import re\nfor junction in junctions:\n    if re.match('G[AUGC][AG]A',junction['description']):\n        print \"Apical loop with sequence %s\"%junction['description']",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "Apical loop with sequence GAAA\nApical loop with sequence GAAA\n"
      }
     ],
     "prompt_number": 24
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": "Or for all three-way junctions..."
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": "for junction in junctions:\n    if len(junction['location']) == 3:\n        print \"Three-way junctions with sequence %s\"%junction['description']",
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": "Three-way junctions with sequence GUAU - -\nThree-way junctions with sequence GUAU - -\n"
      }
     ],
     "prompt_number": 25
    }
   ],
   "metadata": {}
  }
 ]
}