dandanxu/2015.04.07 ClinVar Demo.ipynb

## 2015.04.07 ClinVar Demo.ipynb
{
 "metadata": {
  "name": "",
  "signature": "sha256:feb89eea1576c3d4de4a1cc11e13606e2ebdbe1efd8fcdd714b1e36d4b578e6e"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Getting Started With SolveBio and ClinVar\n",
      "This demo shows how to use SolveBio to programmatically access ClinVar records and pull out individual submission details and PubMed IDs where available. To get started with SolveBio, sign up at https://www.solvebio.com/signup (it's free) and install our Python and/or Ruby clients: http://docs.solvebio.com/v1.0/docs/installation."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from solvebio import Dataset, Filter"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 19
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "See https://www.solvebio.com/library/ClinVar for documentation on each of these datasets. The clinvar and submissions datasets come from the ClinVar XML; the variants dataset comes from the ClinVar VCF. In the source data, genomic coordinates for the same variant occasionally differ slightly between these two formats."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "clinvar = Dataset.retrieve('Clinvar/Clinvar')\n",
      "variants = Dataset.retrieve('Clinvar/Variants')\n",
      "submissions = Dataset.retrieve('Clinvar/Submissions')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 22
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Any of our datasets that has a `genomic_coordinates` field can be queried by range. See http://docs.solvebio.com/v1.0/docs/tutorial for documentation. GRCh37/hg19 is the default genome build when none is specified. GRCh38/hg38 and NCBI36/hg18 are also supported where documented in the Data Library: https://www.solvebio.com/library."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "clinvar.query(genome_build='GRCh37').range(1,156104629,156104629, exact=True)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 23,
       "text": [
        "\n",
        "|                   Fields | Data                                              |\n",
        "|--------------------------+---------------------------------------------------|\n",
        "|             age_of_onset |                                                   |\n",
        "|            allele_origin | germline                                          |\n",
        "|           assertion_type | variation to disease                              |\n",
        "|    clinical_significance | Pathogenic                                        |\n",
        "|     cytogenetic_location | 1q22                                              |\n",
        "|             date_created | 2013-09-30                                        |\n",
        "|      date_evaluated_last | 2013-09-19                                        |\n",
        "|             date_updated | 2015-03-23                                        |\n",
        "|      disease_description | LMNA-related dilated cardiomyopathy (DCM) is  ... |\n",
        "|        disease_mechanism |                                                   |\n",
        "|             disease_name | Dilated cardiomyopathy 1A                         |\n",
        "|   disease_name_alternate | [u'CARDIOMYOPATHY, CONGESTIVE', u'CARDIOMYOPATHY, |\n",
        "|       disease_prevalence |                                                   |\n",
        "|           disease_symbol | CMD1A                                             |\n",
        "| disease_symbol_alternate | [u'IDC', u'CDCD1', u'DCM']                        |\n",
        "|           entrez_id_gene | [u'4000']                                         |\n",
        "|              gene_symbol | [u'LMNA']                                         |\n",
        "|      genomic_coordinates | {u'start': 156104629, u'stop': 156104629, u'build'|\n",
        "|                     hgvs | [u'p.Arg225X', u'NM_170707.2:c.673C>T', u'LRG_254p|\n",
        "|              hgvs_refseq | NM_005572.3:c.673C>T                              |\n",
        "|         location_genbank | NM_170707.2:EXON 4                                |\n",
        "|      mode_of_inheritance | [u'Autosomal dominant inheritance']               |\n",
        "|    molecular_consequence | nonsense                                          |\n",
        "|                  omim_id |                                                   |\n",
        "|                pubmed_id |                                                   |\n",
        "|            rcv_accession | RCV000056001                                      |\n",
        "|       rcv_accession_full | RCV000056001.3                                    |\n",
        "|    rcv_accession_version | 3                                                 |\n",
        "|            record_status | current                                           |\n",
        "|            review_status | classified by multiple submitters                 |\n",
        "|       review_status_star | 2                                                 |\n",
        "|                    rs_id | [u'rs60682848']                                   |\n",
        "|            scv_accession | [u'SCV000065052', u'SCV000087057']                |\n",
        "|        sequence_ontology | SO:0001587                                        |\n",
        "|                    title | NM_005572.3(LMNA):c.673C>T (p.Arg225Ter) AND Dilat|\n",
        "|             variant_type | single nucleotide variant                         |\n",
        "\n",
        "... 1 more results."
       ]
      }
     ],
     "prompt_number": 23
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next, for one specific ClinVar record (the first one in the query results), we get the SCV accessions and then look up the details of those submissions."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "scvs = clinvar.query(genome_build='GRCh37').range(1,156104629,156104629, exact=True)[0].get('scv_accession')\n",
      "print(scvs)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[u'SCV000065052', u'SCV000087057']\n"
       ]
      }
     ],
     "prompt_number": 33
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "submissions.query().filter(scv_accession__in=scvs)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 26,
       "text": [
        "\n",
        "|                        Fields | Data                                         |\n",
        "|-------------------------------+----------------------------------------------|\n",
        "|                assertion_type | variation to disease                         |\n",
        "|         clinical_significance | Pathogenic                                   |\n",
        "| clinical_significance_comment | The Arg225X variant in LMNA leads to a p ... |\n",
        "|           date_evaluated_last | 2012-08-15                                   |\n",
        "|                date_submitted | 2015-01-29                                   |\n",
        "|                  date_updated | 2015-02-28                                   |\n",
        "|                  disease_name | Cardiomyopathy, dilated, 1A                  |\n",
        "|                      evidence | [{u'origin': u'germline', u'species': u'human|\n",
        "|                          hgvs | [u'NC_000001.10:g.156104629C>T']             |\n",
        "|                     pubmed_id |                                              |\n",
        "|                 record_status | current                                      |\n",
        "|                 review_status | classified by single submitter               |\n",
        "|                 scv_accession | SCV000065052                                 |\n",
        "|            scv_accession_full | SCV000065052.2                               |\n",
        "|         scv_accession_version | 2                                            |\n",
        "|                     submitter | Laboratory for Molecular Medicine,Partners He|\n",
        "|                  submitter_id | 21766                                        |\n",
        "|                         title |                                              |\n",
        "\n",
        "... 1 more results."
       ]
      }
     ],
     "prompt_number": 26
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pubmed_ids = [submission_record.get('pubmed_id') for submission_record in submissions.query().filter(scv_accession__in=scvs)\n",
      "              if submission_record.get('pubmed_id') is not None]\n",
      "print(pubmed_ids)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[]\n"
       ]
      }
     ],
     "prompt_number": 32
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "These particular records did not have any PubMed IDs associated with them, but you can see how easily those details can be retrieved programmatically. There's lots you can do with SolveBio! Contact us at [dandan@solvebio.com](mailto:dandan@solvebio.com) for more info."
     ]
    }
   ],
   "metadata": {}
  }
 ]
}
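
The last code cell in the notebook collects PubMed IDs with a list comprehension over the submission records. As a rough sketch of the same idea, the helper below works on any dict-like records and needs no SolveBio connection. The function name `collect_pubmed_ids`, the sample accessions, and the sample IDs are all hypothetical, and it assumes that `pubmed_id` may hold either a single ID or a list of IDs, which the notebook output does not confirm.

```python
def collect_pubmed_ids(records):
    """Return the sorted, de-duplicated PubMed IDs found in dict-like records."""
    found = set()
    for record in records:
        value = record.get('pubmed_id')
        if value is None:
            continue
        # Assumption: the field may hold one ID or a list of IDs.
        values = value if isinstance(value, (list, tuple)) else [value]
        # Skip empty entries and normalize every ID to a string.
        found.update(str(v) for v in values if v not in (None, ''))
    return sorted(found)


# Hypothetical stand-ins for submission records (not real ClinVar data).
sample_records = [
    {'scv_accession': 'SCV_EXAMPLE_1', 'pubmed_id': None},
    {'scv_accession': 'SCV_EXAMPLE_2', 'pubmed_id': ['222', '111']},
    {'scv_accession': 'SCV_EXAMPLE_3', 'pubmed_id': '111'},
]

print(collect_pubmed_ids(sample_records))
```

If the records returned by `submissions.query().filter(scv_accession__in=scvs)` behave like dictionaries, as the `.get('pubmed_id')` calls in the notebook suggest, the query result could be passed to the helper in place of `sample_records`.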