johnsolk/agalma_tutorial_March2018.ipynb

## agalma_tutorial_March2018.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exploring a phylotranscriptomics workflow with [agalma](https://bitbucket.org/caseywdunn/agalma)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* [Agalma tutorial](https://bitbucket.org/caseywdunn/agalma/src/master/TUTORIAL.md) by the [Dunn lab](http://dunnlab.org/)\n",
    "* Followed \"Quick Install - Anaconda Python\" [Installation instructions](https://bitbucket.org/caseywdunn/agalma)\n",
    "* Started an `m1.medium` instance (CPU: 6, Mem: 16 GB, Disk: 60 GB) with Ubuntu 16.04 on [Jetstream](https://use.jetstream-cloud.org/application/images)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install [jupyter notebook](https://github.com/ngs-docs/2018-ggg201b/blob/master/lab5-assembly-eval/README.md) on the instance\n",
    "\n",
    "```\n",
    "pip install jupyter\n",
    "```\n",
    "Then generate a default config file:\n",
    "```\n",
    "jupyter notebook --generate-config\n",
    "```\n",
    "Then append the notebook settings to the config file. (Note: the password hash protects the notebook.)\n",
    "```\n",
    "cat >> ~/.jupyter/jupyter_notebook_config.py <<EOF\n",
    "c = get_config()\n",
    "c.NotebookApp.ip = '*'\n",
    "c.NotebookApp.open_browser = False\n",
    "c.NotebookApp.password = u'sha1:5d813e5d59a7:b4e430cf6dbd1aad04838c6e9cf684f4d76e245c'\n",
    "c.NotebookApp.port = 8000\n",
    "\n",
    "EOF\n",
    "```\n",
    "Now, run it!\n",
    "\n",
    "```\n",
    "jupyter notebook &\n",
    "\n",
    "```\n",
    "(Press `Enter` to return to the command line.)\n",
    "\n",
    "You can figure out what Web address to connect to this way:\n",
    "\n",
    "```\n",
    "echo http://$(hostname):8000/\n",
    "```"
   ]
  },
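  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `c.NotebookApp.password` value in the config above has the form `algorithm:salt:digest`. As a minimal sketch of how such a string can be produced, assuming the classic notebook scheme (a SHA-1 digest over the passphrase followed by a random 12-character hex salt), the helpers below build and check one with the standard library only. The names `hash_passphrase` and `check_passphrase` are hypothetical; on a real install, `notebook.auth.passwd()` is the supported way to generate the hash.\n",
    "\n",
    "```python\n",
    "import binascii\n",
    "import hashlib\n",
    "import os\n",
    "\n",
    "\n",
    "def hash_passphrase(passphrase, salt=None):\n",
    "    # Hypothetical helper: returns a string of the form sha1:<salt>:<digest>.\n",
    "    if salt is None:\n",
    "        # 6 random bytes give 12 hex characters of salt.\n",
    "        salt = binascii.hexlify(os.urandom(6)).decode('ascii')\n",
    "    digest = hashlib.sha1((passphrase + salt).encode('utf-8')).hexdigest()\n",
    "    return ':'.join(['sha1', salt, digest])\n",
    "\n",
    "\n",
    "def check_passphrase(hashed, passphrase):\n",
    "    # Recompute the digest with the stored salt and compare whole strings.\n",
    "    algorithm, salt, _ = hashed.split(':')\n",
    "    return algorithm == 'sha1' and hash_passphrase(passphrase, salt) == hashed\n",
    "\n",
    "\n",
    "hashed = hash_passphrase('choose-your-own-passphrase')\n",
    "print(hashed)\n",
    "```\n",
    "\n",
    "The printed string can be pasted into `jupyter_notebook_config.py` as the `c.NotebookApp.password` value, so the notebook is protected by a passphrase of your own choosing."
   ]
  },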
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Test to make sure agalma installation worked:\n",
    "```\n",
    "mkdir ~/tmp\n",
    "cd ~/tmp\n",
    "agalma test\n",
    "```\n",
    "[Test ran successfully.](https://gist.github.com/ljcohen/0aa14000bfbfec62fffe4893684e1bb6)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Run the [agalma tutorial](https://bitbucket.org/caseywdunn/agalma/src/master/TUTORIAL.md?fileviewer=file-view-default)\n",
    "Download test data:\n",
    "```\n",
    "cd agalma/data\n",
    "agalma testdata\n",
    "```\n",
    "This prints the filenames of the test data.\n",
    "\n",
    "Default threads and memory on the `m1.medium` machine:\n",
    "```\n",
    "agalma -t 4 -m 14G\n",
    "```\n",
    "Then proceeded with tutorial steps:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n",
      "HWI-ST625-73-C0JUVACXX-7 [2018-03-05 20:43:28]\n",
      "/home/ljcohen/SRX288285_1.fq (17.3 MB)\n",
      "/home/ljcohen/SRX288285_2.fq (17.3 MB)\n",
      "  species: Agalma elegans\n",
      "  ncbi_id: 316166\n",
      "  itis_id: 51383\n",
      "  extraction_id: None\n",
      "  library_id: None\n",
      "  library_type: None\n",
      "  individual: None\n",
      "  treatment: None\n",
      "  sequencer: None\n",
      "  seq_center: None\n",
      "  note: None\n",
      "  sample_prep: None\n"
     ]
    }
   ],
   "source": [
    "!agalma catalog insert --paths SRX288285_1.fq SRX288285_2.fq --species \"Agalma elegans\" --ncbi_id 316166 --itis_id 51383"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## QC"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/qc-1'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / qc.setup_data / 0.189s / 110.4MB\n",
      "  Setup paths to the FASTQ input sequence data\n",
      "biolite.pipeline.setup_data: reading data from paths in catalog\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / qc.fastqc / 0.191s / 110.6MB\n",
      "  Generate FastQC reports for each FASTQ file\n",
      "biolite.pipeline.run: \n",
      "  STAGE 2 / qc.parse / 8.276s / 543.4MB\n",
      "  Parse FastQC reports into the database\n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 8.290s / 547.2MB\n"
     ]
    }
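   ],
   "source": [
    "# Note: `!cd` runs in a subshell and does not change the notebook's working directory (the log above shows output under /home/ljcohen); `%cd` would persist.\n",
    "!cd ~/agalma/scratch\n",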
    "!agalma qc --id HWI-ST625-73-C0JUVACXX-7"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transcriptome subset assembly and exemplar contig identification"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-2'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / insert_size.setup_data / -0.552s / 124.0MB\n",
      "  Setup paths to the FASTQ input sequence data\n",
      "biolite.pipeline.setup_data: reading data from paths in catalog\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / insert_size.assemble_subset / -0.550s / 124.0MB\n",
      "  Assemble a subset of high quality reads\n",
      "biolite.pipeline.run: \n",
      "  STAGE 2 / insert_size.estimate_insert / 51.403s / 1515.7MB\n",
      "  Estimate insert size by mapping the subset against the assembly\n",
      "biolite.pipeline.run: \n",
      "  STAGE 3 / rrna.assemble_subsets / 57.108s / 1516.1MB\n",
      "  Assemble subsets of increasing numbers of reads\n",
      "biolite.pipeline.run: \n",
      "  STAGE 4 / rrna.blast_transcripts / 182.927s / 1517.6MB\n",
      "  Blast transcripts against known rRNA database\n",
      "biolite.pipeline.run: \n",
      "  STAGE 5 / rrna.find_exemplars / 183.321s / 1517.7MB\n",
      "  Parse blast output for exemplar rRNA sequences\n",
      "agalma.rrna.find_exemplars: selecting an exemplar for gene target large-mito-rRNA\n",
      "agalma.rrna.find_exemplars: large-mito-rRNA not found in the assembly, skipping\n",
      "agalma.rrna.find_exemplars: selecting an exemplar for gene target large-nuclear-rRNA\n",
      "agalma.rrna.find_exemplars: large-nuclear-rRNA not found in the assembly, skipping\n",
      "agalma.rrna.find_exemplars: selecting an exemplar for gene target small-mito-rRNA\n",
      "agalma.rrna.find_exemplars: small-mito-rRNA not found in the assembly, skipping\n",
      "agalma.rrna.find_exemplars: selecting an exemplar for gene target small-nuclear-rRNA\n",
      "agalma.rrna.find_exemplars: small-nuclear-rRNA not found in the assembly, skipping\n",
      "biolite.pipeline.run: \n",
      "  STAGE 6 / rrna.map_reads / 183.344s / 1517.9MB\n",
      "  Map reads against rRNA exemplars\n",
      "agalma.rrna.map_reads: no rRNA exemplars were found... skipping\n",
      "biolite.pipeline.run: \n",
      "  STAGE 7 / rrna.exclude_reads / 183.346s / 1517.9MB\n",
      "  Exclude pairs where either read maps to an rRNA exemplar\n",
      "agalma.rrna.exclude_reads: no rRNA exemplars were found... skipping\n",
      "biolite.pipeline.run: \n",
      "  STAGE 8 / transcriptome.assemble_connector / 183.348s / 1517.9MB\n",
      "  [connector between \"rrna\" and \"assemble\"]\n",
      "biolite.pipeline.run: \n",
      "  STAGE 9 / assemble.setup_rrna / 183.349s / 1517.9MB\n",
      "  Retrieve the rRNA exemplars from the database\n",
      "agalma.assemble.setup_rrna: no previous rrna run found for id HWI-ST625-73-C0JUVACXX-7\n",
      "biolite.pipeline.run: \n",
      "  STAGE 10 / assemble.filter_data / 183.351s / 1518.0MB\n",
      "  Filter out low-quality reads\n",
      "biolite.pipeline.run: \n",
      "  STAGE 11 / assemble.assemble / 186.372s / 1518.0MB\n",
      "  Assemble the filtered reads with Trinity\n",
      "biolite.pipeline.run: \n",
      "  STAGE 12 / assemble.parse_assembly / 258.908s / 1576.5MB\n",
      "  Parse the assembly into the sequences table\n",
      "biolite.pipeline.run: \n",
      "  STAGE 13 / assemble.remove_vectors / 258.922s / 1577.3MB\n",
      "  Remove vector contaminants with UniVec\n",
      "biolite.utils.safe_mkdir: creating directory 'univec'\n",
      "biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-2/univec' already exists\n",
      "agalma.assemble.remove_vectors: found 0 vector contaminants\n",
      "biolite.pipeline.run: \n",
      "  STAGE 14 / assemble.remove_rrna / 259.470s / 1577.5MB\n",
      "  Remove rRNA using curated and exemplar sequences\n",
      "biolite.utils.safe_mkdir: creating directory 'rrna'\n",
      "biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-2/rrna' already exists\n",
      "agalma.assemble.remove_rrna: found 0 ribosomal RNAs\n",
      "biolite.pipeline.run: \n",
      "  STAGE 15 / assemble.estimate_confidence / 260.139s / 1577.5MB\n",
      "  Estimate coverage and confidence values for each transcript\n",
      "biolite.pipeline.run: \n",
      "  STAGE 16 / assemble.parse_confidence / 268.327s / 1577.5MB\n",
      "  Parse estimated confidence scores and update database\n",
      "biolite.pipeline.run: \n",
      "  STAGE 17 / transcriptome.write_sequences / 268.331s / 1577.5MB\n",
      "  Write assembled sequences to FASTA\n",
      "biolite.pipeline.run: \n",
      "  STAGE 18 / translate.identify_orfs / 268.334s / 1577.5MB\n",
      "  Identify long open reading frames\n",
      "biolite.pipeline.run: \n",
      "  STAGE 19 / translate.annotate_orfs / 268.671s / 1577.5MB\n",
      "  Blastp protein sequences against SwissProt\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-2/blastp'\n",
      "biolite.pipeline.run: \n",
      "  STAGE 20 / translate.select_orfs / 350.139s / 1577.5MB\n",
      "  Select the open reading frame with the best evalue\n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 350.153s / 1577.5MB\n"
     ]
    }
   ],
   "source": [
    "!agalma transcriptome --id HWI-ST625-73-C0JUVACXX-7"
   ]
  },
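  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The time on each `STAGE` line in the pipeline logs appears to be the cumulative wall-clock time at which that stage started, so the cost of a stage would be the difference to the next line. A minimal sketch under that assumption, using the `STAGE n / name / time / memory` layout shown in the logs; `stage_durations` is a hypothetical helper, not part of agalma or biolite, and the sample lines are copied from the QC run earlier in this notebook.\n",
    "\n",
    "```python\n",
    "def stage_durations(log_text):\n",
    "    # Collect (name, start_seconds) for every STAGE or FINISHED line.\n",
    "    marks = []\n",
    "    for line in log_text.splitlines():\n",
    "        parts = [p.strip() for p in line.split(' / ')]\n",
    "        if parts[0].startswith('STAGE ') and len(parts) == 4:\n",
    "            marks.append((parts[1], float(parts[2].rstrip('s'))))\n",
    "        elif parts[0] == 'FINISHED' and len(parts) == 3:\n",
    "            marks.append(('FINISHED', float(parts[1].rstrip('s'))))\n",
    "    # Assumed: a stage lasts from its own timestamp to the next one.\n",
    "    return [(name, round(nxt - start, 3))\n",
    "            for (name, start), (_, nxt) in zip(marks, marks[1:])]\n",
    "\n",
    "\n",
    "# Sample lines copied from the QC run.\n",
    "qc_log = '''\n",
    "  STAGE 0 / qc.setup_data / 0.189s / 110.4MB\n",
    "  STAGE 1 / qc.fastqc / 0.191s / 110.6MB\n",
    "  STAGE 2 / qc.parse / 8.276s / 543.4MB\n",
    "  FINISHED / 8.290s / 547.2MB\n",
    "'''\n",
    "\n",
    "for name, seconds in stage_durations(qc_log):\n",
    "    print('{0:20s} {1:8.3f} s'.format(name, seconds))\n",
    "```\n",
    "\n",
    "Under that reading, nearly all of the roughly 8 s QC run is spent in `qc.fastqc`. The same function can be applied to the transcriptome log to see which stages dominate the run time."
   ]
  },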
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/HWI-ST625-73-C0JUVACXX-7'\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/HWI-ST625-73-C0JUVACXX-7/css'\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/HWI-ST625-73-C0JUVACXX-7/img'\n",
      "agalma.agalma_report.report_runs: 1 has pipelines: qc\n",
      "agalma.agalma_report.report_runs: added qc report for 1\n",
      "agalma.agalma_report.report_runs: 2 has pipelines: assemble,translate,rrna,transcriptome,insert_size\n",
      "agalma.agalma_report.report_runs: added insert_size report for 2\n",
      "agalma.agalma_report.report_runs: added rrna report for 2\n",
      "agalma.agalma_report.report_runs: added assemble report for 2\n",
      "agalma.agalma_report.report_runs: added translate report for 2\n"
     ]
    }
   ],
   "source": [
    "!agalma report --id HWI-ST625-73-C0JUVACXX-7 --outdir reports/HWI-ST625-73-C0JUVACXX-7"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: directory 'reports/HWI-ST625-73-C0JUVACXX-7' already exists\n",
      "/home/ljcohen/miniconda2/envs/agalma/lib/python2.7/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family [u'Arial'] not found. Falling back to DejaVu Sans\n",
      "  (prop.get_family(), self.defaultFamily[fontext]))\n",
      "biolite.utils.safe_mkdir: directory '/home/ljcohen/reports/HWI-ST625-73-C0JUVACXX-7/css' already exists\n",
      "biolite.utils.safe_mkdir: directory '/home/ljcohen/reports/HWI-ST625-73-C0JUVACXX-7/img' already exists\n"
     ]
    }
   ],
   "source": [
    "!agalma resources --id HWI-ST625-73-C0JUVACXX-7 --outdir reports/HWI-ST625-73-C0JUVACXX-7"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n",
      "SRX288285 [2018-03-05 21:17:25]\n",
      "/home/ljcohen/SRX288285_1.fq (17.3 MB)\n",
      "/home/ljcohen/SRX288285_2.fq (17.3 MB)\n",
      "  species: Agalma elegans\n",
      "  ncbi_id: 316166\n",
      "  itis_id: None\n",
      "  extraction_id: None\n",
      "  library_id: None\n",
      "  library_type: None\n",
      "  individual: None\n",
      "  treatment: None\n",
      "  sequencer: None\n",
      "  seq_center: None\n",
      "  note: None\n",
      "  sample_prep: None\n",
      "biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n",
      "SRX288432 [2018-03-05 21:17:26]\n",
      "/home/ljcohen/SRX288432_1.fq (2.5 MB)\n",
      "/home/ljcohen/SRX288432_2.fq (2.5 MB)\n",
      "  species: Craseoa lathetica\n",
      "  ncbi_id: 316205\n",
      "  itis_id: None\n",
      "  extraction_id: None\n",
      "  library_id: None\n",
      "  library_type: None\n",
      "  individual: None\n",
      "  treatment: None\n",
      "  sequencer: None\n",
      "  seq_center: None\n",
      "  note: None\n",
      "  sample_prep: None\n",
      "biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n",
      "SRX288431 [2018-03-05 21:17:26]\n",
      "/home/ljcohen/SRX288431_1.fq (252 KB)\n",
      "/home/ljcohen/SRX288431_2.fq (252 KB)\n",
      "  species: Physalia physalis\n",
      "  ncbi_id: 168775\n",
      "  itis_id: None\n",
      "  extraction_id: None\n",
      "  library_id: None\n",
      "  library_type: None\n",
      "  individual: None\n",
      "  treatment: None\n",
      "  sequencer: None\n",
      "  seq_center: None\n",
      "  note: None\n",
      "  sample_prep: None\n",
      "biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n",
      "SRX288430 [2018-03-05 21:17:27]\n",
      "/home/ljcohen/SRX288430_1.fq (757 KB)\n",
      "/home/ljcohen/SRX288430_2.fq (757 KB)\n",
      "  species: Nanomia bijuga\n",
      "  ncbi_id: 168759\n",
      "  itis_id: None\n",
      "  extraction_id: None\n",
      "  library_id: None\n",
      "  library_type: None\n",
      "  individual: None\n",
      "  treatment: None\n",
      "  sequencer: None\n",
      "  seq_center: None\n",
      "  note: None\n",
      "  sample_prep: None\n",
      "biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n",
      "JGI_NEMVEC [2018-03-05 21:17:28]\n",
      "/home/ljcohen/JGI_NEMVEC.fa (22 KB)\n",
      "  species: Nematostella vectensis\n",
      "  ncbi_id: 45351\n",
      "  itis_id: None\n",
      "  extraction_id: None\n",
      "  library_id: None\n",
      "  library_type: None\n",
      "  individual: None\n",
      "  treatment: None\n",
      "  sequencer: None\n",
      "  seq_center: None\n",
      "  note: None\n",
      "  sample_prep: None\n",
      "biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n",
      "NCBI_HYDMAG [2018-03-05 21:17:29]\n",
      "/home/ljcohen/NCBI_HYDMAG.pfa (7 KB)\n",
      "  species: Hydra magnipapillata\n",
      "  ncbi_id: 6085\n",
      "  itis_id: None\n",
      "  extraction_id: None\n",
      "  library_id: None\n",
      "  library_type: None\n",
      "  individual: None\n",
      "  treatment: None\n",
      "  sequencer: None\n",
      "  seq_center: None\n",
      "  note: None\n",
      "  sample_prep: None\n"
     ]
    }
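   ],
   "source": [
    "# Note: `!cd` runs in a subshell and does not persist, so the relative paths below resolve against the notebook's working directory (/home/ljcohen in the output above).\n",
    "!cd ~/agalma/data\n",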
    "!agalma catalog insert --id SRX288285 --paths SRX288285_1.fq SRX288285_2.fq --species \"Agalma elegans\" --ncbi_id 316166\n",
    "!agalma catalog insert --id SRX288432 --paths SRX288432_1.fq SRX288432_2.fq --species \"Craseoa lathetica\" --ncbi_id 316205\n",
    "!agalma catalog insert --id SRX288431 --paths SRX288431_1.fq SRX288431_2.fq --species \"Physalia physalis\" --ncbi_id 168775\n",
    "!agalma catalog insert --id SRX288430 --paths SRX288430_1.fq SRX288430_2.fq --species \"Nanomia bijuga\" --ncbi_id 168759\n",
    "!agalma catalog insert --id JGI_NEMVEC --paths JGI_NEMVEC.fa --species \"Nematostella vectensis\" --ncbi_id 45351\n",
    "!agalma catalog insert --id NCBI_HYDMAG --paths NCBI_HYDMAG.pfa --species \"Hydra magnipapillata\" --ncbi_id 6085"
   ]
  },
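  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The six `agalma catalog insert` calls above differ only in catalog ID, input files, species, and NCBI taxon ID, so they can be generated from a small table. A minimal sketch, assuming only the flags already used in the cell above (`--id`, `--paths`, `--species`, `--ncbi_id`); `DATASETS` and `catalog_args` are illustrative names, and the argument lists are printed rather than executed.\n",
    "\n",
    "```python\n",
    "# (catalog id, input files, species, NCBI taxon id), copied from the cell above.\n",
    "DATASETS = [\n",
    "    ('SRX288285', ['SRX288285_1.fq', 'SRX288285_2.fq'], 'Agalma elegans', 316166),\n",
    "    ('SRX288432', ['SRX288432_1.fq', 'SRX288432_2.fq'], 'Craseoa lathetica', 316205),\n",
    "    ('SRX288431', ['SRX288431_1.fq', 'SRX288431_2.fq'], 'Physalia physalis', 168775),\n",
    "    ('SRX288430', ['SRX288430_1.fq', 'SRX288430_2.fq'], 'Nanomia bijuga', 168759),\n",
    "    ('JGI_NEMVEC', ['JGI_NEMVEC.fa'], 'Nematostella vectensis', 45351),\n",
    "    ('NCBI_HYDMAG', ['NCBI_HYDMAG.pfa'], 'Hydra magnipapillata', 6085),\n",
    "]\n",
    "\n",
    "\n",
    "def catalog_args(dataset_id, paths, species, ncbi_id):\n",
    "    # Argument list for one insert; a list avoids shell-quoting the species name.\n",
    "    return (['agalma', 'catalog', 'insert', '--id', dataset_id, '--paths']\n",
    "            + list(paths)\n",
    "            + ['--species', species, '--ncbi_id', str(ncbi_id)])\n",
    "\n",
    "\n",
    "for dataset in DATASETS:\n",
    "    # Printed only; pass the list to subprocess.check_call to run it.\n",
    "    print(catalog_args(*dataset))\n",
    "```\n",
    "\n",
    "Keeping the metadata in one table makes it easier to spot a mismatched ID or taxon number than in six hand-typed command lines."
   ]
  },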
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-3'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / insert_size.setup_data / 0.127s / 123.8MB\n",
      "  Setup paths to the FASTQ input sequence data\n",
      "biolite.pipeline.setup_data: reading data from paths in catalog\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / insert_size.assemble_subset / 0.130s / 124.0MB\n",
      "  Assemble a subset of high quality reads\n",
      "biolite.pipeline.run: \n",
      "  STAGE 2 / insert_size.estimate_insert / 68.322s / 1568.6MB\n",
      "  Estimate insert size by mapping the subset against the assembly\n",
      "biolite.pipeline.run: \n",
      "  STAGE 3 / rrna.assemble_subsets / 74.084s / 1568.9MB\n",
      "  Assemble subsets of increasing numbers of reads\n",
      "biolite.pipeline.run: \n",
      "  STAGE 4 / rrna.blast_transcripts / 201.843s / 1569.4MB\n",
      "  Blast transcripts against known rRNA database\n",
      "biolite.pipeline.run: \n",
      "  STAGE 5 / rrna.find_exemplars / 202.224s / 1569.4MB\n",
      "  Parse blast output for exemplar rRNA sequences\n",
      "agalma.rrna.find_exemplars: selecting an exemplar for gene target large-mito-rRNA\n",
      "agalma.rrna.find_exemplars: large-mito-rRNA not found in the assembly, skipping\n",
      "agalma.rrna.find_exemplars: selecting an exemplar for gene target large-nuclear-rRNA\n",
      "agalma.rrna.find_exemplars: large-nuclear-rRNA not found in the assembly, skipping\n",
      "agalma.rrna.find_exemplars: selecting an exemplar for gene target small-mito-rRNA\n",
      "agalma.rrna.find_exemplars: small-mito-rRNA not found in the assembly, skipping\n",
      "agalma.rrna.find_exemplars: selecting an exemplar for gene target small-nuclear-rRNA\n",
      "agalma.rrna.find_exemplars: small-nuclear-rRNA not found in the assembly, skipping\n",
      "biolite.pipeline.run: \n",
      "  STAGE 6 / rrna.map_reads / 202.258s / 1569.7MB\n",
      "  Map reads against rRNA exemplars\n",
      "agalma.rrna.map_reads: no rRNA exemplars were found... skipping\n",
      "biolite.pipeline.run: \n",
      "  STAGE 7 / rrna.exclude_reads / 202.260s / 1569.7MB\n",
      "  Exclude pairs where either read maps to an rRNA exemplar\n",
      "agalma.rrna.exclude_reads: no rRNA exemplars were found... skipping\n",
      "biolite.pipeline.run: \n",
      "  STAGE 8 / transcriptome.assemble_connector / 202.263s / 1569.7MB\n",
      "  [connector between \"rrna\" and \"assemble\"]\n",
      "biolite.pipeline.run: \n",
      "  STAGE 9 / assemble.setup_rrna / 202.264s / 1569.7MB\n",
      "  Retrieve the rRNA exemplars from the database\n",
      "agalma.assemble.setup_rrna: no previous rrna run found for id SRX288285\n",
      "biolite.pipeline.run: \n",
      "  STAGE 10 / assemble.filter_data / 202.267s / 1569.7MB\n",
      "  Filter out low-quality reads\n",
      "biolite.pipeline.run: \n",
      "  STAGE 11 / assemble.assemble / 205.101s / 1569.8MB\n",
      "  Assemble the filtered reads with Trinity\n",
      "biolite.pipeline.run: \n",
      "  STAGE 12 / assemble.parse_assembly / 255.274s / 1569.8MB\n",
      "  Parse the assembly into the sequences table\n",
      "biolite.pipeline.run: \n",
      "  STAGE 13 / assemble.remove_vectors / 255.288s / 1570.7MB\n",
      "  Remove vector contaminants with UniVec\n",
      "biolite.utils.safe_mkdir: creating directory 'univec'\n",
      "biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-3/univec' already exists\n",
      "agalma.assemble.remove_vectors: found 0 vector contaminants\n",
      "biolite.pipeline.run: \n",
      "  STAGE 14 / assemble.remove_rrna / 255.858s / 1570.9MB\n",
      "  Remove rRNA using curated and exemplar sequences\n",
      "biolite.utils.safe_mkdir: creating directory 'rrna'\n",
      "biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-3/rrna' already exists\n",
      "agalma.assemble.remove_rrna: found 0 ribosomal RNAs\n",
      "biolite.pipeline.run: \n",
      "  STAGE 15 / assemble.estimate_confidence / 256.476s / 1570.9MB\n",
      "  Estimate coverage and confidence values for each transcript\n",
      "biolite.pipeline.run: \n",
      "  STAGE 16 / assemble.parse_confidence / 265.440s / 1570.9MB\n",
      "  Parse estimated confidence scores and update database\n",
      "biolite.pipeline.run: \n",
      "  STAGE 17 / transcriptome.write_sequences / 265.444s / 1570.9MB\n",
      "  Write assembled sequences to FASTA\n",
      "biolite.pipeline.run: \n",
      "  STAGE 18 / translate.identify_orfs / 265.448s / 1570.9MB\n",
      "  Identify long open reading frames\n",
      "biolite.pipeline.run: \n",
      "  STAGE 19 / translate.annotate_orfs / 265.697s / 1570.9MB\n",
      "  Blastp protein sequences against SwissProt\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-3/blastp'\n",
      "biolite.pipeline.run: \n",
      "  STAGE 20 / translate.select_orfs / 332.014s / 1570.9MB\n",
      "  Select the open reading frame with the best evalue\n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 332.025s / 1570.9MB\n",
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-4'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / insert_size.setup_data / 0.138s / 124.0MB\n",
      "  Setup paths to the FASTQ input sequence data\n",
      "biolite.pipeline.setup_data: reading data from paths in catalog\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / insert_size.assemble_subset / 0.141s / 124.2MB\n",
      "  Assemble a subset of high quality reads\n",
      "biolite.pipeline.run: \n",
      "  STAGE 2 / insert_size.estimate_insert / 13.886s / 719.2MB\n",
      "  Estimate insert size by mapping the subset against the assembly\n",
      "biolite.pipeline.run: \n",
      "  STAGE 3 / rrna.assemble_subsets / 14.671s / 719.5MB\n",
      "  Assemble subsets of increasing numbers of reads\n",
      "biolite.pipeline.run: \n",
      "  STAGE 4 / rrna.blast_transcripts / 108.762s / 722.0MB\n",
      "  Blast transcripts against known rRNA database\n",
      "biolite.pipeline.run: \n",
      "  STAGE 5 / rrna.find_exemplars / 109.173s / 722.0MB\n",
      "  Parse blast output for exemplar rRNA sequences\n",
      "agalma.rrna.find_exemplars: selecting an exemplar for gene target large-mito-rRNA\n",
      "agalma.rrna.find_exemplars: large-mito-rRNA not found in the assembly, skipping\n",
      "agalma.rrna.find_exemplars: selecting an exemplar for gene target large-nuclear-rRNA\n",
      "agalma.rrna.find_exemplars: large-nuclear-rRNA not found in the assembly, skipping\n",
      "agalma.rrna.find_exemplars: selecting an exemplar for gene target small-mito-rRNA\n",
      "agalma.rrna.find_exemplars: small-mito-rRNA not found in the assembly, skipping\n",
      "agalma.rrna.find_exemplars: selecting an exemplar for gene target small-nuclear-rRNA\n",
      "agalma.rrna.find_exemplars: small-nuclear-rRNA not found in the assembly, skipping\n",
      "biolite.pipeline.run: \n",
      "  STAGE 6 / rrna.map_reads / 109.213s / 722.4MB\n",
      "  Map reads against rRNA exemplars\n",
      "agalma.rrna.map_reads: no rRNA exemplars were found... skipping\n",
      "biolite.pipeline.run: \n",
      "  STAGE 7 / rrna.exclude_reads / 109.215s / 722.4MB\n",
      "  Exclude pairs where either read maps to an rRNA exemplar\n",
      "agalma.rrna.exclude_reads: no rRNA exemplars were found... skipping\n",
      "biolite.pipeline.run: \n",
      "  STAGE 8 / transcriptome.assemble_connector / 109.218s / 722.4MB\n",
      "  [connector between \"rrna\" and \"assemble\"]\n",
      "biolite.pipeline.run: \n",
      "  STAGE 9 / assemble.setup_rrna / 109.219s / 722.4MB\n",
      "  Retrieve the rRNA exemplars from the database\n",
      "agalma.assemble.setup_rrna: no previous rrna run found for id SRX288430\n",
      "biolite.pipeline.run: \n",
      "  STAGE 10 / assemble.filter_data / 109.222s / 722.4MB\n",
      "  Filter out low-quality reads\n",
      "biolite.pipeline.run: \n",
      "  STAGE 11 / assemble.assemble / 109.432s / 722.5MB\n",
      "  Assemble the filtered reads with Trinity\n",
      "biolite.pipeline.run: \n",
      "  STAGE 12 / assemble.parse_assembly / 121.188s / 722.5MB\n",
      "  Parse the assembly into the sequences table\n",
      "biolite.pipeline.run: \n",
      "  STAGE 13 / assemble.remove_vectors / 121.197s / 723.5MB\n",
      "  Remove vector contaminants with UniVec\n",
      "biolite.utils.safe_mkdir: creating directory 'univec'\n",
      "biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-4/univec' already exists\n",
      "agalma.assemble.remove_vectors: found 0 vector contaminants\n",
      "biolite.pipeline.run: \n",
      "  STAGE 14 / assemble.remove_rrna / 121.657s / 723.6MB\n",
      "  Remove rRNA using curated and exemplar sequences\n",
      "biolite.utils.safe_mkdir: creating directory 'rrna'\n",
      "biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-4/rrna' already exists\n",
      "agalma.assemble.remove_rrna: found 0 ribosomal RNAs\n",
      "biolite.pipeline.run: \n",
      "  STAGE 15 / assemble.estimate_confidence / 122.221s / 723.7MB\n",
      "  Estimate coverage and confidence values for each transcript\n",
      "biolite.pipeline.run: \n",
      "  STAGE 16 / assemble.parse_confidence / 123.356s / 723.7MB\n",
      "  Parse estimated confidence scores and update database\n",
      "biolite.pipeline.run: \n",
      "  STAGE 17 / transcriptome.write_sequences / 123.360s / 723.7MB\n",
      "  Write assembled sequences to FASTA\n",
      "biolite.pipeline.run: \n",
      "  STAGE 18 / translate.identify_orfs / 123.363s / 723.7MB\n",
      "  Identify long open reading frames\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "biolite.pipeline.run: \n",
      "  STAGE 19 / translate.annotate_orfs / 123.616s / 723.7MB\n",
      "  Blastp protein sequences against SwissProt\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-4/blastp'\n",
      "biolite.pipeline.run: \n",
      "  STAGE 20 / translate.select_orfs / 156.639s / 723.7MB\n",
      "  Select the open reading frame with the best evalue\n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 156.651s / 723.7MB\n",
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-5'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / insert_size.setup_data / 0.307s / 123.9MB\n",
      "  Setup paths to the FASTQ input sequence data\n",
      "biolite.pipeline.setup_data: reading data from paths in catalog\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / insert_size.assemble_subset / 0.308s / 124.1MB\n",
      "  Assemble a subset of high quality reads\n",
      "biolite.pipeline.run: \n",
      "  STAGE 2 / insert_size.estimate_insert / 11.024s / 719.2MB\n",
      "  Estimate insert size by mapping the subset against the assembly\n",
      "biolite.pipeline.run: \n",
      "  STAGE 3 / rrna.assemble_subsets / 11.707s / 719.6MB\n",
      "  Assemble subsets of increasing numbers of reads\n",
      "biolite.pipeline.run: \n",
      "  STAGE 4 / rrna.blast_transcripts / 79.731s / 720.1MB\n",
      "  Blast transcripts against known rRNA database\n",
      "biolite.pipeline.run: \n",
      "  STAGE 5 / rrna.find_exemplars / 80.083s / 720.1MB\n",
      "  Parse blast output for exemplar rRNA sequences\n",
      "agalma.rrna.find_exemplars: selecting an exemplar for gene target large-mito-rRNA\n",
      "agalma.rrna.find_exemplars: large-mito-rRNA not found in the assembly, skipping\n",
      "agalma.rrna.find_exemplars: selecting an exemplar for gene target large-nuclear-rRNA\n",
      "agalma.rrna.find_exemplars: large-nuclear-rRNA not found in the assembly, skipping\n",
      "agalma.rrna.find_exemplars: selecting an exemplar for gene target small-mito-rRNA\n",
      "agalma.rrna.find_exemplars: small-mito-rRNA not found in the assembly, skipping\n",
      "agalma.rrna.find_exemplars: selecting an exemplar for gene target small-nuclear-rRNA\n",
      "agalma.rrna.find_exemplars: small-nuclear-rRNA not found in the assembly, skipping\n",
      "biolite.pipeline.run: \n",
      "  STAGE 6 / rrna.map_reads / 80.102s / 720.3MB\n",
      "  Map reads against rRNA exemplars\n",
      "agalma.rrna.map_reads: no rRNA exemplars were found... skipping\n",
      "biolite.pipeline.run: \n",
      "  STAGE 7 / rrna.exclude_reads / 80.104s / 720.3MB\n",
      "  Exclude pairs where either read maps to an rRNA exemplar\n",
      "agalma.rrna.exclude_reads: no rRNA exemplars were found... skipping\n",
      "biolite.pipeline.run: \n",
      "  STAGE 8 / transcriptome.assemble_connector / 80.106s / 720.3MB\n",
      "  [connector between \"rrna\" and \"assemble\"]\n",
      "biolite.pipeline.run: \n",
      "  STAGE 9 / assemble.setup_rrna / 80.107s / 720.3MB\n",
      "  Retrieve the rRNA exemplars from the database\n",
      "agalma.assemble.setup_rrna: no previous rrna run found for id SRX288431\n",
      "biolite.pipeline.run: \n",
      "  STAGE 10 / assemble.filter_data / 80.109s / 720.3MB\n",
      "  Filter out low-quality reads\n",
      "biolite.pipeline.run: \n",
      "  STAGE 11 / assemble.assemble / 80.224s / 720.5MB\n",
      "  Assemble the filtered reads with Trinity\n",
      "biolite.pipeline.run: \n",
      "  STAGE 12 / assemble.parse_assembly / 93.242s / 720.5MB\n",
      "  Parse the assembly into the sequences table\n",
      "biolite.pipeline.run: \n",
      "  STAGE 13 / assemble.remove_vectors / 93.250s / 721.6MB\n",
      "  Remove vector contaminants with UniVec\n",
      "biolite.utils.safe_mkdir: creating directory 'univec'\n",
      "biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-5/univec' already exists\n",
      "agalma.assemble.remove_vectors: found 0 vector contaminants\n",
      "biolite.pipeline.run: \n",
      "  STAGE 14 / assemble.remove_rrna / 93.776s / 721.7MB\n",
      "  Remove rRNA using curated and exemplar sequences\n",
      "biolite.utils.safe_mkdir: creating directory 'rrna'\n",
      "biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-5/rrna' already exists\n",
      "agalma.assemble.remove_rrna: found 0 ribosomal RNAs\n",
      "biolite.pipeline.run: \n",
      "  STAGE 15 / assemble.estimate_confidence / 94.345s / 721.7MB\n",
      "  Estimate coverage and confidence values for each transcript\n",
      "biolite.pipeline.run: \n",
      "  STAGE 16 / assemble.parse_confidence / 95.076s / 721.7MB\n",
      "  Parse estimated confidence scores and update database\n",
      "biolite.pipeline.run: \n",
      "  STAGE 17 / transcriptome.write_sequences / 95.079s / 721.7MB\n",
      "  Write assembled sequences to FASTA\n",
      "biolite.pipeline.run: \n",
      "  STAGE 18 / translate.identify_orfs / 95.083s / 721.7MB\n",
      "  Identify long open reading frames\n",
      "biolite.pipeline.run: \n",
      "  STAGE 19 / translate.annotate_orfs / 95.258s / 721.7MB\n",
      "  Blastp protein sequences against SwissProt\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-5/blastp'\n",
      "biolite.pipeline.run: \n",
      "  STAGE 20 / translate.select_orfs / 112.402s / 721.7MB\n",
      "  Select the open reading frame with the best evalue\n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 112.406s / 721.7MB\n",
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-6'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / insert_size.setup_data / 0.130s / 124.1MB\n",
      "  Setup paths to the FASTQ input sequence data\n",
      "biolite.pipeline.setup_data: reading data from paths in catalog\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / insert_size.assemble_subset / 0.132s / 124.2MB\n",
      "  Assemble a subset of high quality reads\n",
      "biolite.pipeline.run: \n",
      "  STAGE 2 / insert_size.estimate_insert / 24.007s / 734.4MB\n",
      "  Estimate insert size by mapping the subset against the assembly\n",
      "biolite.pipeline.run: \n",
      "  STAGE 3 / rrna.assemble_subsets / 25.524s / 734.7MB\n",
      "  Assemble subsets of increasing numbers of reads\n",
      "biolite.pipeline.run: \n",
      "  STAGE 4 / rrna.blast_transcripts / 156.028s / 781.3MB\n",
      "  Blast transcripts against known rRNA database\n",
      "biolite.pipeline.run: \n",
      "  STAGE 5 / rrna.find_exemplars / 156.463s / 781.3MB\n",
      "  Parse blast output for exemplar rRNA sequences\n",
      "agalma.rrna.find_exemplars: selecting an exemplar for gene target large-mito-rRNA\n",
      "agalma.rrna.find_exemplars: large-mito-rRNA not found in the assembly, skipping\n",
      "agalma.rrna.find_exemplars: selecting an exemplar for gene target large-nuclear-rRNA\n",
      "agalma.rrna.find_exemplars: large-nuclear-rRNA not found in the assembly, skipping\n",
      "agalma.rrna.find_exemplars: selecting an exemplar for gene target small-mito-rRNA\n",
      "agalma.rrna.find_exemplars: small-mito-rRNA not found in the assembly, skipping\n",
      "agalma.rrna.find_exemplars: selecting an exemplar for gene target small-nuclear-rRNA\n",
      "agalma.rrna.find_exemplars: small-nuclear-rRNA not found in the assembly, skipping\n",
      "biolite.pipeline.run: \n",
      "  STAGE 6 / rrna.map_reads / 156.515s / 782.0MB\n",
      "  Map reads against rRNA exemplars\n",
      "agalma.rrna.map_reads: no rRNA exemplars were found... skipping\n",
      "biolite.pipeline.run: \n",
      "  STAGE 7 / rrna.exclude_reads / 156.517s / 782.0MB\n",
      "  Exclude pairs where either read maps to an rRNA exemplar\n",
      "agalma.rrna.exclude_reads: no rRNA exemplars were found... skipping\n",
      "biolite.pipeline.run: \n",
      "  STAGE 8 / transcriptome.assemble_connector / 156.519s / 782.0MB\n",
      "  [connector between \"rrna\" and \"assemble\"]\n",
      "biolite.pipeline.run: \n",
      "  STAGE 9 / assemble.setup_rrna / 156.520s / 782.0MB\n",
      "  Retrieve the rRNA exemplars from the database\n",
      "agalma.assemble.setup_rrna: no previous rrna run found for id SRX288432\n",
      "biolite.pipeline.run: \n",
      "  STAGE 10 / assemble.filter_data / 156.522s / 782.1MB\n",
      "  Filter out low-quality reads\n",
      "biolite.pipeline.run: \n",
      "  STAGE 11 / assemble.assemble / 156.956s / 782.1MB\n",
      "  Assemble the filtered reads with Trinity\n",
      "biolite.pipeline.run: \n",
      "  STAGE 12 / assemble.parse_assembly / 177.411s / 782.1MB\n",
      "  Parse the assembly into the sequences table\n",
      "biolite.pipeline.run: \n",
      "  STAGE 13 / assemble.remove_vectors / 177.421s / 783.0MB\n",
      "  Remove vector contaminants with UniVec\n",
      "biolite.utils.safe_mkdir: creating directory 'univec'\n",
      "biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-6/univec' already exists\n",
      "agalma.assemble.remove_vectors: found 0 vector contaminants\n",
      "biolite.pipeline.run: \n",
      "  STAGE 14 / assemble.remove_rrna / 177.858s / 783.2MB\n",
      "  Remove rRNA using curated and exemplar sequences\n",
      "biolite.utils.safe_mkdir: creating directory 'rrna'\n",
      "biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-6/rrna' already exists\n",
      "agalma.assemble.remove_rrna: found 0 ribosomal RNAs\n",
      "biolite.pipeline.run: \n",
      "  STAGE 15 / assemble.estimate_confidence / 178.400s / 783.2MB\n",
      "  Estimate coverage and confidence values for each transcript\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "biolite.pipeline.run: \n",
      "  STAGE 16 / assemble.parse_confidence / 180.418s / 783.2MB\n",
      "  Parse estimated confidence scores and update database\n",
      "biolite.pipeline.run: \n",
      "  STAGE 17 / transcriptome.write_sequences / 180.420s / 783.2MB\n",
      "  Write assembled sequences to FASTA\n",
      "biolite.pipeline.run: \n",
      "  STAGE 18 / translate.identify_orfs / 180.423s / 783.2MB\n",
      "  Identify long open reading frames\n",
      "biolite.pipeline.run: \n",
      "  STAGE 19 / translate.annotate_orfs / 180.730s / 783.2MB\n",
      "  Blastp protein sequences against SwissProt\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-6/blastp'\n",
      "biolite.pipeline.run: \n",
      "  STAGE 20 / translate.select_orfs / 287.948s / 783.2MB\n",
      "  Select the open reading frame with the best evalue\n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 287.961s / 783.2MB\n"
     ]
    }
   ],
   "source": [
    "%cd ~/agalma/scratch\n",
    "!agalma transcriptome --id SRX288285\n",
    "!agalma transcriptome --id SRX288430\n",
    "!agalma transcriptome --id SRX288431\n",
    "!agalma transcriptome --id SRX288432"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/import-7'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / import.setup_paths / 1.686s / 115.4MB\n",
      "  Determine the paths to the FASTA files\n",
      "__main__.setup_paths: found paths [u'/home/ljcohen/JGI_NEMVEC.fa']\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / import.parse_sequences / 1.688s / 115.4MB\n",
      "  Parse the sequences from the FASTA files\n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 1.699s / 116.7MB\n",
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/translate-8'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / translate.setup_sequences / 0.104s / 127.1MB\n",
      "  Locate a previous assemble or import run\n",
      "__main__.setup_sequences: using previous 'import' run id 7\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / translate.identify_orfs / 0.107s / 127.6MB\n",
      "  Identify long open reading frames\n",
      "biolite.pipeline.run: \n",
      "  STAGE 2 / translate.annotate_orfs / 0.275s / 154.9MB\n",
      "  Blastp protein sequences against SwissProt\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/translate-8/blastp'\n",
      "biolite.pipeline.run: \n",
      "  STAGE 3 / translate.select_orfs / 54.905s / 309.0MB\n",
      "  Select the open reading frame with the best evalue\n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 54.917s / 309.2MB\n",
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/annotate-9'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / annotate.setup_sequences / 0.137s / 123.9MB\n",
      "  Locate a previous import run\n",
      "__main__.setup_sequences: using previous 'import' run id 7\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / annotate.annotate / 0.140s / 124.6MB\n",
      "  Blastp protein sequences against SwissProt\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/annotate-9/blastp'\n",
      "biolite.pipeline.run: \n",
      "  STAGE 2 / annotate.parse / 42.673s / 313.9MB\n",
      "  Parse the annotations into the sequences table\n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 42.692s / 313.9MB\n"
     ]
    }
   ],
   "source": [
    "!agalma import --id JGI_NEMVEC\n",
    "!agalma translate --id JGI_NEMVEC\n",
    "!agalma annotate --id JGI_NEMVEC"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/import-10'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / import.setup_paths / 1.970s / 115.8MB\n",
      "  Determine the paths to the FASTA files\n",
      "__main__.setup_paths: found paths [u'/home/ljcohen/NCBI_HYDMAG.pfa']\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / import.parse_sequences / 1.972s / 115.8MB\n",
      "  Parse the sequences from the FASTA files\n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 1.983s / 117.3MB\n"
     ]
    }
   ],
   "source": [
    "!agalma import --id NCBI_HYDMAG --seq_type aa"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/annotate-11'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / annotate.setup_sequences / 0.147s / 123.8MB\n",
      "  Locate a previous import run\n",
      "__main__.setup_sequences: using previous 'import' run id 10\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / annotate.annotate / 0.150s / 124.5MB\n",
      "  Blastp protein sequences against SwissProt\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/annotate-11/blastp'\n",
      "biolite.pipeline.run: \n",
      "  STAGE 2 / annotate.parse / 56.932s / 307.5MB\n",
      "  Parse the annotations into the sequences table\n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 56.945s / 307.5MB\n"
     ]
    }
   ],
   "source": [
    "!agalma annotate --id NCBI_HYDMAG"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/homologize-12'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / homologize.init / 0.315s / 123.7MB\n",
      "  Determine the version of gene entries to use and lookup species data\n",
      "agalma.database.latest_genes_version: using default genes version 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / homologize.write_fasta / 0.320s / 124.2MB\n",
      "  Write sequences from the Agalma database to a FASTA file\n",
      "biolite.utils.safe_mkdir: creating directory 'blastp'\n",
      "biolite.pipeline.run: \n",
      "  STAGE 2 / homologize.prepare_blast / 0.324s / 125.0MB\n",
      "  Prepare all-by-all BLAST database and command list\n",
      "biolite.utils.safe_mkdir: directory '/home/ljcohen/homologize-12/blastp' already exists\n",
      "biolite.pipeline.run: \n",
      "  STAGE 3 / homologize.run_blast / 0.433s / 156.6MB\n",
      "  Run all-by-all BLAST\n",
      "biolite.pipeline.run: \n",
      "  STAGE 4 / homologize.parse_edges / 1.317s / 157.1MB\n",
      "  Parse BLAST hits into edges weighted by bitscore\n",
      "biolite.pipeline.run: \n",
      "  STAGE 5 / homologize.mcl_cluster / 1.363s / 157.4MB\n",
      "  Run mcl on all-by-all graph to form gene clusters\n",
      "biolite.pipeline.run: \n",
      "  STAGE 6 / homologize.load_mcl_cluster / 1.406s / 157.4MB\n",
      "  Load cluster file from mcl into homology database\n",
      "__main__.load_mcl_cluster: histogram of gene cluster sizes:\n",
      " 2\t:\t3\n",
      " 3\t:\t3\n",
      " 4\t:\t1\n",
      " 5\t:\t2\n",
      " 7\t:\t1\n",
      " 9\t:\t1\n",
      " 12\t:\t1\n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 1.412s / 157.7MB\n"
     ]
    }
   ],
   "source": [
    "!agalma homologize --id PhylogenyTest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/multalign-14'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / multalign.init / 0.098s / 115.4MB\n",
      "  Locate a previous homology or treeprune run\n",
      "__main__.init: using previous 'homologize' run id 13\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / multalign.select_clusters / 0.100s / 115.5MB\n",
      "  \n",
      "\tSelect a cluster for each homologize component that meets size, sequence\n",
      "\tlength, and composition requirements\n",
      "\t\n",
      "biolite.utils.safe_mkdir: creating directory 'clusters'\n",
      "agalma.database.select_homology_models: found the following taxa for homology id 13:\n",
      " Agalma_elegans (SRX288285)\n",
      " Nematostella_vectensis (JGI_NEMVEC)\n",
      " Agalma_elegans (HWI-ST625-73-C0JUVACXX-7)\n",
      " Craseoa_lathetica (SRX288432)\n",
      " Physalia_physalis (SRX288431)\n",
      " Nanomia_bijuga (SRX288430)\n",
      " Hydra_magnipapillata (NCBI_HYDMAG)\n",
      "biolite.pipeline.run: \n",
      "  STAGE 2 / multalign.align_sequences / 0.107s / 116.2MB\n",
      "  Align sequences within each component\n",
      "biolite.utils.safe_mkdir: creating directory 'alignments'\n",
      "biolite.pipeline.run: \n",
      "  STAGE 3 / multalign.cleanup_alignments / 4.124s / 290.8MB\n",
      "  Clean up aligned sequences with Gblocks\n",
      "biolite.pipeline.run: \n",
      "  STAGE 4 / multalign.parse_alignments / 4.520s / 290.9MB\n",
      "  Parse the cleaned sequences into the database\n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 4.541s / 291.3MB\n",
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/genetree-15'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / genetree.init / 0.351s / 108.9MB\n",
      "  Find alignments in database\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / genetree.genetrees / 0.353s / 109.1MB\n",
      "  Build gene trees from alignments\n",
      "biolite.utils.safe_mkdir: creating directory 'alignments'\n",
      "biolite.utils.safe_mkdir: creating directory 'trees'\n",
      "biolite.pipeline.run: \n",
      "  STAGE 2 / genetree.parse / 9.839s / 127.7MB\n",
      "  Parse the trees into the database. Check for jobs that timed out.\n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 9.841s / 127.8MB\n",
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/treeinform-16'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / treeinform.init / 0.351s / 123.5MB\n",
      "  Determine path to input trees\n",
      "__main__.init: found genetree run 15\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / treeinform.identify_candidate_variants / 0.354s / 123.6MB\n",
      "  Identify candidates variants\n",
      "biolite.pipeline.run: \n",
      "  STAGE 2 / treeinform.reassign_genes / 0.370s / 123.6MB\n",
      "  Reassign candidate variants to the same gene\n",
      "agalma.database.validate_genes: Validating model IDs:\n",
      "\t\t  unique model_id: 132\n",
      "\t\t=    all model_id: 132\n",
      "agalma.database.validate_genes: Validating number of transcripts:\n",
      "\t\t  original assembly: 132\n",
      "\t\t=  revised assembly: 132\n",
      "agalma.database.validate_genes: Validating number of genes:\n",
      "\t\t  original assembly: 62\n",
      "\t\t-        reassigned: 6\n",
      "\t\t+     newly created: 3\n",
      "\t\t=  revised assembly: 59\n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 0.379s / 124.1MB\n",
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/homologize-17'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / homologize.init / 0.108s / 124.1MB\n",
      "  Determine the version of gene entries to use and lookup species data\n",
      "agalma.database.latest_genes_version: using genes version 16\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / homologize.write_fasta / 0.112s / 124.6MB\n",
      "  Write sequences from the Agalma database to a FASTA file\n",
      "biolite.utils.safe_mkdir: creating directory 'blastp'\n",
      "biolite.pipeline.run: \n",
      "  STAGE 2 / homologize.prepare_blast / 0.116s / 125.4MB\n",
      "  Prepare all-by-all BLAST database and command list\n",
      "biolite.utils.safe_mkdir: directory '/home/ljcohen/homologize-17/blastp' already exists\n",
      "biolite.pipeline.run: \n",
      "  STAGE 3 / homologize.run_blast / 0.221s / 157.0MB\n",
      "  Run all-by-all BLAST\n",
      "biolite.pipeline.run: \n",
      "  STAGE 4 / homologize.parse_edges / 1.102s / 157.6MB\n",
      "  Parse BLAST hits into edges weighted by bitscore\n",
      "biolite.pipeline.run: \n",
      "  STAGE 5 / homologize.mcl_cluster / 1.148s / 157.9MB\n",
      "  Run mcl on all-by-all graph to form gene clusters\n",
      "biolite.pipeline.run: \n",
      "  STAGE 6 / homologize.load_mcl_cluster / 1.186s / 157.9MB\n",
      "  Load cluster file from mcl into homology database\n",
      "__main__.load_mcl_cluster: histogram of gene cluster sizes:\n",
      " 2\t:\t3\n",
      " 3\t:\t3\n",
      " 4\t:\t2\n",
      " 5\t:\t1\n",
      " 6\t:\t1\n",
      " 8\t:\t1\n",
      " 12\t:\t1\n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 1.191s / 158.2MB\n",
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/multalign-18'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / multalign.init / 0.149s / 115.6MB\n",
      "  Locate a previous homology or treeprune run\n",
      "__main__.init: using previous 'homologize' run id 17\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / multalign.select_clusters / 0.151s / 115.7MB\n",
      "  \n",
      "\tSelect a cluster for each homologize component that meets size, sequence\n",
      "\tlength, and composition requirements\n",
      "\t\n",
      "biolite.utils.safe_mkdir: creating directory 'clusters'\n",
      "agalma.database.select_homology_models: found the following taxa for homology id 17:\n",
      " Hydra_magnipapillata (NCBI_HYDMAG)\n",
      " Nematostella_vectensis (JGI_NEMVEC)\n",
      " Agalma_elegans (HWI-ST625-73-C0JUVACXX-7)\n",
      " Craseoa_lathetica (SRX288432)\n",
      " Physalia_physalis (SRX288431)\n",
      " Nanomia_bijuga (SRX288430)\n",
      " Agalma_elegans (SRX288285)\n",
      "biolite.pipeline.run: \n",
      "  STAGE 2 / multalign.align_sequences / 0.157s / 116.5MB\n",
      "  Align sequences within each component\n",
      "biolite.utils.safe_mkdir: creating directory 'alignments'\n",
      "biolite.pipeline.run: \n",
      "  STAGE 3 / multalign.cleanup_alignments / 4.011s / 288.1MB\n",
      "  Clean up aligned sequences with Gblocks\n",
      "biolite.pipeline.run: \n",
      "  STAGE 4 / multalign.parse_alignments / 4.458s / 288.1MB\n",
      "  Parse the cleaned sequences into the database\n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 4.470s / 288.5MB\n",
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/genetree-19'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / genetree.init / 0.105s / 108.9MB\n",
      "  Find alignments in database\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / genetree.genetrees / 0.109s / 109.1MB\n",
      "  Build gene trees from alignments\n",
      "biolite.utils.safe_mkdir: creating directory 'alignments'\n",
      "biolite.utils.safe_mkdir: creating directory 'trees'\n",
      "biolite.pipeline.run: \n",
      "  STAGE 2 / genetree.parse / 11.857s / 127.8MB\n",
      "  Parse the trees into the database. Check for jobs that timed out.\n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 11.859s / 127.9MB\n",
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/treeprune-20'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / treeprune.init / 0.476s / 123.5MB\n",
      "  Determine path to input trees\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / treeprune.prune_trees / 0.477s / 123.5MB\n",
      "  Prune each tree using monophyly masking and paralogy pruning\n",
      "biolite.pipeline.run: \n",
      "  STAGE 2 / treeprune.parse_trees / 0.503s / 124.1MB\n",
      "  Parse the tips of each tree to create a cluster in the database\n",
      "__main__.parse_trees: histogram of gene cluster sizes:\n",
      "4\t:\t3\n",
      "5\t:\t2\n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 0.514s / 124.6MB\n",
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/multalign-21'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / multalign.init / 0.062s / 115.4MB\n",
      "  Locate a previous homology or treeprune run\n",
      "__main__.init: using previous 'treeprune' run id 20\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / multalign.select_clusters / 0.066s / 115.5MB\n",
      "  \n",
      "\tSelect a cluster for each homologize component that meets size, sequence\n",
      "\tlength, and composition requirements\n",
      "\t\n",
      "biolite.utils.safe_mkdir: creating directory 'clusters'\n",
      "agalma.database.select_homology_models: found the following taxa for homology id 20:\n",
      " Agalma_elegans (SRX288285)\n",
      " Nematostella_vectensis (JGI_NEMVEC)\n",
      " Agalma_elegans (HWI-ST625-73-C0JUVACXX-7)\n",
      " Craseoa_lathetica (SRX288432)\n",
      " Hydra_magnipapillata (NCBI_HYDMAG)\n",
      " Nanomia_bijuga (SRX288430)\n",
      " Physalia_physalis (SRX288431)\n",
      "biolite.pipeline.run: \n",
      "  STAGE 2 / multalign.align_sequences / 0.074s / 116.2MB\n",
      "  Align sequences within each component\n",
      "biolite.utils.safe_mkdir: creating directory 'alignments'\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "biolite.pipeline.run: \n",
      "  STAGE 3 / multalign.cleanup_alignments / 3.275s / 248.8MB\n",
      "  Clean up aligned sequences with Gblocks\n",
      "biolite.pipeline.run: \n",
      "  STAGE 4 / multalign.parse_alignments / 3.738s / 248.8MB\n",
      "  Parse the cleaned sequences into the database\n",
      "__main__.parse_alignments: dropping sequence Physalia_physalis@64 in cluster 21\n",
      "__main__.parse_alignments: dropping cluster 21\n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 3.753s / 249.1MB\n",
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/supermatrix-22'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / supermatrix.init / 0.127s / 127.5MB\n",
      "  Find alignments in database\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / supermatrix.supermatrix / 0.134s / 128.3MB\n",
      "  Concatenate multiple alignments into a supermatrix\n",
      "biolite.pipeline.run: \n",
      "  STAGE 2 / supermatrix.trim / 0.137s / 128.3MB\n",
      "  Trim the supermatrix to the specified proportion of occupancy\n",
      "__main__.trim: no proportion specified... skipping\n",
      "biolite.pipeline.run: \n",
      "  STAGE 3 / supermatrix.parse / 0.139s / 128.3MB\n",
      "  Store the supermatrix in the database\n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 0.144s / 128.5MB\n",
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/speciestree-23'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / speciestree.init / 0.111s / 124.0MB\n",
      "  Find supermatrix in database\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / speciestree.speciestree / 0.113s / 124.3MB\n",
      "  Build species tree with bootstraps\n",
      "biolite.pipeline.run: \n",
      "  STAGE 2 / speciestree.parse / 2.081s / 155.0MB\n",
      "  Parse the tree into the database\n",
      "__main__.parse: species tree:\n",
      "                                 /---------------------- Craseoa lathetica     \n",
      "                      /----------@                                             \n",
      "                      |          |          /----------- Agalma elegans        \n",
      "           /----------@          \\----------@                                  \n",
      "           |          |                     \\----------- Nanomia bijuga        \n",
      "/----------@          |                                                        \n",
      "|          |          \\--------------------------------- Physalia physalis     \n",
      "@          |                                                                   \n",
      "|          \\-------------------------------------------- Hydra magnipapillata  \n",
      "|                                                                              \n",
      "\\------------------------------------------------------- Nematostella vectensis\n",
      "                                                                               \n",
      "                                                                               \n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 2.087s / 155.2MB\n"
     ]
    }
   ],
   "source": [
    "!agalma multalign --id PhylogenyTest\n",
    "!agalma genetree --id PhylogenyTest\n",
    "!agalma treeinform --id PhylogenyTest\n",
    "!agalma homologize --id PhylogenyTest\n",
    "!agalma multalign --id PhylogenyTest\n",
    "!agalma genetree --id PhylogenyTest\n",
    "!agalma treeprune --id PhylogenyTest\n",
    "!agalma multalign --id PhylogenyTest\n",
    "!agalma supermatrix --id PhylogenyTest\n",
    "!agalma speciestree --id PhylogenyTest --outgroup Nematostella_vectensis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/PhylogenyTest'\n",
      "agalma.agalma_report.report_runs: no catalog entry found for id 'PhylogenyTest'\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/PhylogenyTest/css'\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/PhylogenyTest/img'\n",
      "agalma.agalma_report.report_runs: 12 has pipelines: homologize\n",
      "agalma.agalma_report.report_runs: added homologize report for 12\n",
      "/home/ljcohen/miniconda2/envs/agalma/lib/python2.7/site-packages/matplotlib/axes/_axes.py:545: UserWarning: No labelled objects found. Use label='...' kwarg on individual plots.\n",
      "  warnings.warn(\"No labelled objects found. \"\n",
      "agalma.agalma_report.report_runs: 13 has pipelines: homologize\n",
      "agalma.agalma_report.report_runs: added homologize report for 13\n",
      "agalma.agalma_report.report_runs: 14 has pipelines: multalign\n",
      "agalma.agalma_report.report_runs: added multalign report for 14\n",
      "agalma.agalma_report.report_runs: 15 has pipelines: genetree\n",
      "agalma.agalma_report.report_runs: added genetree report for 15\n",
      "agalma.agalma_report.report_runs: 16 has pipelines: treeinform\n",
      "agalma.agalma_report.report_runs: 17 has pipelines: homologize\n",
      "agalma.agalma_report.report_runs: added homologize report for 17\n",
      "agalma.agalma_report.report_runs: 18 has pipelines: multalign\n",
      "agalma.agalma_report.report_runs: added multalign report for 18\n",
      "agalma.agalma_report.report_runs: 19 has pipelines: genetree\n",
      "agalma.agalma_report.report_runs: added genetree report for 19\n",
      "agalma.agalma_report.report_runs: 20 has pipelines: treeprune\n",
      "agalma.agalma_report.report_runs: added treeprune report for 20\n",
      "agalma.agalma_report.report_runs: 21 has pipelines: multalign\n",
      "agalma.agalma_report.report_runs: added multalign report for 21\n",
      "agalma.agalma_report.report_runs: 22 has pipelines: supermatrix\n",
      "agalma.agalma_report.report_runs: added supermatrix report for 22\n",
      "agalma.agalma_report.report_runs: 23 has pipelines: speciestree\n",
      "agalma.agalma_report.report_runs: added speciestree report for 23\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/PhylogenyTest/js'\n",
      "biolite.utils.safe_mkdir: directory '/home/ljcohen/reports/PhylogenyTest/js' already exists\n",
      "biolite.utils.safe_mkdir: directory '/home/ljcohen/reports/PhylogenyTest/js' already exists\n",
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: directory 'reports/PhylogenyTest' already exists\n",
      "/home/ljcohen/miniconda2/envs/agalma/lib/python2.7/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family [u'Arial'] not found. Falling back to DejaVu Sans\n",
      "  (prop.get_family(), self.defaultFamily[fontext]))\n",
      "biolite.utils.safe_mkdir: directory '/home/ljcohen/reports/PhylogenyTest/css' already exists\n",
      "biolite.utils.safe_mkdir: directory '/home/ljcohen/reports/PhylogenyTest/img' already exists\n",
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: directory 'reports/PhylogenyTest' already exists\n",
      "Saved figure to '/home/ljcohen/reports/PhylogenyTest/PhylogenyTest.pdf'\n"
     ]
    }
   ],
   "source": [
    "!agalma report --id PhylogenyTest --outdir reports/PhylogenyTest\n",
    "!agalma resources --id PhylogenyTest --outdir reports/PhylogenyTest\n",
    "!agalma phylogeny_report --id PhylogenyTest --outdir reports/PhylogenyTest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n",
      "DONE RUN CATALOG_ID               NAME          HOSTNAME                      USERNAME TIMESTAMP                  HID\n",
      "*    1   HWI-ST625-73-C0JUVACXX-7 qc            js-169-78.jetstream-cloud.org ljcohen  2018-03-05T16:03:05.395168    \n",
      "*    2   HWI-ST625-73-C0JUVACXX-7 transcriptome js-169-78.jetstream-cloud.org ljcohen  2018-03-05T16:03:18.557161    \n",
      "*    3   SRX288285                transcriptome js-169-78.jetstream-cloud.org ljcohen  2018-03-05T16:18:02.276951    \n",
      "*    4   SRX288430                transcriptome js-169-78.jetstream-cloud.org ljcohen  2018-03-05T16:23:35.449028    \n",
      "*    5   SRX288431                transcriptome js-169-78.jetstream-cloud.org ljcohen  2018-03-05T16:26:14.802972    \n",
      "*    6   SRX288432                transcriptome js-169-78.jetstream-cloud.org ljcohen  2018-03-05T16:28:08.248150    \n",
      "*    7   JGI_NEMVEC               import        js-169-78.jetstream-cloud.org ljcohen  2018-03-05T16:33:35.147099    \n",
      "*    8   JGI_NEMVEC               translate     js-169-78.jetstream-cloud.org ljcohen  2018-03-05T16:33:37.988784    \n",
      "*    9   JGI_NEMVEC               annotate      js-169-78.jetstream-cloud.org ljcohen  2018-03-05T16:34:33.933559    \n",
      "*    10  NCBI_HYDMAG              import        js-169-78.jetstream-cloud.org ljcohen  2018-03-05T16:35:17.724920    \n",
      "*    11  NCBI_HYDMAG              annotate      js-169-78.jetstream-cloud.org ljcohen  2018-03-05T16:35:47.577219    \n",
      "*    12  PhylogenyTest            homologize    js-169-78.jetstream-cloud.org ljcohen  2018-03-05T16:36:55.574199    \n",
      "*    13  PhylogenyTest            homologize    js-169-78.jetstream-cloud.org ljcohen  2018-03-05T16:37:09.988285    \n",
      "*    14  PhylogenyTest            multalign     js-169-78.jetstream-cloud.org ljcohen  2018-03-05T16:37:52.227894    \n",
      "*    15  PhylogenyTest            genetree      js-169-78.jetstream-cloud.org ljcohen  2018-03-05T16:37:58.123674    \n",
      "*    16  PhylogenyTest            treeinform    js-169-78.jetstream-cloud.org ljcohen  2018-03-05T16:38:09.355908    \n",
      "*    17  PhylogenyTest            homologize    js-169-78.jetstream-cloud.org ljcohen  2018-03-05T16:38:11.042958    \n",
      "*    18  PhylogenyTest            multalign     js-169-78.jetstream-cloud.org ljcohen  2018-03-05T16:38:13.203888    \n",
      "*    19  PhylogenyTest            genetree      js-169-78.jetstream-cloud.org ljcohen  2018-03-05T16:38:18.624530    \n",
      "*    20  PhylogenyTest            treeprune     js-169-78.jetstream-cloud.org ljcohen  2018-03-05T16:38:33.165721    \n",
      "*    21  PhylogenyTest            multalign     js-169-78.jetstream-cloud.org ljcohen  2018-03-05T16:38:34.677996    \n",
      "*    22  PhylogenyTest            supermatrix   js-169-78.jetstream-cloud.org ljcohen  2018-03-05T16:38:39.458883    \n",
      "*    23  PhylogenyTest            speciestree   js-169-78.jetstream-cloud.org ljcohen  2018-03-05T16:38:40.716004    \n"
     ]
    }
   ],
   "source": [
    "!agalma diagnostics list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n",
      "SRX033366 [2018-03-05 21:58:41]\n",
      "/home/ljcohen/SRX033366.fq (1.7 MB)\n",
      "  species: Nanomia bijuga\n",
      "  ncbi_id: 168759\n",
      "  itis_id: 51389\n",
      "  extraction_id: None\n",
      "  library_id: None\n",
      "  library_type: None\n",
      "  individual: specimen-1\n",
      "  treatment: gastrozooids\n",
      "  sequencer: None\n",
      "  seq_center: None\n",
      "  note: None\n",
      "  sample_prep: None\n",
      "biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n",
      "SRX036876 [2018-03-05 21:58:42]\n",
      "/home/ljcohen/SRX036876.fq (1.9 MB)\n",
      "  species: Nanomia bijuga\n",
      "  ncbi_id: 168759\n",
      "  itis_id: 51389\n",
      "  extraction_id: None\n",
      "  library_id: None\n",
      "  library_type: None\n",
      "  individual: specimen-2\n",
      "  treatment: gastrozooids\n",
      "  sequencer: None\n",
      "  seq_center: None\n",
      "  note: None\n",
      "  sample_prep: None\n"
     ]
    }
   ],
   "source": [
    "!cd ~/agalma/data\n",
    "!agalma catalog insert --id SRX033366 --paths SRX033366.fq --species \"Nanomia bijuga\" --ncbi_id 168759 --itis_id 51389 --treatment gastrozooids --individual specimen-1\n",
    "!agalma catalog insert --id SRX036876 --paths SRX036876.fq --species \"Nanomia bijuga\" --ncbi_id 168759 --itis_id 51389 --treatment gastrozooids --individual specimen-2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/qc-40'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / qc.setup_data / 0.146s / 108.7MB\n",
      "  Setup paths to the FASTQ input sequence data\n",
      "biolite.pipeline.setup_data: reading data from paths in catalog\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / qc.fastqc / 0.148s / 108.7MB\n",
      "  Generate FastQC reports for each FASTQ file\n",
      "biolite.pipeline.run: \n",
      "  STAGE 2 / qc.parse / 3.690s / 385.1MB\n",
      "  Parse FastQC reports into the database\n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 3.700s / 386.9MB\n",
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/qc-41'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / qc.setup_data / 0.082s / 108.4MB\n",
      "  Setup paths to the FASTQ input sequence data\n",
      "biolite.pipeline.setup_data: reading data from paths in catalog\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / qc.fastqc / 0.085s / 108.4MB\n",
      "  Generate FastQC reports for each FASTQ file\n",
      "biolite.pipeline.run: \n",
      "  STAGE 2 / qc.parse / 3.579s / 388.6MB\n",
      "  Parse FastQC reports into the database\n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 3.587s / 390.6MB\n",
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/expression-42'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / expression.setup_data / 0.059s / 123.3MB\n",
      "  Setup paths to the FASTQ input sequence data\n",
      "biolite.pipeline.setup_data: reading data from paths in catalog\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / expression.setup_reference / 0.061s / 123.3MB\n",
      "  Locate reference sequences in the Agalma database\n",
      "__main__.setup_reference: using previous 'transcriptome' run id 4\n",
      "biolite.pipeline.run: \n",
      "  STAGE 2 / expression.calculate / 0.068s / 124.2MB\n",
      "  Calculate gene and isoform expression with RSEM\n",
      "biolite.pipeline.run: \n",
      "  STAGE 3 / expression.parse_counts / 1.359s / 186.9MB\n",
      "  Parse gene-level counts into Agalma database\n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 1.363s / 187.1MB\n",
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/expression-43'\n",
      "biolite.pipeline.run: Starting at stage 0\n",
      "biolite.pipeline.run: \n",
      "  STAGE 0 / expression.setup_data / 0.084s / 123.2MB\n",
      "  Setup paths to the FASTQ input sequence data\n",
      "biolite.pipeline.setup_data: reading data from paths in catalog\n",
      "biolite.pipeline.run: \n",
      "  STAGE 1 / expression.setup_reference / 0.085s / 123.2MB\n",
      "  Locate reference sequences in the Agalma database\n",
      "__main__.setup_reference: using previous 'transcriptome' run id 4\n",
      "biolite.pipeline.run: \n",
      "  STAGE 2 / expression.calculate / 0.093s / 124.1MB\n",
      "  Calculate gene and isoform expression with RSEM\n",
      "biolite.pipeline.run: \n",
      "  STAGE 3 / expression.parse_counts / 1.179s / 186.8MB\n",
      "  Parse gene-level counts into Agalma database\n",
      "biolite.pipeline.run: \n",
      "  FINISHED / 1.183s / 187.0MB\n"
     ]
    }
   ],
   "source": [
    "!cd ~/agalma/scratch\n",
    "!agalma qc --id SRX033366\n",
    "!agalma qc --id SRX036876\n",
    "!agalma expression --id SRX033366 SRX288430\n",
    "!agalma expression --id SRX036876 SRX288430"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/SRX033366'\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/SRX033366/css'\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/SRX033366/img'\n",
      "agalma.agalma_report.report_runs: skipping unfinished run 24\n",
      "agalma.agalma_report.report_runs: skipping unfinished run 26\n",
      "agalma.agalma_report.report_runs: skipping unfinished run 28\n",
      "agalma.agalma_report.report_runs: skipping unfinished run 30\n",
      "agalma.agalma_report.report_runs: skipping unfinished run 32\n",
      "agalma.agalma_report.report_runs: skipping unfinished run 34\n",
      "agalma.agalma_report.report_runs: skipping unfinished run 36\n",
      "agalma.agalma_report.report_runs: skipping unfinished run 38\n",
      "agalma.agalma_report.report_runs: 40 has pipelines: qc\n",
      "agalma.agalma_report.report_runs: added qc report for 40\n",
      "agalma.agalma_report.report_runs: 42 has pipelines: expression\n",
      "agalma.agalma_report.report_runs: added expression report for 42\n",
      "biolite.config.parse_env_resources: threads=6\n",
      "biolite.config.parse_env_resources: memory=14441M\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/SRX036876'\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/SRX036876/css'\n",
      "biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/SRX036876/img'\n",
      "agalma.agalma_report.report_runs: skipping unfinished run 25\n",
      "agalma.agalma_report.report_runs: skipping unfinished run 27\n",
      "agalma.agalma_report.report_runs: skipping unfinished run 29\n",
      "agalma.agalma_report.report_runs: skipping unfinished run 31\n",
      "agalma.agalma_report.report_runs: skipping unfinished run 33\n",
      "agalma.agalma_report.report_runs: skipping unfinished run 35\n",
      "agalma.agalma_report.report_runs: skipping unfinished run 37\n",
      "agalma.agalma_report.report_runs: skipping unfinished run 39\n",
      "agalma.agalma_report.report_runs: 41 has pipelines: qc\n",
      "agalma.agalma_report.report_runs: added qc report for 41\n",
      "agalma.agalma_report.report_runs: 43 has pipelines: expression\n",
      "agalma.agalma_report.report_runs: added expression report for 43\n"
     ]
    }
   ],
   "source": [
    "!agalma report --id SRX033366 --outdir reports/SRX033366\n",
    "!agalma report --id SRX036876 --outdir reports/SRX036876"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# To download results to local computer:\n",
    "\n",
    "```\n",
    "scp -r ljcohen@149.165.169.78:/home/ljcohen/reports/ ~/Documents/agalma_tutorial/\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}