@johnsolk
Last active March 6, 2018 16:46
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exploring a phylotranscriptomics workflow with [agalma](https://bitbucket.org/caseywdunn/agalma)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* [Agalma tutorial](https://bitbucket.org/caseywdunn/agalma/src/master/TUTORIAL.md) by the [Dunn lab](http://dunnlab.org/)\n",
"* Followed \"Quick Install - Anaconda Python\" [Installation instructions](https://bitbucket.org/caseywdunn/agalma)\n",
"* Started an `m1.medium` instance (CPU: 6, Mem: 16 GB, Disk: 60 GB) with Ubuntu 16.04 on [Jetstream](https://use.jetstream-cloud.org/application/images)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install [jupyter notebook](https://github.com/ngs-docs/2018-ggg201b/blob/master/lab5-assembly-eval/README.md) on the instance\n",
"\n",
"```\n",
"pip install jupyter\n",
"```\n",
"Then generate a default config file:\n",
"```\n",
"jupyter notebook --generate-config\n",
"```\n",
"Then append the following settings to it. (Note: the hashed password protects the notebook.)\n",
"```\n",
"cat >> ~/.jupyter/jupyter_notebook_config.py <<EOF\n",
"c = get_config()\n",
"c.NotebookApp.ip = '*'\n",
"c.NotebookApp.open_browser = False\n",
"c.NotebookApp.password = u'sha1:5d813e5d59a7:b4e430cf6dbd1aad04838c6e9cf684f4d76e245c'\n",
"c.NotebookApp.port = 8000\n",
"\n",
"EOF\n",
"```\n",
"Now, run it!\n",
"\n",
"```\n",
"jupyter notebook &\n",
"\n",
"```\n",
"(Press `Enter` to return to the command line.)\n",
"\n",
"You can figure out what Web address to connect to this way:\n",
"\n",
"```\n",
"echo http://$(hostname):8000/\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test to make sure the agalma installation worked:\n",
"```\n",
"mkdir ~/tmp\n",
"cd ~/tmp\n",
"agalma test\n",
"```\n",
"[Test ran successfully.](https://gist.github.com/ljcohen/0aa14000bfbfec62fffe4893684e1bb6)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Run the [agalma tutorial](https://bitbucket.org/caseywdunn/agalma/src/master/TUTORIAL.md?fileviewer=file-view-default)\n",
"Download test data:\n",
"```\n",
"cd agalma/data\n",
"agalma testdata\n",
"```\n",
"This prints the names of the downloaded files.\n",
"\n",
"Default threads and memory on the `m1.medium` machine:\n",
"```\n",
"agalma -t 4 -m 14G\n",
"```\n",
"Then proceeded with tutorial steps:"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n",
"HWI-ST625-73-C0JUVACXX-7 [2018-03-05 20:43:28]\n",
"/home/ljcohen/SRX288285_1.fq (17.3 MB)\n",
"/home/ljcohen/SRX288285_2.fq (17.3 MB)\n",
" species: Agalma elegans\n",
" ncbi_id: 316166\n",
" itis_id: 51383\n",
" extraction_id: None\n",
" library_id: None\n",
" library_type: None\n",
" individual: None\n",
" treatment: None\n",
" sequencer: None\n",
" seq_center: None\n",
" note: None\n",
" sample_prep: None\n"
]
}
],
"source": [
"!agalma catalog insert --paths SRX288285_1.fq SRX288285_2.fq --species \"Agalma elegans\" --ncbi_id 316166 --itis_id 51383"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## QC"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/qc-1'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / qc.setup_data / 0.189s / 110.4MB\n",
" Setup paths to the FASTQ input sequence data\n",
"biolite.pipeline.setup_data: reading data from paths in catalog\n",
"biolite.pipeline.run: \n",
" STAGE 1 / qc.fastqc / 0.191s / 110.6MB\n",
" Generate FastQC reports for each FASTQ file\n",
"biolite.pipeline.run: \n",
" STAGE 2 / qc.parse / 8.276s / 543.4MB\n",
" Parse FastQC reports into the database\n",
"biolite.pipeline.run: \n",
" FINISHED / 8.290s / 547.2MB\n"
]
}
],
"source": [
"!cd ~/agalma/scratch\n",
"!agalma qc --id HWI-ST625-73-C0JUVACXX-7"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Transcriptome subset assembly and exemplar contig identification"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-2'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / insert_size.setup_data / -0.552s / 124.0MB\n",
" Setup paths to the FASTQ input sequence data\n",
"biolite.pipeline.setup_data: reading data from paths in catalog\n",
"biolite.pipeline.run: \n",
" STAGE 1 / insert_size.assemble_subset / -0.550s / 124.0MB\n",
" Assemble a subset of high quality reads\n",
"biolite.pipeline.run: \n",
" STAGE 2 / insert_size.estimate_insert / 51.403s / 1515.7MB\n",
" Estimate insert size by mapping the subset against the assembly\n",
"biolite.pipeline.run: \n",
" STAGE 3 / rrna.assemble_subsets / 57.108s / 1516.1MB\n",
" Assemble subsets of increasing numbers of reads\n",
"biolite.pipeline.run: \n",
" STAGE 4 / rrna.blast_transcripts / 182.927s / 1517.6MB\n",
" Blast transcripts against known rRNA database\n",
"biolite.pipeline.run: \n",
" STAGE 5 / rrna.find_exemplars / 183.321s / 1517.7MB\n",
" Parse blast output for exemplar rRNA sequences\n",
"agalma.rrna.find_exemplars: selecting an exemplar for gene target large-mito-rRNA\n",
"agalma.rrna.find_exemplars: large-mito-rRNA not found in the assembly, skipping\n",
"agalma.rrna.find_exemplars: selecting an exemplar for gene target large-nuclear-rRNA\n",
"agalma.rrna.find_exemplars: large-nuclear-rRNA not found in the assembly, skipping\n",
"agalma.rrna.find_exemplars: selecting an exemplar for gene target small-mito-rRNA\n",
"agalma.rrna.find_exemplars: small-mito-rRNA not found in the assembly, skipping\n",
"agalma.rrna.find_exemplars: selecting an exemplar for gene target small-nuclear-rRNA\n",
"agalma.rrna.find_exemplars: small-nuclear-rRNA not found in the assembly, skipping\n",
"biolite.pipeline.run: \n",
" STAGE 6 / rrna.map_reads / 183.344s / 1517.9MB\n",
" Map reads against rRNA exemplars\n",
"agalma.rrna.map_reads: no rRNA exemplars were found... skipping\n",
"biolite.pipeline.run: \n",
" STAGE 7 / rrna.exclude_reads / 183.346s / 1517.9MB\n",
" Exclude pairs where either read maps to an rRNA exemplar\n",
"agalma.rrna.exclude_reads: no rRNA exemplars were found... skipping\n",
"biolite.pipeline.run: \n",
" STAGE 8 / transcriptome.assemble_connector / 183.348s / 1517.9MB\n",
" [connector between \"rrna\" and \"assemble\"]\n",
"biolite.pipeline.run: \n",
" STAGE 9 / assemble.setup_rrna / 183.349s / 1517.9MB\n",
" Retrieve the rRNA exemplars from the database\n",
"agalma.assemble.setup_rrna: no previous rrna run found for id HWI-ST625-73-C0JUVACXX-7\n",
"biolite.pipeline.run: \n",
" STAGE 10 / assemble.filter_data / 183.351s / 1518.0MB\n",
" Filter out low-quality reads\n",
"biolite.pipeline.run: \n",
" STAGE 11 / assemble.assemble / 186.372s / 1518.0MB\n",
" Assemble the filtered reads with Trinity\n",
"biolite.pipeline.run: \n",
" STAGE 12 / assemble.parse_assembly / 258.908s / 1576.5MB\n",
" Parse the assembly into the sequences table\n",
"biolite.pipeline.run: \n",
" STAGE 13 / assemble.remove_vectors / 258.922s / 1577.3MB\n",
" Remove vector contaminants with UniVec\n",
"biolite.utils.safe_mkdir: creating directory 'univec'\n",
"biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-2/univec' already exists\n",
"agalma.assemble.remove_vectors: found 0 vector contaminants\n",
"biolite.pipeline.run: \n",
" STAGE 14 / assemble.remove_rrna / 259.470s / 1577.5MB\n",
" Remove rRNA using curated and exemplar sequences\n",
"biolite.utils.safe_mkdir: creating directory 'rrna'\n",
"biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-2/rrna' already exists\n",
"agalma.assemble.remove_rrna: found 0 ribosomal RNAs\n",
"biolite.pipeline.run: \n",
" STAGE 15 / assemble.estimate_confidence / 260.139s / 1577.5MB\n",
" Estimate coverage and confidence values for each transcript\n",
"biolite.pipeline.run: \n",
" STAGE 16 / assemble.parse_confidence / 268.327s / 1577.5MB\n",
" Parse estimated confidence scores and update database\n",
"biolite.pipeline.run: \n",
" STAGE 17 / transcriptome.write_sequences / 268.331s / 1577.5MB\n",
" Write assembled sequences to FASTA\n",
"biolite.pipeline.run: \n",
" STAGE 18 / translate.identify_orfs / 268.334s / 1577.5MB\n",
" Identify long open reading frames\n",
"biolite.pipeline.run: \n",
" STAGE 19 / translate.annotate_orfs / 268.671s / 1577.5MB\n",
" Blastp protein sequences against SwissProt\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-2/blastp'\n",
"biolite.pipeline.run: \n",
" STAGE 20 / translate.select_orfs / 350.139s / 1577.5MB\n",
" Select the open reading frame with the best evalue\n",
"biolite.pipeline.run: \n",
" FINISHED / 350.153s / 1577.5MB\n"
]
}
],
"source": [
"!agalma transcriptome --id HWI-ST625-73-C0JUVACXX-7"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/HWI-ST625-73-C0JUVACXX-7'\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/HWI-ST625-73-C0JUVACXX-7/css'\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/HWI-ST625-73-C0JUVACXX-7/img'\n",
"agalma.agalma_report.report_runs: 1 has pipelines: qc\n",
"agalma.agalma_report.report_runs: added qc report for 1\n",
"agalma.agalma_report.report_runs: 2 has pipelines: assemble,translate,rrna,transcriptome,insert_size\n",
"agalma.agalma_report.report_runs: added insert_size report for 2\n",
"agalma.agalma_report.report_runs: added rrna report for 2\n",
"agalma.agalma_report.report_runs: added assemble report for 2\n",
"agalma.agalma_report.report_runs: added translate report for 2\n"
]
}
],
"source": [
"!agalma report --id HWI-ST625-73-C0JUVACXX-7 --outdir reports/HWI-ST625-73-C0JUVACXX-7"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: directory 'reports/HWI-ST625-73-C0JUVACXX-7' already exists\n",
"/home/ljcohen/miniconda2/envs/agalma/lib/python2.7/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family [u'Arial'] not found. Falling back to DejaVu Sans\n",
" (prop.get_family(), self.defaultFamily[fontext]))\n",
"biolite.utils.safe_mkdir: directory '/home/ljcohen/reports/HWI-ST625-73-C0JUVACXX-7/css' already exists\n",
"biolite.utils.safe_mkdir: directory '/home/ljcohen/reports/HWI-ST625-73-C0JUVACXX-7/img' already exists\n"
]
}
],
"source": [
"!agalma resources --id HWI-ST625-73-C0JUVACXX-7 --outdir reports/HWI-ST625-73-C0JUVACXX-7"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n",
"SRX288285 [2018-03-05 21:17:25]\n",
"/home/ljcohen/SRX288285_1.fq (17.3 MB)\n",
"/home/ljcohen/SRX288285_2.fq (17.3 MB)\n",
" species: Agalma elegans\n",
" ncbi_id: 316166\n",
" itis_id: None\n",
" extraction_id: None\n",
" library_id: None\n",
" library_type: None\n",
" individual: None\n",
" treatment: None\n",
" sequencer: None\n",
" seq_center: None\n",
" note: None\n",
" sample_prep: None\n",
"biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n",
"SRX288432 [2018-03-05 21:17:26]\n",
"/home/ljcohen/SRX288432_1.fq (2.5 MB)\n",
"/home/ljcohen/SRX288432_2.fq (2.5 MB)\n",
" species: Craseoa lathetica\n",
" ncbi_id: 316205\n",
" itis_id: None\n",
" extraction_id: None\n",
" library_id: None\n",
" library_type: None\n",
" individual: None\n",
" treatment: None\n",
" sequencer: None\n",
" seq_center: None\n",
" note: None\n",
" sample_prep: None\n",
"biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n",
"SRX288431 [2018-03-05 21:17:26]\n",
"/home/ljcohen/SRX288431_1.fq (252 KB)\n",
"/home/ljcohen/SRX288431_2.fq (252 KB)\n",
" species: Physalia physalis\n",
" ncbi_id: 168775\n",
" itis_id: None\n",
" extraction_id: None\n",
" library_id: None\n",
" library_type: None\n",
" individual: None\n",
" treatment: None\n",
" sequencer: None\n",
" seq_center: None\n",
" note: None\n",
" sample_prep: None\n",
"biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n",
"SRX288430 [2018-03-05 21:17:27]\n",
"/home/ljcohen/SRX288430_1.fq (757 KB)\n",
"/home/ljcohen/SRX288430_2.fq (757 KB)\n",
" species: Nanomia bijuga\n",
" ncbi_id: 168759\n",
" itis_id: None\n",
" extraction_id: None\n",
" library_id: None\n",
" library_type: None\n",
" individual: None\n",
" treatment: None\n",
" sequencer: None\n",
" seq_center: None\n",
" note: None\n",
" sample_prep: None\n",
"biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n",
"JGI_NEMVEC [2018-03-05 21:17:28]\n",
"/home/ljcohen/JGI_NEMVEC.fa (22 KB)\n",
" species: Nematostella vectensis\n",
" ncbi_id: 45351\n",
" itis_id: None\n",
" extraction_id: None\n",
" library_id: None\n",
" library_type: None\n",
" individual: None\n",
" treatment: None\n",
" sequencer: None\n",
" seq_center: None\n",
" note: None\n",
" sample_prep: None\n",
"biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n",
"NCBI_HYDMAG [2018-03-05 21:17:29]\n",
"/home/ljcohen/NCBI_HYDMAG.pfa (7 KB)\n",
" species: Hydra magnipapillata\n",
" ncbi_id: 6085\n",
" itis_id: None\n",
" extraction_id: None\n",
" library_id: None\n",
" library_type: None\n",
" individual: None\n",
" treatment: None\n",
" sequencer: None\n",
" seq_center: None\n",
" note: None\n",
" sample_prep: None\n"
]
}
],
"source": [
"!cd ~/agalma/data\n",
"!agalma catalog insert --id SRX288285 --paths SRX288285_1.fq SRX288285_2.fq --species \"Agalma elegans\" --ncbi_id 316166\n",
"!agalma catalog insert --id SRX288432 --paths SRX288432_1.fq SRX288432_2.fq --species \"Craseoa lathetica\" --ncbi_id 316205\n",
"!agalma catalog insert --id SRX288431 --paths SRX288431_1.fq SRX288431_2.fq --species \"Physalia physalis\" --ncbi_id 168775\n",
"!agalma catalog insert --id SRX288430 --paths SRX288430_1.fq SRX288430_2.fq --species \"Nanomia bijuga\" --ncbi_id 168759\n",
"!agalma catalog insert --id JGI_NEMVEC --paths JGI_NEMVEC.fa --species \"Nematostella vectensis\" --ncbi_id 45351\n",
"!agalma catalog insert --id NCBI_HYDMAG --paths NCBI_HYDMAG.pfa --species \"Hydra magnipapillata\" --ncbi_id 6085"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-3'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / insert_size.setup_data / 0.127s / 123.8MB\n",
" Setup paths to the FASTQ input sequence data\n",
"biolite.pipeline.setup_data: reading data from paths in catalog\n",
"biolite.pipeline.run: \n",
" STAGE 1 / insert_size.assemble_subset / 0.130s / 124.0MB\n",
" Assemble a subset of high quality reads\n",
"biolite.pipeline.run: \n",
" STAGE 2 / insert_size.estimate_insert / 68.322s / 1568.6MB\n",
" Estimate insert size by mapping the subset against the assembly\n",
"biolite.pipeline.run: \n",
" STAGE 3 / rrna.assemble_subsets / 74.084s / 1568.9MB\n",
" Assemble subsets of increasing numbers of reads\n",
"biolite.pipeline.run: \n",
" STAGE 4 / rrna.blast_transcripts / 201.843s / 1569.4MB\n",
" Blast transcripts against known rRNA database\n",
"biolite.pipeline.run: \n",
" STAGE 5 / rrna.find_exemplars / 202.224s / 1569.4MB\n",
" Parse blast output for exemplar rRNA sequences\n",
"agalma.rrna.find_exemplars: selecting an exemplar for gene target large-mito-rRNA\n",
"agalma.rrna.find_exemplars: large-mito-rRNA not found in the assembly, skipping\n",
"agalma.rrna.find_exemplars: selecting an exemplar for gene target large-nuclear-rRNA\n",
"agalma.rrna.find_exemplars: large-nuclear-rRNA not found in the assembly, skipping\n",
"agalma.rrna.find_exemplars: selecting an exemplar for gene target small-mito-rRNA\n",
"agalma.rrna.find_exemplars: small-mito-rRNA not found in the assembly, skipping\n",
"agalma.rrna.find_exemplars: selecting an exemplar for gene target small-nuclear-rRNA\n",
"agalma.rrna.find_exemplars: small-nuclear-rRNA not found in the assembly, skipping\n",
"biolite.pipeline.run: \n",
" STAGE 6 / rrna.map_reads / 202.258s / 1569.7MB\n",
" Map reads against rRNA exemplars\n",
"agalma.rrna.map_reads: no rRNA exemplars were found... skipping\n",
"biolite.pipeline.run: \n",
" STAGE 7 / rrna.exclude_reads / 202.260s / 1569.7MB\n",
" Exclude pairs where either read maps to an rRNA exemplar\n",
"agalma.rrna.exclude_reads: no rRNA exemplars were found... skipping\n",
"biolite.pipeline.run: \n",
" STAGE 8 / transcriptome.assemble_connector / 202.263s / 1569.7MB\n",
" [connector between \"rrna\" and \"assemble\"]\n",
"biolite.pipeline.run: \n",
" STAGE 9 / assemble.setup_rrna / 202.264s / 1569.7MB\n",
" Retrieve the rRNA exemplars from the database\n",
"agalma.assemble.setup_rrna: no previous rrna run found for id SRX288285\n",
"biolite.pipeline.run: \n",
" STAGE 10 / assemble.filter_data / 202.267s / 1569.7MB\n",
" Filter out low-quality reads\n",
"biolite.pipeline.run: \n",
" STAGE 11 / assemble.assemble / 205.101s / 1569.8MB\n",
" Assemble the filtered reads with Trinity\n",
"biolite.pipeline.run: \n",
" STAGE 12 / assemble.parse_assembly / 255.274s / 1569.8MB\n",
" Parse the assembly into the sequences table\n",
"biolite.pipeline.run: \n",
" STAGE 13 / assemble.remove_vectors / 255.288s / 1570.7MB\n",
" Remove vector contaminants with UniVec\n",
"biolite.utils.safe_mkdir: creating directory 'univec'\n",
"biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-3/univec' already exists\n",
"agalma.assemble.remove_vectors: found 0 vector contaminants\n",
"biolite.pipeline.run: \n",
" STAGE 14 / assemble.remove_rrna / 255.858s / 1570.9MB\n",
" Remove rRNA using curated and exemplar sequences\n",
"biolite.utils.safe_mkdir: creating directory 'rrna'\n",
"biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-3/rrna' already exists\n",
"agalma.assemble.remove_rrna: found 0 ribosomal RNAs\n",
"biolite.pipeline.run: \n",
" STAGE 15 / assemble.estimate_confidence / 256.476s / 1570.9MB\n",
" Estimate coverage and confidence values for each transcript\n",
"biolite.pipeline.run: \n",
" STAGE 16 / assemble.parse_confidence / 265.440s / 1570.9MB\n",
" Parse estimated confidence scores and update database\n",
"biolite.pipeline.run: \n",
" STAGE 17 / transcriptome.write_sequences / 265.444s / 1570.9MB\n",
" Write assembled sequences to FASTA\n",
"biolite.pipeline.run: \n",
" STAGE 18 / translate.identify_orfs / 265.448s / 1570.9MB\n",
" Identify long open reading frames\n",
"biolite.pipeline.run: \n",
" STAGE 19 / translate.annotate_orfs / 265.697s / 1570.9MB\n",
" Blastp protein sequences against SwissProt\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-3/blastp'\n",
"biolite.pipeline.run: \n",
" STAGE 20 / translate.select_orfs / 332.014s / 1570.9MB\n",
" Select the open reading frame with the best evalue\n",
"biolite.pipeline.run: \n",
" FINISHED / 332.025s / 1570.9MB\n",
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-4'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / insert_size.setup_data / 0.138s / 124.0MB\n",
" Setup paths to the FASTQ input sequence data\n",
"biolite.pipeline.setup_data: reading data from paths in catalog\n",
"biolite.pipeline.run: \n",
" STAGE 1 / insert_size.assemble_subset / 0.141s / 124.2MB\n",
" Assemble a subset of high quality reads\n",
"biolite.pipeline.run: \n",
" STAGE 2 / insert_size.estimate_insert / 13.886s / 719.2MB\n",
" Estimate insert size by mapping the subset against the assembly\n",
"biolite.pipeline.run: \n",
" STAGE 3 / rrna.assemble_subsets / 14.671s / 719.5MB\n",
" Assemble subsets of increasing numbers of reads\n",
"biolite.pipeline.run: \n",
" STAGE 4 / rrna.blast_transcripts / 108.762s / 722.0MB\n",
" Blast transcripts against known rRNA database\n",
"biolite.pipeline.run: \n",
" STAGE 5 / rrna.find_exemplars / 109.173s / 722.0MB\n",
" Parse blast output for exemplar rRNA sequences\n",
"agalma.rrna.find_exemplars: selecting an exemplar for gene target large-mito-rRNA\n",
"agalma.rrna.find_exemplars: large-mito-rRNA not found in the assembly, skipping\n",
"agalma.rrna.find_exemplars: selecting an exemplar for gene target large-nuclear-rRNA\n",
"agalma.rrna.find_exemplars: large-nuclear-rRNA not found in the assembly, skipping\n",
"agalma.rrna.find_exemplars: selecting an exemplar for gene target small-mito-rRNA\n",
"agalma.rrna.find_exemplars: small-mito-rRNA not found in the assembly, skipping\n",
"agalma.rrna.find_exemplars: selecting an exemplar for gene target small-nuclear-rRNA\n",
"agalma.rrna.find_exemplars: small-nuclear-rRNA not found in the assembly, skipping\n",
"biolite.pipeline.run: \n",
" STAGE 6 / rrna.map_reads / 109.213s / 722.4MB\n",
" Map reads against rRNA exemplars\n",
"agalma.rrna.map_reads: no rRNA exemplars were found... skipping\n",
"biolite.pipeline.run: \n",
" STAGE 7 / rrna.exclude_reads / 109.215s / 722.4MB\n",
" Exclude pairs where either read maps to an rRNA exemplar\n",
"agalma.rrna.exclude_reads: no rRNA exemplars were found... skipping\n",
"biolite.pipeline.run: \n",
" STAGE 8 / transcriptome.assemble_connector / 109.218s / 722.4MB\n",
" [connector between \"rrna\" and \"assemble\"]\n",
"biolite.pipeline.run: \n",
" STAGE 9 / assemble.setup_rrna / 109.219s / 722.4MB\n",
" Retrieve the rRNA exemplars from the database\n",
"agalma.assemble.setup_rrna: no previous rrna run found for id SRX288430\n",
"biolite.pipeline.run: \n",
" STAGE 10 / assemble.filter_data / 109.222s / 722.4MB\n",
" Filter out low-quality reads\n",
"biolite.pipeline.run: \n",
" STAGE 11 / assemble.assemble / 109.432s / 722.5MB\n",
" Assemble the filtered reads with Trinity\n",
"biolite.pipeline.run: \n",
" STAGE 12 / assemble.parse_assembly / 121.188s / 722.5MB\n",
" Parse the assembly into the sequences table\n",
"biolite.pipeline.run: \n",
" STAGE 13 / assemble.remove_vectors / 121.197s / 723.5MB\n",
" Remove vector contaminants with UniVec\n",
"biolite.utils.safe_mkdir: creating directory 'univec'\n",
"biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-4/univec' already exists\n",
"agalma.assemble.remove_vectors: found 0 vector contaminants\n",
"biolite.pipeline.run: \n",
" STAGE 14 / assemble.remove_rrna / 121.657s / 723.6MB\n",
" Remove rRNA using curated and exemplar sequences\n",
"biolite.utils.safe_mkdir: creating directory 'rrna'\n",
"biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-4/rrna' already exists\n",
"agalma.assemble.remove_rrna: found 0 ribosomal RNAs\n",
"biolite.pipeline.run: \n",
" STAGE 15 / assemble.estimate_confidence / 122.221s / 723.7MB\n",
" Estimate coverage and confidence values for each transcript\n",
"biolite.pipeline.run: \n",
" STAGE 16 / assemble.parse_confidence / 123.356s / 723.7MB\n",
" Parse estimated confidence scores and update database\n",
"biolite.pipeline.run: \n",
" STAGE 17 / transcriptome.write_sequences / 123.360s / 723.7MB\n",
" Write assembled sequences to FASTA\n",
"biolite.pipeline.run: \n",
" STAGE 18 / translate.identify_orfs / 123.363s / 723.7MB\n",
" Identify long open reading frames\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"biolite.pipeline.run: \n",
" STAGE 19 / translate.annotate_orfs / 123.616s / 723.7MB\n",
" Blastp protein sequences against SwissProt\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-4/blastp'\n",
"biolite.pipeline.run: \n",
" STAGE 20 / translate.select_orfs / 156.639s / 723.7MB\n",
" Select the open reading frame with the best evalue\n",
"biolite.pipeline.run: \n",
" FINISHED / 156.651s / 723.7MB\n",
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-5'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / insert_size.setup_data / 0.307s / 123.9MB\n",
" Setup paths to the FASTQ input sequence data\n",
"biolite.pipeline.setup_data: reading data from paths in catalog\n",
"biolite.pipeline.run: \n",
" STAGE 1 / insert_size.assemble_subset / 0.308s / 124.1MB\n",
" Assemble a subset of high quality reads\n",
"biolite.pipeline.run: \n",
" STAGE 2 / insert_size.estimate_insert / 11.024s / 719.2MB\n",
" Estimate insert size by mapping the subset against the assembly\n",
"biolite.pipeline.run: \n",
" STAGE 3 / rrna.assemble_subsets / 11.707s / 719.6MB\n",
" Assemble subsets of increasing numbers of reads\n",
"biolite.pipeline.run: \n",
" STAGE 4 / rrna.blast_transcripts / 79.731s / 720.1MB\n",
" Blast transcripts against known rRNA database\n",
"biolite.pipeline.run: \n",
" STAGE 5 / rrna.find_exemplars / 80.083s / 720.1MB\n",
" Parse blast output for exemplar rRNA sequences\n",
"agalma.rrna.find_exemplars: selecting an exemplar for gene target large-mito-rRNA\n",
"agalma.rrna.find_exemplars: large-mito-rRNA not found in the assembly, skipping\n",
"agalma.rrna.find_exemplars: selecting an exemplar for gene target large-nuclear-rRNA\n",
"agalma.rrna.find_exemplars: large-nuclear-rRNA not found in the assembly, skipping\n",
"agalma.rrna.find_exemplars: selecting an exemplar for gene target small-mito-rRNA\n",
"agalma.rrna.find_exemplars: small-mito-rRNA not found in the assembly, skipping\n",
"agalma.rrna.find_exemplars: selecting an exemplar for gene target small-nuclear-rRNA\n",
"agalma.rrna.find_exemplars: small-nuclear-rRNA not found in the assembly, skipping\n",
"biolite.pipeline.run: \n",
" STAGE 6 / rrna.map_reads / 80.102s / 720.3MB\n",
" Map reads against rRNA exemplars\n",
"agalma.rrna.map_reads: no rRNA exemplars were found... skipping\n",
"biolite.pipeline.run: \n",
" STAGE 7 / rrna.exclude_reads / 80.104s / 720.3MB\n",
" Exclude pairs where either read maps to an rRNA exemplar\n",
"agalma.rrna.exclude_reads: no rRNA exemplars were found... skipping\n",
"biolite.pipeline.run: \n",
" STAGE 8 / transcriptome.assemble_connector / 80.106s / 720.3MB\n",
" [connector between \"rrna\" and \"assemble\"]\n",
"biolite.pipeline.run: \n",
" STAGE 9 / assemble.setup_rrna / 80.107s / 720.3MB\n",
" Retrieve the rRNA exemplars from the database\n",
"agalma.assemble.setup_rrna: no previous rrna run found for id SRX288431\n",
"biolite.pipeline.run: \n",
" STAGE 10 / assemble.filter_data / 80.109s / 720.3MB\n",
" Filter out low-quality reads\n",
"biolite.pipeline.run: \n",
" STAGE 11 / assemble.assemble / 80.224s / 720.5MB\n",
" Assemble the filtered reads with Trinity\n",
"biolite.pipeline.run: \n",
" STAGE 12 / assemble.parse_assembly / 93.242s / 720.5MB\n",
" Parse the assembly into the sequences table\n",
"biolite.pipeline.run: \n",
" STAGE 13 / assemble.remove_vectors / 93.250s / 721.6MB\n",
" Remove vector contaminants with UniVec\n",
"biolite.utils.safe_mkdir: creating directory 'univec'\n",
"biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-5/univec' already exists\n",
"agalma.assemble.remove_vectors: found 0 vector contaminants\n",
"biolite.pipeline.run: \n",
" STAGE 14 / assemble.remove_rrna / 93.776s / 721.7MB\n",
" Remove rRNA using curated and exemplar sequences\n",
"biolite.utils.safe_mkdir: creating directory 'rrna'\n",
"biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-5/rrna' already exists\n",
"agalma.assemble.remove_rrna: found 0 ribosomal RNAs\n",
"biolite.pipeline.run: \n",
" STAGE 15 / assemble.estimate_confidence / 94.345s / 721.7MB\n",
" Estimate coverage and confidence values for each transcript\n",
"biolite.pipeline.run: \n",
" STAGE 16 / assemble.parse_confidence / 95.076s / 721.7MB\n",
" Parse estimated confidence scores and update database\n",
"biolite.pipeline.run: \n",
" STAGE 17 / transcriptome.write_sequences / 95.079s / 721.7MB\n",
" Write assembled sequences to FASTA\n",
"biolite.pipeline.run: \n",
" STAGE 18 / translate.identify_orfs / 95.083s / 721.7MB\n",
" Identify long open reading frames\n",
"biolite.pipeline.run: \n",
" STAGE 19 / translate.annotate_orfs / 95.258s / 721.7MB\n",
" Blastp protein sequences against SwissProt\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-5/blastp'\n",
"biolite.pipeline.run: \n",
" STAGE 20 / translate.select_orfs / 112.402s / 721.7MB\n",
" Select the open reading frame with the best evalue\n",
"biolite.pipeline.run: \n",
" FINISHED / 112.406s / 721.7MB\n",
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-6'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / insert_size.setup_data / 0.130s / 124.1MB\n",
" Setup paths to the FASTQ input sequence data\n",
"biolite.pipeline.setup_data: reading data from paths in catalog\n",
"biolite.pipeline.run: \n",
" STAGE 1 / insert_size.assemble_subset / 0.132s / 124.2MB\n",
" Assemble a subset of high quality reads\n",
"biolite.pipeline.run: \n",
" STAGE 2 / insert_size.estimate_insert / 24.007s / 734.4MB\n",
" Estimate insert size by mapping the subset against the assembly\n",
"biolite.pipeline.run: \n",
" STAGE 3 / rrna.assemble_subsets / 25.524s / 734.7MB\n",
" Assemble subsets of increasing numbers of reads\n",
"biolite.pipeline.run: \n",
" STAGE 4 / rrna.blast_transcripts / 156.028s / 781.3MB\n",
" Blast transcripts against known rRNA database\n",
"biolite.pipeline.run: \n",
" STAGE 5 / rrna.find_exemplars / 156.463s / 781.3MB\n",
" Parse blast output for exemplar rRNA sequences\n",
"agalma.rrna.find_exemplars: selecting an exemplar for gene target large-mito-rRNA\n",
"agalma.rrna.find_exemplars: large-mito-rRNA not found in the assembly, skipping\n",
"agalma.rrna.find_exemplars: selecting an exemplar for gene target large-nuclear-rRNA\n",
"agalma.rrna.find_exemplars: large-nuclear-rRNA not found in the assembly, skipping\n",
"agalma.rrna.find_exemplars: selecting an exemplar for gene target small-mito-rRNA\n",
"agalma.rrna.find_exemplars: small-mito-rRNA not found in the assembly, skipping\n",
"agalma.rrna.find_exemplars: selecting an exemplar for gene target small-nuclear-rRNA\n",
"agalma.rrna.find_exemplars: small-nuclear-rRNA not found in the assembly, skipping\n",
"biolite.pipeline.run: \n",
" STAGE 6 / rrna.map_reads / 156.515s / 782.0MB\n",
" Map reads against rRNA exemplars\n",
"agalma.rrna.map_reads: no rRNA exemplars were found... skipping\n",
"biolite.pipeline.run: \n",
" STAGE 7 / rrna.exclude_reads / 156.517s / 782.0MB\n",
" Exclude pairs where either read maps to an rRNA exemplar\n",
"agalma.rrna.exclude_reads: no rRNA exemplars were found... skipping\n",
"biolite.pipeline.run: \n",
" STAGE 8 / transcriptome.assemble_connector / 156.519s / 782.0MB\n",
" [connector between \"rrna\" and \"assemble\"]\n",
"biolite.pipeline.run: \n",
" STAGE 9 / assemble.setup_rrna / 156.520s / 782.0MB\n",
" Retrieve the rRNA exemplars from the database\n",
"agalma.assemble.setup_rrna: no previous rrna run found for id SRX288432\n",
"biolite.pipeline.run: \n",
" STAGE 10 / assemble.filter_data / 156.522s / 782.1MB\n",
" Filter out low-quality reads\n",
"biolite.pipeline.run: \n",
" STAGE 11 / assemble.assemble / 156.956s / 782.1MB\n",
" Assemble the filtered reads with Trinity\n",
"biolite.pipeline.run: \n",
" STAGE 12 / assemble.parse_assembly / 177.411s / 782.1MB\n",
" Parse the assembly into the sequences table\n",
"biolite.pipeline.run: \n",
" STAGE 13 / assemble.remove_vectors / 177.421s / 783.0MB\n",
" Remove vector contaminants with UniVec\n",
"biolite.utils.safe_mkdir: creating directory 'univec'\n",
"biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-6/univec' already exists\n",
"agalma.assemble.remove_vectors: found 0 vector contaminants\n",
"biolite.pipeline.run: \n",
" STAGE 14 / assemble.remove_rrna / 177.858s / 783.2MB\n",
" Remove rRNA using curated and exemplar sequences\n",
"biolite.utils.safe_mkdir: creating directory 'rrna'\n",
"biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-6/rrna' already exists\n",
"agalma.assemble.remove_rrna: found 0 ribosomal RNAs\n",
"biolite.pipeline.run: \n",
" STAGE 15 / assemble.estimate_confidence / 178.400s / 783.2MB\n",
" Estimate coverage and confidence values for each transcript\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"biolite.pipeline.run: \n",
" STAGE 16 / assemble.parse_confidence / 180.418s / 783.2MB\n",
" Parse estimated confidence scores and update database\n",
"biolite.pipeline.run: \n",
" STAGE 17 / transcriptome.write_sequences / 180.420s / 783.2MB\n",
" Write assembled sequences to FASTA\n",
"biolite.pipeline.run: \n",
" STAGE 18 / translate.identify_orfs / 180.423s / 783.2MB\n",
" Identify long open reading frames\n",
"biolite.pipeline.run: \n",
" STAGE 19 / translate.annotate_orfs / 180.730s / 783.2MB\n",
" Blastp protein sequences against SwissProt\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-6/blastp'\n",
"biolite.pipeline.run: \n",
" STAGE 20 / translate.select_orfs / 287.948s / 783.2MB\n",
" Select the open reading frame with the best evalue\n",
"biolite.pipeline.run: \n",
" FINISHED / 287.961s / 783.2MB\n"
]
}
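],
"source": [
"# `!cd` runs in its own subshell, so it does not change the directory for the commands below;\n",
"# use `%cd` if the change should persist. The run directories here are created under the home directory.\n",
"!cd ~/agalma/scratch\n",
"# Assemble a transcriptome for each of the four catalog entries.\n",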
"!agalma transcriptome --id SRX288285\n",
"!agalma transcriptome --id SRX288430\n",
"!agalma transcriptome --id SRX288431\n",
"!agalma transcriptome --id SRX288432"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/import-7'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / import.setup_paths / 1.686s / 115.4MB\n",
" Determine the paths to the FASTA files\n",
"__main__.setup_paths: found paths [u'/home/ljcohen/JGI_NEMVEC.fa']\n",
"biolite.pipeline.run: \n",
" STAGE 1 / import.parse_sequences / 1.688s / 115.4MB\n",
" Parse the sequences from the FASTA files\n",
"biolite.pipeline.run: \n",
" FINISHED / 1.699s / 116.7MB\n",
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/translate-8'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / translate.setup_sequences / 0.104s / 127.1MB\n",
" Locate a previous assemble or import run\n",
"__main__.setup_sequences: using previous 'import' run id 7\n",
"biolite.pipeline.run: \n",
" STAGE 1 / translate.identify_orfs / 0.107s / 127.6MB\n",
" Identify long open reading frames\n",
"biolite.pipeline.run: \n",
" STAGE 2 / translate.annotate_orfs / 0.275s / 154.9MB\n",
" Blastp protein sequences against SwissProt\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/translate-8/blastp'\n",
"biolite.pipeline.run: \n",
" STAGE 3 / translate.select_orfs / 54.905s / 309.0MB\n",
" Select the open reading frame with the best evalue\n",
"biolite.pipeline.run: \n",
" FINISHED / 54.917s / 309.2MB\n",
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/annotate-9'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / annotate.setup_sequences / 0.137s / 123.9MB\n",
" Locate a previous import run\n",
"__main__.setup_sequences: using previous 'import' run id 7\n",
"biolite.pipeline.run: \n",
" STAGE 1 / annotate.annotate / 0.140s / 124.6MB\n",
" Blastp protein sequences against SwissProt\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/annotate-9/blastp'\n",
"biolite.pipeline.run: \n",
" STAGE 2 / annotate.parse / 42.673s / 313.9MB\n",
" Parse the annotations into the sequences table\n",
"biolite.pipeline.run: \n",
" FINISHED / 42.692s / 313.9MB\n"
]
}
],
"source": [
"# Import the Nematostella vectensis sequences (FASTA), then translate and annotate them.\n",
"!agalma import --id JGI_NEMVEC\n",
"!agalma translate --id JGI_NEMVEC\n",
"!agalma annotate --id JGI_NEMVEC"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/import-10'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / import.setup_paths / 1.970s / 115.8MB\n",
" Determine the paths to the FASTA files\n",
"__main__.setup_paths: found paths [u'/home/ljcohen/NCBI_HYDMAG.pfa']\n",
"biolite.pipeline.run: \n",
" STAGE 1 / import.parse_sequences / 1.972s / 115.8MB\n",
" Parse the sequences from the FASTA files\n",
"biolite.pipeline.run: \n",
" FINISHED / 1.983s / 117.3MB\n"
]
}
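],
"source": [
"# The Hydra magnipapillata input is already amino-acid sequence (--seq_type aa), so no translate step follows.\n",
"!agalma import --id NCBI_HYDMAG --seq_type aa"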
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/annotate-11'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / annotate.setup_sequences / 0.147s / 123.8MB\n",
" Locate a previous import run\n",
"__main__.setup_sequences: using previous 'import' run id 10\n",
"biolite.pipeline.run: \n",
" STAGE 1 / annotate.annotate / 0.150s / 124.5MB\n",
" Blastp protein sequences against SwissProt\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/annotate-11/blastp'\n",
"biolite.pipeline.run: \n",
" STAGE 2 / annotate.parse / 56.932s / 307.5MB\n",
" Parse the annotations into the sequences table\n",
"biolite.pipeline.run: \n",
" FINISHED / 56.945s / 307.5MB\n"
]
}
],
"source": [
"!agalma annotate --id NCBI_HYDMAG"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/homologize-12'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / homologize.init / 0.315s / 123.7MB\n",
" Determine the version of gene entries to use and lookup species data\n",
"agalma.database.latest_genes_version: using default genes version 0\n",
"biolite.pipeline.run: \n",
" STAGE 1 / homologize.write_fasta / 0.320s / 124.2MB\n",
" Write sequences from the Agalma database to a FASTA file\n",
"biolite.utils.safe_mkdir: creating directory 'blastp'\n",
"biolite.pipeline.run: \n",
" STAGE 2 / homologize.prepare_blast / 0.324s / 125.0MB\n",
" Prepare all-by-all BLAST database and command list\n",
"biolite.utils.safe_mkdir: directory '/home/ljcohen/homologize-12/blastp' already exists\n",
"biolite.pipeline.run: \n",
" STAGE 3 / homologize.run_blast / 0.433s / 156.6MB\n",
" Run all-by-all BLAST\n",
"biolite.pipeline.run: \n",
" STAGE 4 / homologize.parse_edges / 1.317s / 157.1MB\n",
" Parse BLAST hits into edges weighted by bitscore\n",
"biolite.pipeline.run: \n",
" STAGE 5 / homologize.mcl_cluster / 1.363s / 157.4MB\n",
" Run mcl on all-by-all graph to form gene clusters\n",
"biolite.pipeline.run: \n",
" STAGE 6 / homologize.load_mcl_cluster / 1.406s / 157.4MB\n",
" Load cluster file from mcl into homology database\n",
"__main__.load_mcl_cluster: histogram of gene cluster sizes:\n",
" 2\t:\t3\n",
" 3\t:\t3\n",
" 4\t:\t1\n",
" 5\t:\t2\n",
" 7\t:\t1\n",
" 9\t:\t1\n",
" 12\t:\t1\n",
"biolite.pipeline.run: \n",
" FINISHED / 1.412s / 157.7MB\n"
]
}
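],
"source": [
"# All-by-all BLAST followed by mcl clustering to group homologous sequences across the loaded datasets.\n",
"!agalma homologize --id PhylogenyTest"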
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/multalign-14'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / multalign.init / 0.098s / 115.4MB\n",
" Locate a previous homology or treeprune run\n",
"__main__.init: using previous 'homologize' run id 13\n",
"biolite.pipeline.run: \n",
" STAGE 1 / multalign.select_clusters / 0.100s / 115.5MB\n",
" \n",
"\tSelect a cluster for each homologize component that meets size, sequence\n",
"\tlength, and composition requirements\n",
"\t\n",
"biolite.utils.safe_mkdir: creating directory 'clusters'\n",
"agalma.database.select_homology_models: found the following taxa for homology id 13:\n",
" Agalma_elegans (SRX288285)\n",
" Nematostella_vectensis (JGI_NEMVEC)\n",
" Agalma_elegans (HWI-ST625-73-C0JUVACXX-7)\n",
" Craseoa_lathetica (SRX288432)\n",
" Physalia_physalis (SRX288431)\n",
" Nanomia_bijuga (SRX288430)\n",
" Hydra_magnipapillata (NCBI_HYDMAG)\n",
"biolite.pipeline.run: \n",
" STAGE 2 / multalign.align_sequences / 0.107s / 116.2MB\n",
" Align sequences within each component\n",
"biolite.utils.safe_mkdir: creating directory 'alignments'\n",
"biolite.pipeline.run: \n",
" STAGE 3 / multalign.cleanup_alignments / 4.124s / 290.8MB\n",
" Clean up aligned sequences with Gblocks\n",
"biolite.pipeline.run: \n",
" STAGE 4 / multalign.parse_alignments / 4.520s / 290.9MB\n",
" Parse the cleaned sequences into the database\n",
"biolite.pipeline.run: \n",
" FINISHED / 4.541s / 291.3MB\n",
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/genetree-15'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / genetree.init / 0.351s / 108.9MB\n",
" Find alignments in database\n",
"biolite.pipeline.run: \n",
" STAGE 1 / genetree.genetrees / 0.353s / 109.1MB\n",
" Build gene trees from alignments\n",
"biolite.utils.safe_mkdir: creating directory 'alignments'\n",
"biolite.utils.safe_mkdir: creating directory 'trees'\n",
"biolite.pipeline.run: \n",
" STAGE 2 / genetree.parse / 9.839s / 127.7MB\n",
" Parse the trees into the database. Check for jobs that timed out.\n",
"biolite.pipeline.run: \n",
" FINISHED / 9.841s / 127.8MB\n",
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/treeinform-16'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / treeinform.init / 0.351s / 123.5MB\n",
" Determine path to input trees\n",
"__main__.init: found genetree run 15\n",
"biolite.pipeline.run: \n",
" STAGE 1 / treeinform.identify_candidate_variants / 0.354s / 123.6MB\n",
" Identify candidates variants\n",
"biolite.pipeline.run: \n",
" STAGE 2 / treeinform.reassign_genes / 0.370s / 123.6MB\n",
" Reassign candidate variants to the same gene\n",
"agalma.database.validate_genes: Validating model IDs:\n",
"\t\t unique model_id: 132\n",
"\t\t= all model_id: 132\n",
"agalma.database.validate_genes: Validating number of transcripts:\n",
"\t\t original assembly: 132\n",
"\t\t= revised assembly: 132\n",
"agalma.database.validate_genes: Validating number of genes:\n",
"\t\t original assembly: 62\n",
"\t\t- reassigned: 6\n",
"\t\t+ newly created: 3\n",
"\t\t= revised assembly: 59\n",
"biolite.pipeline.run: \n",
" FINISHED / 0.379s / 124.1MB\n",
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/homologize-17'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / homologize.init / 0.108s / 124.1MB\n",
" Determine the version of gene entries to use and lookup species data\n",
"agalma.database.latest_genes_version: using genes version 16\n",
"biolite.pipeline.run: \n",
" STAGE 1 / homologize.write_fasta / 0.112s / 124.6MB\n",
" Write sequences from the Agalma database to a FASTA file\n",
"biolite.utils.safe_mkdir: creating directory 'blastp'\n",
"biolite.pipeline.run: \n",
" STAGE 2 / homologize.prepare_blast / 0.116s / 125.4MB\n",
" Prepare all-by-all BLAST database and command list\n",
"biolite.utils.safe_mkdir: directory '/home/ljcohen/homologize-17/blastp' already exists\n",
"biolite.pipeline.run: \n",
" STAGE 3 / homologize.run_blast / 0.221s / 157.0MB\n",
" Run all-by-all BLAST\n",
"biolite.pipeline.run: \n",
" STAGE 4 / homologize.parse_edges / 1.102s / 157.6MB\n",
" Parse BLAST hits into edges weighted by bitscore\n",
"biolite.pipeline.run: \n",
" STAGE 5 / homologize.mcl_cluster / 1.148s / 157.9MB\n",
" Run mcl on all-by-all graph to form gene clusters\n",
"biolite.pipeline.run: \n",
" STAGE 6 / homologize.load_mcl_cluster / 1.186s / 157.9MB\n",
" Load cluster file from mcl into homology database\n",
"__main__.load_mcl_cluster: histogram of gene cluster sizes:\n",
" 2\t:\t3\n",
" 3\t:\t3\n",
" 4\t:\t2\n",
" 5\t:\t1\n",
" 6\t:\t1\n",
" 8\t:\t1\n",
" 12\t:\t1\n",
"biolite.pipeline.run: \n",
" FINISHED / 1.191s / 158.2MB\n",
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/multalign-18'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / multalign.init / 0.149s / 115.6MB\n",
" Locate a previous homology or treeprune run\n",
"__main__.init: using previous 'homologize' run id 17\n",
"biolite.pipeline.run: \n",
" STAGE 1 / multalign.select_clusters / 0.151s / 115.7MB\n",
" \n",
"\tSelect a cluster for each homologize component that meets size, sequence\n",
"\tlength, and composition requirements\n",
"\t\n",
"biolite.utils.safe_mkdir: creating directory 'clusters'\n",
"agalma.database.select_homology_models: found the following taxa for homology id 17:\n",
" Hydra_magnipapillata (NCBI_HYDMAG)\n",
" Nematostella_vectensis (JGI_NEMVEC)\n",
" Agalma_elegans (HWI-ST625-73-C0JUVACXX-7)\n",
" Craseoa_lathetica (SRX288432)\n",
" Physalia_physalis (SRX288431)\n",
" Nanomia_bijuga (SRX288430)\n",
" Agalma_elegans (SRX288285)\n",
"biolite.pipeline.run: \n",
" STAGE 2 / multalign.align_sequences / 0.157s / 116.5MB\n",
" Align sequences within each component\n",
"biolite.utils.safe_mkdir: creating directory 'alignments'\n",
"biolite.pipeline.run: \n",
" STAGE 3 / multalign.cleanup_alignments / 4.011s / 288.1MB\n",
" Clean up aligned sequences with Gblocks\n",
"biolite.pipeline.run: \n",
" STAGE 4 / multalign.parse_alignments / 4.458s / 288.1MB\n",
" Parse the cleaned sequences into the database\n",
"biolite.pipeline.run: \n",
" FINISHED / 4.470s / 288.5MB\n",
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/genetree-19'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / genetree.init / 0.105s / 108.9MB\n",
" Find alignments in database\n",
"biolite.pipeline.run: \n",
" STAGE 1 / genetree.genetrees / 0.109s / 109.1MB\n",
" Build gene trees from alignments\n",
"biolite.utils.safe_mkdir: creating directory 'alignments'\n",
"biolite.utils.safe_mkdir: creating directory 'trees'\n",
"biolite.pipeline.run: \n",
" STAGE 2 / genetree.parse / 11.857s / 127.8MB\n",
" Parse the trees into the database. Check for jobs that timed out.\n",
"biolite.pipeline.run: \n",
" FINISHED / 11.859s / 127.9MB\n",
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/treeprune-20'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / treeprune.init / 0.476s / 123.5MB\n",
" Determine path to input trees\n",
"biolite.pipeline.run: \n",
" STAGE 1 / treeprune.prune_trees / 0.477s / 123.5MB\n",
" Prune each tree using monophyly masking and paralogy pruning\n",
"biolite.pipeline.run: \n",
" STAGE 2 / treeprune.parse_trees / 0.503s / 124.1MB\n",
" Parse the tips of each tree to create a cluster in the database\n",
"__main__.parse_trees: histogram of gene cluster sizes:\n",
"4\t:\t3\n",
"5\t:\t2\n",
"biolite.pipeline.run: \n",
" FINISHED / 0.514s / 124.6MB\n",
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/multalign-21'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / multalign.init / 0.062s / 115.4MB\n",
" Locate a previous homology or treeprune run\n",
"__main__.init: using previous 'treeprune' run id 20\n",
"biolite.pipeline.run: \n",
" STAGE 1 / multalign.select_clusters / 0.066s / 115.5MB\n",
" \n",
"\tSelect a cluster for each homologize component that meets size, sequence\n",
"\tlength, and composition requirements\n",
"\t\n",
"biolite.utils.safe_mkdir: creating directory 'clusters'\n",
"agalma.database.select_homology_models: found the following taxa for homology id 20:\n",
" Agalma_elegans (SRX288285)\n",
" Nematostella_vectensis (JGI_NEMVEC)\n",
" Agalma_elegans (HWI-ST625-73-C0JUVACXX-7)\n",
" Craseoa_lathetica (SRX288432)\n",
" Hydra_magnipapillata (NCBI_HYDMAG)\n",
" Nanomia_bijuga (SRX288430)\n",
" Physalia_physalis (SRX288431)\n",
"biolite.pipeline.run: \n",
" STAGE 2 / multalign.align_sequences / 0.074s / 116.2MB\n",
" Align sequences within each component\n",
"biolite.utils.safe_mkdir: creating directory 'alignments'\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"biolite.pipeline.run: \n",
" STAGE 3 / multalign.cleanup_alignments / 3.275s / 248.8MB\n",
" Clean up aligned sequences with Gblocks\n",
"biolite.pipeline.run: \n",
" STAGE 4 / multalign.parse_alignments / 3.738s / 248.8MB\n",
" Parse the cleaned sequences into the database\n",
"__main__.parse_alignments: dropping sequence Physalia_physalis@64 in cluster 21\n",
"__main__.parse_alignments: dropping cluster 21\n",
"biolite.pipeline.run: \n",
" FINISHED / 3.753s / 249.1MB\n",
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/supermatrix-22'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / supermatrix.init / 0.127s / 127.5MB\n",
" Find alignments in database\n",
"biolite.pipeline.run: \n",
" STAGE 1 / supermatrix.supermatrix / 0.134s / 128.3MB\n",
" Concatenate multiple alignments into a supermatrix\n",
"biolite.pipeline.run: \n",
" STAGE 2 / supermatrix.trim / 0.137s / 128.3MB\n",
" Trim the supermatrix to the specified proportion of occupancy\n",
"__main__.trim: no proportion specified... skipping\n",
"biolite.pipeline.run: \n",
" STAGE 3 / supermatrix.parse / 0.139s / 128.3MB\n",
" Store the supermatrix in the database\n",
"biolite.pipeline.run: \n",
" FINISHED / 0.144s / 128.5MB\n",
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/speciestree-23'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / speciestree.init / 0.111s / 124.0MB\n",
" Find supermatrix in database\n",
"biolite.pipeline.run: \n",
" STAGE 1 / speciestree.speciestree / 0.113s / 124.3MB\n",
" Build species tree with bootstraps\n",
"biolite.pipeline.run: \n",
" STAGE 2 / speciestree.parse / 2.081s / 155.0MB\n",
" Parse the tree into the database\n",
"__main__.parse: species tree:\n",
" /---------------------- Craseoa lathetica \n",
" /----------@ \n",
" | | /----------- Agalma elegans \n",
" /----------@ \\----------@ \n",
" | | \\----------- Nanomia bijuga \n",
"/----------@ | \n",
"| | \\--------------------------------- Physalia physalis \n",
"@ | \n",
"| \\-------------------------------------------- Hydra magnipapillata \n",
"| \n",
"\\------------------------------------------------------- Nematostella vectensis\n",
" \n",
" \n",
"biolite.pipeline.run: \n",
" FINISHED / 2.087s / 155.2MB\n"
]
}
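],
"source": [
"# First pass: align clusters, build gene trees, and let treeinform reassign candidate variants to the same gene.\n",
"!agalma multalign --id PhylogenyTest\n",
"!agalma genetree --id PhylogenyTest\n",
"!agalma treeinform --id PhylogenyTest\n",
"# Second pass: repeat homology search, alignment, and gene trees with the revised gene assignments.\n",
"!agalma homologize --id PhylogenyTest\n",
"!agalma multalign --id PhylogenyTest\n",
"!agalma genetree --id PhylogenyTest\n",
"# Prune the gene trees, realign the pruned clusters, concatenate them into a supermatrix, and build the species tree.\n",
"!agalma treeprune --id PhylogenyTest\n",
"!agalma multalign --id PhylogenyTest\n",
"!agalma supermatrix --id PhylogenyTest\n",
"!agalma speciestree --id PhylogenyTest --outgroup Nematostella_vectensis"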
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/PhylogenyTest'\n",
"agalma.agalma_report.report_runs: no catalog entry found for id 'PhylogenyTest'\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/PhylogenyTest/css'\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/PhylogenyTest/img'\n",
"agalma.agalma_report.report_runs: 12 has pipelines: homologize\n",
"agalma.agalma_report.report_runs: added homologize report for 12\n",
"/home/ljcohen/miniconda2/envs/agalma/lib/python2.7/site-packages/matplotlib/axes/_axes.py:545: UserWarning: No labelled objects found. Use label='...' kwarg on individual plots.\n",
" warnings.warn(\"No labelled objects found. \"\n",
"agalma.agalma_report.report_runs: 13 has pipelines: homologize\n",
"agalma.agalma_report.report_runs: added homologize report for 13\n",
"agalma.agalma_report.report_runs: 14 has pipelines: multalign\n",
"agalma.agalma_report.report_runs: added multalign report for 14\n",
"agalma.agalma_report.report_runs: 15 has pipelines: genetree\n",
"agalma.agalma_report.report_runs: added genetree report for 15\n",
"agalma.agalma_report.report_runs: 16 has pipelines: treeinform\n",
"agalma.agalma_report.report_runs: 17 has pipelines: homologize\n",
"agalma.agalma_report.report_runs: added homologize report for 17\n",
"agalma.agalma_report.report_runs: 18 has pipelines: multalign\n",
"agalma.agalma_report.report_runs: added multalign report for 18\n",
"agalma.agalma_report.report_runs: 19 has pipelines: genetree\n",
"agalma.agalma_report.report_runs: added genetree report for 19\n",
"agalma.agalma_report.report_runs: 20 has pipelines: treeprune\n",
"agalma.agalma_report.report_runs: added treeprune report for 20\n",
"agalma.agalma_report.report_runs: 21 has pipelines: multalign\n",
"agalma.agalma_report.report_runs: added multalign report for 21\n",
"agalma.agalma_report.report_runs: 22 has pipelines: supermatrix\n",
"agalma.agalma_report.report_runs: added supermatrix report for 22\n",
"agalma.agalma_report.report_runs: 23 has pipelines: speciestree\n",
"agalma.agalma_report.report_runs: added speciestree report for 23\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/PhylogenyTest/js'\n",
"biolite.utils.safe_mkdir: directory '/home/ljcohen/reports/PhylogenyTest/js' already exists\n",
"biolite.utils.safe_mkdir: directory '/home/ljcohen/reports/PhylogenyTest/js' already exists\n",
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: directory 'reports/PhylogenyTest' already exists\n",
"/home/ljcohen/miniconda2/envs/agalma/lib/python2.7/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family [u'Arial'] not found. Falling back to DejaVu Sans\n",
" (prop.get_family(), self.defaultFamily[fontext]))\n",
"biolite.utils.safe_mkdir: directory '/home/ljcohen/reports/PhylogenyTest/css' already exists\n",
"biolite.utils.safe_mkdir: directory '/home/ljcohen/reports/PhylogenyTest/img' already exists\n",
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: directory 'reports/PhylogenyTest' already exists\n",
"Saved figure to '/home/ljcohen/reports/PhylogenyTest/PhylogenyTest.pdf'\n"
]
}
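],
"source": [
"# Write the run report, resource-usage summary, and phylogeny report to one output directory\n",
"# (the relative --outdir is resolved against the notebook's working directory).\n",
"!agalma report --id PhylogenyTest --outdir reports/PhylogenyTest\n",
"!agalma resources --id PhylogenyTest --outdir reports/PhylogenyTest\n",
"!agalma phylogeny_report --id PhylogenyTest --outdir reports/PhylogenyTest"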
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n",
"DONE RUN CATALOG_ID NAME HOSTNAME USERNAME TIMESTAMP HID\n",
"* 1 HWI-ST625-73-C0JUVACXX-7 qc js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:03:05.395168 \n",
"* 2 HWI-ST625-73-C0JUVACXX-7 transcriptome js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:03:18.557161 \n",
"* 3 SRX288285 transcriptome js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:18:02.276951 \n",
"* 4 SRX288430 transcriptome js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:23:35.449028 \n",
"* 5 SRX288431 transcriptome js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:26:14.802972 \n",
"* 6 SRX288432 transcriptome js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:28:08.248150 \n",
"* 7 JGI_NEMVEC import js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:33:35.147099 \n",
"* 8 JGI_NEMVEC translate js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:33:37.988784 \n",
"* 9 JGI_NEMVEC annotate js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:34:33.933559 \n",
"* 10 NCBI_HYDMAG import js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:35:17.724920 \n",
"* 11 NCBI_HYDMAG annotate js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:35:47.577219 \n",
"* 12 PhylogenyTest homologize js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:36:55.574199 \n",
"* 13 PhylogenyTest homologize js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:37:09.988285 \n",
"* 14 PhylogenyTest multalign js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:37:52.227894 \n",
"* 15 PhylogenyTest genetree js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:37:58.123674 \n",
"* 16 PhylogenyTest treeinform js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:38:09.355908 \n",
"* 17 PhylogenyTest homologize js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:38:11.042958 \n",
"* 18 PhylogenyTest multalign js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:38:13.203888 \n",
"* 19 PhylogenyTest genetree js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:38:18.624530 \n",
"* 20 PhylogenyTest treeprune js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:38:33.165721 \n",
"* 21 PhylogenyTest multalign js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:38:34.677996 \n",
"* 22 PhylogenyTest supermatrix js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:38:39.458883 \n",
"* 23 PhylogenyTest speciestree js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:38:40.716004 \n"
]
}
],
"source": [
"# List every pipeline run recorded in the Agalma database, with its run id and catalog id.\n",
"!agalma diagnostics list"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n",
"SRX033366 [2018-03-05 21:58:41]\n",
"/home/ljcohen/SRX033366.fq (1.7 MB)\n",
" species: Nanomia bijuga\n",
" ncbi_id: 168759\n",
" itis_id: 51389\n",
" extraction_id: None\n",
" library_id: None\n",
" library_type: None\n",
" individual: specimen-1\n",
" treatment: gastrozooids\n",
" sequencer: None\n",
" seq_center: None\n",
" note: None\n",
" sample_prep: None\n",
"biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n",
"SRX036876 [2018-03-05 21:58:42]\n",
"/home/ljcohen/SRX036876.fq (1.9 MB)\n",
" species: Nanomia bijuga\n",
" ncbi_id: 168759\n",
" itis_id: 51389\n",
" extraction_id: None\n",
" library_id: None\n",
" library_type: None\n",
" individual: specimen-2\n",
" treatment: gastrozooids\n",
" sequencer: None\n",
" seq_center: None\n",
" note: None\n",
" sample_prep: None\n"
]
}
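],
"source": [
"# `!cd` does not persist past its own line, so --paths below is resolved against the notebook's\n",
"# working directory (the home directory in the recorded output), not ~/agalma/data.\n",
"!cd ~/agalma/data\n",
"# Catalog two Nanomia bijuga gastrozooid samples taken from different individuals.\n",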
"!agalma catalog insert --id SRX033366 --paths SRX033366.fq --species \"Nanomia bijuga\" --ncbi_id 168759 --itis_id 51389 --treatment gastrozooids --individual specimen-1\n",
"!agalma catalog insert --id SRX036876 --paths SRX036876.fq --species \"Nanomia bijuga\" --ncbi_id 168759 --itis_id 51389 --treatment gastrozooids --individual specimen-2"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/qc-40'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / qc.setup_data / 0.146s / 108.7MB\n",
" Setup paths to the FASTQ input sequence data\n",
"biolite.pipeline.setup_data: reading data from paths in catalog\n",
"biolite.pipeline.run: \n",
" STAGE 1 / qc.fastqc / 0.148s / 108.7MB\n",
" Generate FastQC reports for each FASTQ file\n",
"biolite.pipeline.run: \n",
" STAGE 2 / qc.parse / 3.690s / 385.1MB\n",
" Parse FastQC reports into the database\n",
"biolite.pipeline.run: \n",
" FINISHED / 3.700s / 386.9MB\n",
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/qc-41'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / qc.setup_data / 0.082s / 108.4MB\n",
" Setup paths to the FASTQ input sequence data\n",
"biolite.pipeline.setup_data: reading data from paths in catalog\n",
"biolite.pipeline.run: \n",
" STAGE 1 / qc.fastqc / 0.085s / 108.4MB\n",
" Generate FastQC reports for each FASTQ file\n",
"biolite.pipeline.run: \n",
" STAGE 2 / qc.parse / 3.579s / 388.6MB\n",
" Parse FastQC reports into the database\n",
"biolite.pipeline.run: \n",
" FINISHED / 3.587s / 390.6MB\n",
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/expression-42'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / expression.setup_data / 0.059s / 123.3MB\n",
" Setup paths to the FASTQ input sequence data\n",
"biolite.pipeline.setup_data: reading data from paths in catalog\n",
"biolite.pipeline.run: \n",
" STAGE 1 / expression.setup_reference / 0.061s / 123.3MB\n",
" Locate reference sequences in the Agalma database\n",
"__main__.setup_reference: using previous 'transcriptome' run id 4\n",
"biolite.pipeline.run: \n",
" STAGE 2 / expression.calculate / 0.068s / 124.2MB\n",
" Calculate gene and isoform expression with RSEM\n",
"biolite.pipeline.run: \n",
" STAGE 3 / expression.parse_counts / 1.359s / 186.9MB\n",
" Parse gene-level counts into Agalma database\n",
"biolite.pipeline.run: \n",
" FINISHED / 1.363s / 187.1MB\n",
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/expression-43'\n",
"biolite.pipeline.run: Starting at stage 0\n",
"biolite.pipeline.run: \n",
" STAGE 0 / expression.setup_data / 0.084s / 123.2MB\n",
" Setup paths to the FASTQ input sequence data\n",
"biolite.pipeline.setup_data: reading data from paths in catalog\n",
"biolite.pipeline.run: \n",
" STAGE 1 / expression.setup_reference / 0.085s / 123.2MB\n",
" Locate reference sequences in the Agalma database\n",
"__main__.setup_reference: using previous 'transcriptome' run id 4\n",
"biolite.pipeline.run: \n",
" STAGE 2 / expression.calculate / 0.093s / 124.1MB\n",
" Calculate gene and isoform expression with RSEM\n",
"biolite.pipeline.run: \n",
" STAGE 3 / expression.parse_counts / 1.179s / 186.8MB\n",
" Parse gene-level counts into Agalma database\n",
"biolite.pipeline.run: \n",
" FINISHED / 1.183s / 187.0MB\n"
]
}
],
"source": [
"!cd ~/agalma/scratch\n",
"!agalma qc --id SRX033366\n",
"!agalma qc --id SRX036876\n",
"!agalma expression --id SRX033366 SRX288430\n",
"!agalma expression --id SRX036876 SRX288430"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/SRX033366'\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/SRX033366/css'\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/SRX033366/img'\n",
"agalma.agalma_report.report_runs: skipping unfinished run 24\n",
"agalma.agalma_report.report_runs: skipping unfinished run 26\n",
"agalma.agalma_report.report_runs: skipping unfinished run 28\n",
"agalma.agalma_report.report_runs: skipping unfinished run 30\n",
"agalma.agalma_report.report_runs: skipping unfinished run 32\n",
"agalma.agalma_report.report_runs: skipping unfinished run 34\n",
"agalma.agalma_report.report_runs: skipping unfinished run 36\n",
"agalma.agalma_report.report_runs: skipping unfinished run 38\n",
"agalma.agalma_report.report_runs: 40 has pipelines: qc\n",
"agalma.agalma_report.report_runs: added qc report for 40\n",
"agalma.agalma_report.report_runs: 42 has pipelines: expression\n",
"agalma.agalma_report.report_runs: added expression report for 42\n",
"biolite.config.parse_env_resources: threads=6\n",
"biolite.config.parse_env_resources: memory=14441M\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/SRX036876'\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/SRX036876/css'\n",
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/SRX036876/img'\n",
"agalma.agalma_report.report_runs: skipping unfinished run 25\n",
"agalma.agalma_report.report_runs: skipping unfinished run 27\n",
"agalma.agalma_report.report_runs: skipping unfinished run 29\n",
"agalma.agalma_report.report_runs: skipping unfinished run 31\n",
"agalma.agalma_report.report_runs: skipping unfinished run 33\n",
"agalma.agalma_report.report_runs: skipping unfinished run 35\n",
"agalma.agalma_report.report_runs: skipping unfinished run 37\n",
"agalma.agalma_report.report_runs: skipping unfinished run 39\n",
"agalma.agalma_report.report_runs: 41 has pipelines: qc\n",
"agalma.agalma_report.report_runs: added qc report for 41\n",
"agalma.agalma_report.report_runs: 43 has pipelines: expression\n",
"agalma.agalma_report.report_runs: added expression report for 43\n"
]
}
],
"source": [
"!agalma report --id SRX033366 --outdir reports/SRX033366\n",
"!agalma report --id SRX036876 --outdir reports/SRX036876"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# To download results to local computer:\n",
"\n",
"```\n",
"scp -r ljcohen@149.165.169.78:/home/ljcohen/reports/ ~/Documents/agalma_tutorial/\n",
"```"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 2
}