Last active
March 6, 2018 16:46
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Exploring a phylotranscriptomics workflow with [agalma](https://bitbucket.org/caseywdunn/agalma)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"* [Agalma tutorial](https://bitbucket.org/caseywdunn/agalma/src/master/TUTORIAL.md) by the [Dunn lab](http://dunnlab.org/)\n", | |
"* Followed \"Quick Install - Anaconda Python\" [Installation instructions](https://bitbucket.org/caseywdunn/agalma)\n", | |
"* Started an `m1.medium` instance (CPU: 6, Mem: 16 GB, Disk: 60 GB) with Ubuntu 16.04 on [Jetstream](https://use.jetstream-cloud.org/application/images)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Install [jupyter notebook](https://github.com/ngs-docs/2018-ggg201b/blob/master/lab5-assembly-eval/README.md) on the instance\n", | |
"\n", | |
"```\n", | |
"pip install jupyter\n", | |
"```\n", | |
"Then generate a default config file:\n", | |
"```\n", | |
"jupyter notebook --generate-config\n", | |
"```\n", | |
"Then append the following settings to the config file. (Note: the hashed password protects the notebook.)\n", | |
"```\n", | |
"cat >> ~/.jupyter/jupyter_notebook_config.py <<EOF\n", | |
"c = get_config()\n", | |
"c.NotebookApp.ip = '*'\n", | |
"c.NotebookApp.open_browser = False\n", | |
"c.NotebookApp.password = u'sha1:5d813e5d59a7:b4e430cf6dbd1aad04838c6e9cf684f4d76e245c'\n", | |
"c.NotebookApp.port = 8000\n", | |
"\n", | |
"EOF\n", | |
"```\n", | |
"Now, run it!\n", | |
"\n", | |
"```\n", | |
"jupyter notebook &\n", | |
"\n", | |
"```\n", | |
"(Press `Enter` to return to the command line.)\n", | |
"\n", | |
"You can figure out what Web address to connect to this way:\n", | |
"\n", | |
"```\n", | |
"echo http://$(hostname):8000/\n", | |
"```" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Test to make sure the agalma installation worked\n", | |
"```\n", | |
"mkdir ~/tmp\n", | |
"cd ~/tmp\n", | |
"agalma test\n", | |
"```\n", | |
"[Test ran successfully.](https://gist.github.com/ljcohen/0aa14000bfbfec62fffe4893684e1bb6)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Run the [agalma tutorial](https://bitbucket.org/caseywdunn/agalma/src/master/TUTORIAL.md?fileviewer=file-view-default)\n", | |
"Download test data:\n", | |
"```\n", | |
"cd agalma/data\n", | |
"agalma testdata\n", | |
"```\n", | |
"This prints the filenames of the downloaded test data.\n", | |
"\n", | |
"Default threads and memory on the `m1.medium` machine:\n", | |
"```\n", | |
"agalma -t 4 -m 14G\n", | |
"```\n", | |
"Then proceeded with the tutorial steps:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 49, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n", | |
"HWI-ST625-73-C0JUVACXX-7 [2018-03-05 20:43:28]\n", | |
"/home/ljcohen/SRX288285_1.fq (17.3 MB)\n", | |
"/home/ljcohen/SRX288285_2.fq (17.3 MB)\n", | |
" species: Agalma elegans\n", | |
" ncbi_id: 316166\n", | |
" itis_id: 51383\n", | |
" extraction_id: None\n", | |
" library_id: None\n", | |
" library_type: None\n", | |
" individual: None\n", | |
" treatment: None\n", | |
" sequencer: None\n", | |
" seq_center: None\n", | |
" note: None\n", | |
" sample_prep: None\n" | |
] | |
} | |
], | |
"source": [ | |
"!agalma catalog insert --paths SRX288285_1.fq SRX288285_2.fq --species \"Agalma elegans\" --ncbi_id 316166 --itis_id 51383" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## QC" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/qc-1'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / qc.setup_data / 0.189s / 110.4MB\n", | |
" Setup paths to the FASTQ input sequence data\n", | |
"biolite.pipeline.setup_data: reading data from paths in catalog\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / qc.fastqc / 0.191s / 110.6MB\n", | |
" Generate FastQC reports for each FASTQ file\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 2 / qc.parse / 8.276s / 543.4MB\n", | |
" Parse FastQC reports into the database\n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 8.290s / 547.2MB\n" | |
] | |
} | |
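], | |
"source": [ | |
"# Note: `!cd` runs in a subshell, so it does not change the working directory of later commands (the log shows `qc-1` created under the home directory); `%cd` would.\n", | |
"!cd ~/agalma/scratch\n", | |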
"!agalma qc --id HWI-ST625-73-C0JUVACXX-7" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Transcriptome subset assembly and exemplar contig identification" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-2'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / insert_size.setup_data / -0.552s / 124.0MB\n", | |
" Setup paths to the FASTQ input sequence data\n", | |
"biolite.pipeline.setup_data: reading data from paths in catalog\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / insert_size.assemble_subset / -0.550s / 124.0MB\n", | |
" Assemble a subset of high quality reads\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 2 / insert_size.estimate_insert / 51.403s / 1515.7MB\n", | |
" Estimate insert size by mapping the subset against the assembly\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 3 / rrna.assemble_subsets / 57.108s / 1516.1MB\n", | |
" Assemble subsets of increasing numbers of reads\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 4 / rrna.blast_transcripts / 182.927s / 1517.6MB\n", | |
" Blast transcripts against known rRNA database\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 5 / rrna.find_exemplars / 183.321s / 1517.7MB\n", | |
" Parse blast output for exemplar rRNA sequences\n", | |
"agalma.rrna.find_exemplars: selecting an exemplar for gene target large-mito-rRNA\n", | |
"agalma.rrna.find_exemplars: large-mito-rRNA not found in the assembly, skipping\n", | |
"agalma.rrna.find_exemplars: selecting an exemplar for gene target large-nuclear-rRNA\n", | |
"agalma.rrna.find_exemplars: large-nuclear-rRNA not found in the assembly, skipping\n", | |
"agalma.rrna.find_exemplars: selecting an exemplar for gene target small-mito-rRNA\n", | |
"agalma.rrna.find_exemplars: small-mito-rRNA not found in the assembly, skipping\n", | |
"agalma.rrna.find_exemplars: selecting an exemplar for gene target small-nuclear-rRNA\n", | |
"agalma.rrna.find_exemplars: small-nuclear-rRNA not found in the assembly, skipping\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 6 / rrna.map_reads / 183.344s / 1517.9MB\n", | |
" Map reads against rRNA exemplars\n", | |
"agalma.rrna.map_reads: no rRNA exemplars were found... skipping\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 7 / rrna.exclude_reads / 183.346s / 1517.9MB\n", | |
" Exclude pairs where either read maps to an rRNA exemplar\n", | |
"agalma.rrna.exclude_reads: no rRNA exemplars were found... skipping\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 8 / transcriptome.assemble_connector / 183.348s / 1517.9MB\n", | |
" [connector between \"rrna\" and \"assemble\"]\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 9 / assemble.setup_rrna / 183.349s / 1517.9MB\n", | |
" Retrieve the rRNA exemplars from the database\n", | |
"agalma.assemble.setup_rrna: no previous rrna run found for id HWI-ST625-73-C0JUVACXX-7\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 10 / assemble.filter_data / 183.351s / 1518.0MB\n", | |
" Filter out low-quality reads\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 11 / assemble.assemble / 186.372s / 1518.0MB\n", | |
" Assemble the filtered reads with Trinity\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 12 / assemble.parse_assembly / 258.908s / 1576.5MB\n", | |
" Parse the assembly into the sequences table\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 13 / assemble.remove_vectors / 258.922s / 1577.3MB\n", | |
" Remove vector contaminants with UniVec\n", | |
"biolite.utils.safe_mkdir: creating directory 'univec'\n", | |
"biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-2/univec' already exists\n", | |
"agalma.assemble.remove_vectors: found 0 vector contaminants\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 14 / assemble.remove_rrna / 259.470s / 1577.5MB\n", | |
" Remove rRNA using curated and exemplar sequences\n", | |
"biolite.utils.safe_mkdir: creating directory 'rrna'\n", | |
"biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-2/rrna' already exists\n", | |
"agalma.assemble.remove_rrna: found 0 ribosomal RNAs\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 15 / assemble.estimate_confidence / 260.139s / 1577.5MB\n", | |
" Estimate coverage and confidence values for each transcript\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 16 / assemble.parse_confidence / 268.327s / 1577.5MB\n", | |
" Parse estimated confidence scores and update database\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 17 / transcriptome.write_sequences / 268.331s / 1577.5MB\n", | |
" Write assembled sequences to FASTA\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 18 / translate.identify_orfs / 268.334s / 1577.5MB\n", | |
" Identify long open reading frames\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 19 / translate.annotate_orfs / 268.671s / 1577.5MB\n", | |
" Blastp protein sequences against SwissProt\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-2/blastp'\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 20 / translate.select_orfs / 350.139s / 1577.5MB\n", | |
" Select the open reading frame with the best evalue\n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 350.153s / 1577.5MB\n" | |
] | |
} | |
], | |
"source": [ | |
"!agalma transcriptome --id HWI-ST625-73-C0JUVACXX-7" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/HWI-ST625-73-C0JUVACXX-7'\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/HWI-ST625-73-C0JUVACXX-7/css'\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/HWI-ST625-73-C0JUVACXX-7/img'\n", | |
"agalma.agalma_report.report_runs: 1 has pipelines: qc\n", | |
"agalma.agalma_report.report_runs: added qc report for 1\n", | |
"agalma.agalma_report.report_runs: 2 has pipelines: assemble,translate,rrna,transcriptome,insert_size\n", | |
"agalma.agalma_report.report_runs: added insert_size report for 2\n", | |
"agalma.agalma_report.report_runs: added rrna report for 2\n", | |
"agalma.agalma_report.report_runs: added assemble report for 2\n", | |
"agalma.agalma_report.report_runs: added translate report for 2\n" | |
] | |
} | |
], | |
"source": [ | |
"!agalma report --id HWI-ST625-73-C0JUVACXX-7 --outdir reports/HWI-ST625-73-C0JUVACXX-7" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: directory 'reports/HWI-ST625-73-C0JUVACXX-7' already exists\n", | |
"/home/ljcohen/miniconda2/envs/agalma/lib/python2.7/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family [u'Arial'] not found. Falling back to DejaVu Sans\n", | |
" (prop.get_family(), self.defaultFamily[fontext]))\n", | |
"biolite.utils.safe_mkdir: directory '/home/ljcohen/reports/HWI-ST625-73-C0JUVACXX-7/css' already exists\n", | |
"biolite.utils.safe_mkdir: directory '/home/ljcohen/reports/HWI-ST625-73-C0JUVACXX-7/img' already exists\n" | |
] | |
} | |
], | |
"source": [ | |
"!agalma resources --id HWI-ST625-73-C0JUVACXX-7 --outdir reports/HWI-ST625-73-C0JUVACXX-7" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n", | |
"SRX288285 [2018-03-05 21:17:25]\n", | |
"/home/ljcohen/SRX288285_1.fq (17.3 MB)\n", | |
"/home/ljcohen/SRX288285_2.fq (17.3 MB)\n", | |
" species: Agalma elegans\n", | |
" ncbi_id: 316166\n", | |
" itis_id: None\n", | |
" extraction_id: None\n", | |
" library_id: None\n", | |
" library_type: None\n", | |
" individual: None\n", | |
" treatment: None\n", | |
" sequencer: None\n", | |
" seq_center: None\n", | |
" note: None\n", | |
" sample_prep: None\n", | |
"biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n", | |
"SRX288432 [2018-03-05 21:17:26]\n", | |
"/home/ljcohen/SRX288432_1.fq (2.5 MB)\n", | |
"/home/ljcohen/SRX288432_2.fq (2.5 MB)\n", | |
" species: Craseoa lathetica\n", | |
" ncbi_id: 316205\n", | |
" itis_id: None\n", | |
" extraction_id: None\n", | |
" library_id: None\n", | |
" library_type: None\n", | |
" individual: None\n", | |
" treatment: None\n", | |
" sequencer: None\n", | |
" seq_center: None\n", | |
" note: None\n", | |
" sample_prep: None\n", | |
"biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n", | |
"SRX288431 [2018-03-05 21:17:26]\n", | |
"/home/ljcohen/SRX288431_1.fq (252 KB)\n", | |
"/home/ljcohen/SRX288431_2.fq (252 KB)\n", | |
" species: Physalia physalis\n", | |
" ncbi_id: 168775\n", | |
" itis_id: None\n", | |
" extraction_id: None\n", | |
" library_id: None\n", | |
" library_type: None\n", | |
" individual: None\n", | |
" treatment: None\n", | |
" sequencer: None\n", | |
" seq_center: None\n", | |
" note: None\n", | |
" sample_prep: None\n", | |
"biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n", | |
"SRX288430 [2018-03-05 21:17:27]\n", | |
"/home/ljcohen/SRX288430_1.fq (757 KB)\n", | |
"/home/ljcohen/SRX288430_2.fq (757 KB)\n", | |
" species: Nanomia bijuga\n", | |
" ncbi_id: 168759\n", | |
" itis_id: None\n", | |
" extraction_id: None\n", | |
" library_id: None\n", | |
" library_type: None\n", | |
" individual: None\n", | |
" treatment: None\n", | |
" sequencer: None\n", | |
" seq_center: None\n", | |
" note: None\n", | |
" sample_prep: None\n", | |
"biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n", | |
"JGI_NEMVEC [2018-03-05 21:17:28]\n", | |
"/home/ljcohen/JGI_NEMVEC.fa (22 KB)\n", | |
" species: Nematostella vectensis\n", | |
" ncbi_id: 45351\n", | |
" itis_id: None\n", | |
" extraction_id: None\n", | |
" library_id: None\n", | |
" library_type: None\n", | |
" individual: None\n", | |
" treatment: None\n", | |
" sequencer: None\n", | |
" seq_center: None\n", | |
" note: None\n", | |
" sample_prep: None\n", | |
"biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n", | |
"NCBI_HYDMAG [2018-03-05 21:17:29]\n", | |
"/home/ljcohen/NCBI_HYDMAG.pfa (7 KB)\n", | |
" species: Hydra magnipapillata\n", | |
" ncbi_id: 6085\n", | |
" itis_id: None\n", | |
" extraction_id: None\n", | |
" library_id: None\n", | |
" library_type: None\n", | |
" individual: None\n", | |
" treatment: None\n", | |
" sequencer: None\n", | |
" seq_center: None\n", | |
" note: None\n", | |
" sample_prep: None\n" | |
] | |
} | |
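], | |
"source": [ | |
"# Note: `!cd` runs in a subshell and does not persist, so the paths below resolve against the notebook's working directory; `%cd` would change it.\n", | |
"!cd ~/agalma/data\n", | |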
"!agalma catalog insert --id SRX288285 --paths SRX288285_1.fq SRX288285_2.fq --species \"Agalma elegans\" --ncbi_id 316166\n", | |
"!agalma catalog insert --id SRX288432 --paths SRX288432_1.fq SRX288432_2.fq --species \"Craseoa lathetica\" --ncbi_id 316205\n", | |
"!agalma catalog insert --id SRX288431 --paths SRX288431_1.fq SRX288431_2.fq --species \"Physalia physalis\" --ncbi_id 168775\n", | |
"!agalma catalog insert --id SRX288430 --paths SRX288430_1.fq SRX288430_2.fq --species \"Nanomia bijuga\" --ncbi_id 168759\n", | |
"!agalma catalog insert --id JGI_NEMVEC --paths JGI_NEMVEC.fa --species \"Nematostella vectensis\" --ncbi_id 45351\n", | |
"!agalma catalog insert --id NCBI_HYDMAG --paths NCBI_HYDMAG.pfa --species \"Hydra magnipapillata\" --ncbi_id 6085" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-3'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / insert_size.setup_data / 0.127s / 123.8MB\n", | |
" Setup paths to the FASTQ input sequence data\n", | |
"biolite.pipeline.setup_data: reading data from paths in catalog\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / insert_size.assemble_subset / 0.130s / 124.0MB\n", | |
" Assemble a subset of high quality reads\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 2 / insert_size.estimate_insert / 68.322s / 1568.6MB\n", | |
" Estimate insert size by mapping the subset against the assembly\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 3 / rrna.assemble_subsets / 74.084s / 1568.9MB\n", | |
" Assemble subsets of increasing numbers of reads\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 4 / rrna.blast_transcripts / 201.843s / 1569.4MB\n", | |
" Blast transcripts against known rRNA database\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 5 / rrna.find_exemplars / 202.224s / 1569.4MB\n", | |
" Parse blast output for exemplar rRNA sequences\n", | |
"agalma.rrna.find_exemplars: selecting an exemplar for gene target large-mito-rRNA\n", | |
"agalma.rrna.find_exemplars: large-mito-rRNA not found in the assembly, skipping\n", | |
"agalma.rrna.find_exemplars: selecting an exemplar for gene target large-nuclear-rRNA\n", | |
"agalma.rrna.find_exemplars: large-nuclear-rRNA not found in the assembly, skipping\n", | |
"agalma.rrna.find_exemplars: selecting an exemplar for gene target small-mito-rRNA\n", | |
"agalma.rrna.find_exemplars: small-mito-rRNA not found in the assembly, skipping\n", | |
"agalma.rrna.find_exemplars: selecting an exemplar for gene target small-nuclear-rRNA\n", | |
"agalma.rrna.find_exemplars: small-nuclear-rRNA not found in the assembly, skipping\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 6 / rrna.map_reads / 202.258s / 1569.7MB\n", | |
" Map reads against rRNA exemplars\n", | |
"agalma.rrna.map_reads: no rRNA exemplars were found... skipping\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 7 / rrna.exclude_reads / 202.260s / 1569.7MB\n", | |
" Exclude pairs where either read maps to an rRNA exemplar\n", | |
"agalma.rrna.exclude_reads: no rRNA exemplars were found... skipping\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 8 / transcriptome.assemble_connector / 202.263s / 1569.7MB\n", | |
" [connector between \"rrna\" and \"assemble\"]\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 9 / assemble.setup_rrna / 202.264s / 1569.7MB\n", | |
" Retrieve the rRNA exemplars from the database\n", | |
"agalma.assemble.setup_rrna: no previous rrna run found for id SRX288285\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 10 / assemble.filter_data / 202.267s / 1569.7MB\n", | |
" Filter out low-quality reads\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 11 / assemble.assemble / 205.101s / 1569.8MB\n", | |
" Assemble the filtered reads with Trinity\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 12 / assemble.parse_assembly / 255.274s / 1569.8MB\n", | |
" Parse the assembly into the sequences table\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 13 / assemble.remove_vectors / 255.288s / 1570.7MB\n", | |
" Remove vector contaminants with UniVec\n", | |
"biolite.utils.safe_mkdir: creating directory 'univec'\n", | |
"biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-3/univec' already exists\n", | |
"agalma.assemble.remove_vectors: found 0 vector contaminants\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 14 / assemble.remove_rrna / 255.858s / 1570.9MB\n", | |
" Remove rRNA using curated and exemplar sequences\n", | |
"biolite.utils.safe_mkdir: creating directory 'rrna'\n", | |
"biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-3/rrna' already exists\n", | |
"agalma.assemble.remove_rrna: found 0 ribosomal RNAs\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 15 / assemble.estimate_confidence / 256.476s / 1570.9MB\n", | |
" Estimate coverage and confidence values for each transcript\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 16 / assemble.parse_confidence / 265.440s / 1570.9MB\n", | |
" Parse estimated confidence scores and update database\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 17 / transcriptome.write_sequences / 265.444s / 1570.9MB\n", | |
" Write assembled sequences to FASTA\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 18 / translate.identify_orfs / 265.448s / 1570.9MB\n", | |
" Identify long open reading frames\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 19 / translate.annotate_orfs / 265.697s / 1570.9MB\n", | |
" Blastp protein sequences against SwissProt\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-3/blastp'\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 20 / translate.select_orfs / 332.014s / 1570.9MB\n", | |
" Select the open reading frame with the best evalue\n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 332.025s / 1570.9MB\n", | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-4'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / insert_size.setup_data / 0.138s / 124.0MB\n", | |
" Setup paths to the FASTQ input sequence data\n", | |
"biolite.pipeline.setup_data: reading data from paths in catalog\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / insert_size.assemble_subset / 0.141s / 124.2MB\n", | |
" Assemble a subset of high quality reads\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 2 / insert_size.estimate_insert / 13.886s / 719.2MB\n", | |
" Estimate insert size by mapping the subset against the assembly\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 3 / rrna.assemble_subsets / 14.671s / 719.5MB\n", | |
" Assemble subsets of increasing numbers of reads\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 4 / rrna.blast_transcripts / 108.762s / 722.0MB\n", | |
" Blast transcripts against known rRNA database\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 5 / rrna.find_exemplars / 109.173s / 722.0MB\n", | |
" Parse blast output for exemplar rRNA sequences\n", | |
"agalma.rrna.find_exemplars: selecting an exemplar for gene target large-mito-rRNA\n", | |
"agalma.rrna.find_exemplars: large-mito-rRNA not found in the assembly, skipping\n", | |
"agalma.rrna.find_exemplars: selecting an exemplar for gene target large-nuclear-rRNA\n", | |
"agalma.rrna.find_exemplars: large-nuclear-rRNA not found in the assembly, skipping\n", | |
"agalma.rrna.find_exemplars: selecting an exemplar for gene target small-mito-rRNA\n", | |
"agalma.rrna.find_exemplars: small-mito-rRNA not found in the assembly, skipping\n", | |
"agalma.rrna.find_exemplars: selecting an exemplar for gene target small-nuclear-rRNA\n", | |
"agalma.rrna.find_exemplars: small-nuclear-rRNA not found in the assembly, skipping\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 6 / rrna.map_reads / 109.213s / 722.4MB\n", | |
" Map reads against rRNA exemplars\n", | |
"agalma.rrna.map_reads: no rRNA exemplars were found... skipping\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 7 / rrna.exclude_reads / 109.215s / 722.4MB\n", | |
" Exclude pairs where either read maps to an rRNA exemplar\n", | |
"agalma.rrna.exclude_reads: no rRNA exemplars were found... skipping\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 8 / transcriptome.assemble_connector / 109.218s / 722.4MB\n", | |
" [connector between \"rrna\" and \"assemble\"]\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 9 / assemble.setup_rrna / 109.219s / 722.4MB\n", | |
" Retrieve the rRNA exemplars from the database\n", | |
"agalma.assemble.setup_rrna: no previous rrna run found for id SRX288430\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 10 / assemble.filter_data / 109.222s / 722.4MB\n", | |
" Filter out low-quality reads\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 11 / assemble.assemble / 109.432s / 722.5MB\n", | |
" Assemble the filtered reads with Trinity\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 12 / assemble.parse_assembly / 121.188s / 722.5MB\n", | |
" Parse the assembly into the sequences table\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 13 / assemble.remove_vectors / 121.197s / 723.5MB\n", | |
" Remove vector contaminants with UniVec\n", | |
"biolite.utils.safe_mkdir: creating directory 'univec'\n", | |
"biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-4/univec' already exists\n", | |
"agalma.assemble.remove_vectors: found 0 vector contaminants\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 14 / assemble.remove_rrna / 121.657s / 723.6MB\n", | |
" Remove rRNA using curated and exemplar sequences\n", | |
"biolite.utils.safe_mkdir: creating directory 'rrna'\n", | |
"biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-4/rrna' already exists\n", | |
"agalma.assemble.remove_rrna: found 0 ribosomal RNAs\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 15 / assemble.estimate_confidence / 122.221s / 723.7MB\n", | |
" Estimate coverage and confidence values for each transcript\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 16 / assemble.parse_confidence / 123.356s / 723.7MB\n", | |
" Parse estimated confidence scores and update database\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 17 / transcriptome.write_sequences / 123.360s / 723.7MB\n", | |
" Write assembled sequences to FASTA\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 18 / translate.identify_orfs / 123.363s / 723.7MB\n", | |
" Identify long open reading frames\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"biolite.pipeline.run: \n", | |
" STAGE 19 / translate.annotate_orfs / 123.616s / 723.7MB\n", | |
" Blastp protein sequences against SwissProt\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-4/blastp'\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 20 / translate.select_orfs / 156.639s / 723.7MB\n", | |
" Select the open reading frame with the best evalue\n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 156.651s / 723.7MB\n", | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-5'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / insert_size.setup_data / 0.307s / 123.9MB\n", | |
" Setup paths to the FASTQ input sequence data\n", | |
"biolite.pipeline.setup_data: reading data from paths in catalog\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / insert_size.assemble_subset / 0.308s / 124.1MB\n", | |
" Assemble a subset of high quality reads\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 2 / insert_size.estimate_insert / 11.024s / 719.2MB\n", | |
" Estimate insert size by mapping the subset against the assembly\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 3 / rrna.assemble_subsets / 11.707s / 719.6MB\n", | |
" Assemble subsets of increasing numbers of reads\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 4 / rrna.blast_transcripts / 79.731s / 720.1MB\n", | |
" Blast transcripts against known rRNA database\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 5 / rrna.find_exemplars / 80.083s / 720.1MB\n", | |
" Parse blast output for exemplar rRNA sequences\n", | |
"agalma.rrna.find_exemplars: selecting an exemplar for gene target large-mito-rRNA\n", | |
"agalma.rrna.find_exemplars: large-mito-rRNA not found in the assembly, skipping\n", | |
"agalma.rrna.find_exemplars: selecting an exemplar for gene target large-nuclear-rRNA\n", | |
"agalma.rrna.find_exemplars: large-nuclear-rRNA not found in the assembly, skipping\n", | |
"agalma.rrna.find_exemplars: selecting an exemplar for gene target small-mito-rRNA\n", | |
"agalma.rrna.find_exemplars: small-mito-rRNA not found in the assembly, skipping\n", | |
"agalma.rrna.find_exemplars: selecting an exemplar for gene target small-nuclear-rRNA\n", | |
"agalma.rrna.find_exemplars: small-nuclear-rRNA not found in the assembly, skipping\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 6 / rrna.map_reads / 80.102s / 720.3MB\n", | |
" Map reads against rRNA exemplars\n", | |
"agalma.rrna.map_reads: no rRNA exemplars were found... skipping\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 7 / rrna.exclude_reads / 80.104s / 720.3MB\n", | |
" Exclude pairs where either read maps to an rRNA exemplar\n", | |
"agalma.rrna.exclude_reads: no rRNA exemplars were found... skipping\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 8 / transcriptome.assemble_connector / 80.106s / 720.3MB\n", | |
" [connector between \"rrna\" and \"assemble\"]\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 9 / assemble.setup_rrna / 80.107s / 720.3MB\n", | |
" Retrieve the rRNA exemplars from the database\n", | |
"agalma.assemble.setup_rrna: no previous rrna run found for id SRX288431\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 10 / assemble.filter_data / 80.109s / 720.3MB\n", | |
" Filter out low-quality reads\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 11 / assemble.assemble / 80.224s / 720.5MB\n", | |
" Assemble the filtered reads with Trinity\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 12 / assemble.parse_assembly / 93.242s / 720.5MB\n", | |
" Parse the assembly into the sequences table\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 13 / assemble.remove_vectors / 93.250s / 721.6MB\n", | |
" Remove vector contaminants with UniVec\n", | |
"biolite.utils.safe_mkdir: creating directory 'univec'\n", | |
"biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-5/univec' already exists\n", | |
"agalma.assemble.remove_vectors: found 0 vector contaminants\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 14 / assemble.remove_rrna / 93.776s / 721.7MB\n", | |
" Remove rRNA using curated and exemplar sequences\n", | |
"biolite.utils.safe_mkdir: creating directory 'rrna'\n", | |
"biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-5/rrna' already exists\n", | |
"agalma.assemble.remove_rrna: found 0 ribosomal RNAs\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 15 / assemble.estimate_confidence / 94.345s / 721.7MB\n", | |
" Estimate coverage and confidence values for each transcript\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 16 / assemble.parse_confidence / 95.076s / 721.7MB\n", | |
" Parse estimated confidence scores and update database\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 17 / transcriptome.write_sequences / 95.079s / 721.7MB\n", | |
" Write assembled sequences to FASTA\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 18 / translate.identify_orfs / 95.083s / 721.7MB\n", | |
" Identify long open reading frames\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 19 / translate.annotate_orfs / 95.258s / 721.7MB\n", | |
" Blastp protein sequences against SwissProt\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-5/blastp'\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 20 / translate.select_orfs / 112.402s / 721.7MB\n", | |
" Select the open reading frame with the best evalue\n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 112.406s / 721.7MB\n", | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-6'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / insert_size.setup_data / 0.130s / 124.1MB\n", | |
" Setup paths to the FASTQ input sequence data\n", | |
"biolite.pipeline.setup_data: reading data from paths in catalog\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / insert_size.assemble_subset / 0.132s / 124.2MB\n", | |
" Assemble a subset of high quality reads\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 2 / insert_size.estimate_insert / 24.007s / 734.4MB\n", | |
" Estimate insert size by mapping the subset against the assembly\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 3 / rrna.assemble_subsets / 25.524s / 734.7MB\n", | |
" Assemble subsets of increasing numbers of reads\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 4 / rrna.blast_transcripts / 156.028s / 781.3MB\n", | |
" Blast transcripts against known rRNA database\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 5 / rrna.find_exemplars / 156.463s / 781.3MB\n", | |
" Parse blast output for exemplar rRNA sequences\n", | |
"agalma.rrna.find_exemplars: selecting an exemplar for gene target large-mito-rRNA\n", | |
"agalma.rrna.find_exemplars: large-mito-rRNA not found in the assembly, skipping\n", | |
"agalma.rrna.find_exemplars: selecting an exemplar for gene target large-nuclear-rRNA\n", | |
"agalma.rrna.find_exemplars: large-nuclear-rRNA not found in the assembly, skipping\n", | |
"agalma.rrna.find_exemplars: selecting an exemplar for gene target small-mito-rRNA\n", | |
"agalma.rrna.find_exemplars: small-mito-rRNA not found in the assembly, skipping\n", | |
"agalma.rrna.find_exemplars: selecting an exemplar for gene target small-nuclear-rRNA\n", | |
"agalma.rrna.find_exemplars: small-nuclear-rRNA not found in the assembly, skipping\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 6 / rrna.map_reads / 156.515s / 782.0MB\n", | |
" Map reads against rRNA exemplars\n", | |
"agalma.rrna.map_reads: no rRNA exemplars were found... skipping\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 7 / rrna.exclude_reads / 156.517s / 782.0MB\n", | |
" Exclude pairs where either read maps to an rRNA exemplar\n", | |
"agalma.rrna.exclude_reads: no rRNA exemplars were found... skipping\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 8 / transcriptome.assemble_connector / 156.519s / 782.0MB\n", | |
" [connector between \"rrna\" and \"assemble\"]\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 9 / assemble.setup_rrna / 156.520s / 782.0MB\n", | |
" Retrieve the rRNA exemplars from the database\n", | |
"agalma.assemble.setup_rrna: no previous rrna run found for id SRX288432\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 10 / assemble.filter_data / 156.522s / 782.1MB\n", | |
" Filter out low-quality reads\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 11 / assemble.assemble / 156.956s / 782.1MB\n", | |
" Assemble the filtered reads with Trinity\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 12 / assemble.parse_assembly / 177.411s / 782.1MB\n", | |
" Parse the assembly into the sequences table\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 13 / assemble.remove_vectors / 177.421s / 783.0MB\n", | |
" Remove vector contaminants with UniVec\n", | |
"biolite.utils.safe_mkdir: creating directory 'univec'\n", | |
"biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-6/univec' already exists\n", | |
"agalma.assemble.remove_vectors: found 0 vector contaminants\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 14 / assemble.remove_rrna / 177.858s / 783.2MB\n", | |
" Remove rRNA using curated and exemplar sequences\n", | |
"biolite.utils.safe_mkdir: creating directory 'rrna'\n", | |
"biolite.utils.safe_mkdir: directory '/home/ljcohen/transcriptome-6/rrna' already exists\n", | |
"agalma.assemble.remove_rrna: found 0 ribosomal RNAs\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 15 / assemble.estimate_confidence / 178.400s / 783.2MB\n", | |
" Estimate coverage and confidence values for each transcript\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"biolite.pipeline.run: \n", | |
" STAGE 16 / assemble.parse_confidence / 180.418s / 783.2MB\n", | |
" Parse estimated confidence scores and update database\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 17 / transcriptome.write_sequences / 180.420s / 783.2MB\n", | |
" Write assembled sequences to FASTA\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 18 / translate.identify_orfs / 180.423s / 783.2MB\n", | |
" Identify long open reading frames\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 19 / translate.annotate_orfs / 180.730s / 783.2MB\n", | |
" Blastp protein sequences against SwissProt\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/transcriptome-6/blastp'\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 20 / translate.select_orfs / 287.948s / 783.2MB\n", | |
" Select the open reading frame with the best evalue\n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 287.961s / 783.2MB\n" | |
] | |
} | |
], | |
"source": [ | |
"%cd ~/agalma/scratch\n", | |
"!agalma transcriptome --id SRX288285\n", | |
"!agalma transcriptome --id SRX288430\n", | |
"!agalma transcriptome --id SRX288431\n", | |
"!agalma transcriptome --id SRX288432" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/import-7'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / import.setup_paths / 1.686s / 115.4MB\n", | |
" Determine the paths to the FASTA files\n", | |
"__main__.setup_paths: found paths [u'/home/ljcohen/JGI_NEMVEC.fa']\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / import.parse_sequences / 1.688s / 115.4MB\n", | |
" Parse the sequences from the FASTA files\n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 1.699s / 116.7MB\n", | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/translate-8'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / translate.setup_sequences / 0.104s / 127.1MB\n", | |
" Locate a previous assemble or import run\n", | |
"__main__.setup_sequences: using previous 'import' run id 7\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / translate.identify_orfs / 0.107s / 127.6MB\n", | |
" Identify long open reading frames\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 2 / translate.annotate_orfs / 0.275s / 154.9MB\n", | |
" Blastp protein sequences against SwissProt\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/translate-8/blastp'\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 3 / translate.select_orfs / 54.905s / 309.0MB\n", | |
" Select the open reading frame with the best evalue\n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 54.917s / 309.2MB\n", | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/annotate-9'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / annotate.setup_sequences / 0.137s / 123.9MB\n", | |
" Locate a previous import run\n", | |
"__main__.setup_sequences: using previous 'import' run id 7\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / annotate.annotate / 0.140s / 124.6MB\n", | |
" Blastp protein sequences against SwissProt\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/annotate-9/blastp'\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 2 / annotate.parse / 42.673s / 313.9MB\n", | |
" Parse the annotations into the sequences table\n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 42.692s / 313.9MB\n" | |
] | |
} | |
], | |
"source": [ | |
"!agalma import --id JGI_NEMVEC\n", | |
"!agalma translate --id JGI_NEMVEC\n", | |
"!agalma annotate --id JGI_NEMVEC" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 24, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/import-10'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / import.setup_paths / 1.970s / 115.8MB\n", | |
" Determine the paths to the FASTA files\n", | |
"__main__.setup_paths: found paths [u'/home/ljcohen/NCBI_HYDMAG.pfa']\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / import.parse_sequences / 1.972s / 115.8MB\n", | |
" Parse the sequences from the FASTA files\n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 1.983s / 117.3MB\n" | |
] | |
} | |
], | |
"source": [ | |
"!agalma import --id NCBI_HYDMAG --seq_type aa" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 25, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/annotate-11'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / annotate.setup_sequences / 0.147s / 123.8MB\n", | |
" Locate a previous import run\n", | |
"__main__.setup_sequences: using previous 'import' run id 10\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / annotate.annotate / 0.150s / 124.5MB\n", | |
" Blastp protein sequences against SwissProt\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/annotate-11/blastp'\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 2 / annotate.parse / 56.932s / 307.5MB\n", | |
" Parse the annotations into the sequences table\n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 56.945s / 307.5MB\n" | |
] | |
} | |
], | |
"source": [ | |
"!agalma annotate --id NCBI_HYDMAG" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 26, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/homologize-12'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / homologize.init / 0.315s / 123.7MB\n", | |
" Determine the version of gene entries to use and lookup species data\n", | |
"agalma.database.latest_genes_version: using default genes version 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / homologize.write_fasta / 0.320s / 124.2MB\n", | |
" Write sequences from the Agalma database to a FASTA file\n", | |
"biolite.utils.safe_mkdir: creating directory 'blastp'\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 2 / homologize.prepare_blast / 0.324s / 125.0MB\n", | |
" Prepare all-by-all BLAST database and command list\n", | |
"biolite.utils.safe_mkdir: directory '/home/ljcohen/homologize-12/blastp' already exists\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 3 / homologize.run_blast / 0.433s / 156.6MB\n", | |
" Run all-by-all BLAST\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 4 / homologize.parse_edges / 1.317s / 157.1MB\n", | |
" Parse BLAST hits into edges weighted by bitscore\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 5 / homologize.mcl_cluster / 1.363s / 157.4MB\n", | |
" Run mcl on all-by-all graph to form gene clusters\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 6 / homologize.load_mcl_cluster / 1.406s / 157.4MB\n", | |
" Load cluster file from mcl into homology database\n", | |
"__main__.load_mcl_cluster: histogram of gene cluster sizes:\n", | |
" 2\t:\t3\n", | |
" 3\t:\t3\n", | |
" 4\t:\t1\n", | |
" 5\t:\t2\n", | |
" 7\t:\t1\n", | |
" 9\t:\t1\n", | |
" 12\t:\t1\n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 1.412s / 157.7MB\n" | |
] | |
} | |
], | |
"source": [ | |
"!agalma homologize --id PhylogenyTest" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 29, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/multalign-14'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / multalign.init / 0.098s / 115.4MB\n", | |
" Locate a previous homology or treeprune run\n", | |
"__main__.init: using previous 'homologize' run id 13\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / multalign.select_clusters / 0.100s / 115.5MB\n", | |
" \n", | |
"\tSelect a cluster for each homologize component that meets size, sequence\n", | |
"\tlength, and composition requirements\n", | |
"\t\n", | |
"biolite.utils.safe_mkdir: creating directory 'clusters'\n", | |
"agalma.database.select_homology_models: found the following taxa for homology id 13:\n", | |
" Agalma_elegans (SRX288285)\n", | |
" Nematostella_vectensis (JGI_NEMVEC)\n", | |
" Agalma_elegans (HWI-ST625-73-C0JUVACXX-7)\n", | |
" Craseoa_lathetica (SRX288432)\n", | |
" Physalia_physalis (SRX288431)\n", | |
" Nanomia_bijuga (SRX288430)\n", | |
" Hydra_magnipapillata (NCBI_HYDMAG)\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 2 / multalign.align_sequences / 0.107s / 116.2MB\n", | |
" Align sequences within each component\n", | |
"biolite.utils.safe_mkdir: creating directory 'alignments'\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 3 / multalign.cleanup_alignments / 4.124s / 290.8MB\n", | |
" Clean up aligned sequences with Gblocks\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 4 / multalign.parse_alignments / 4.520s / 290.9MB\n", | |
" Parse the cleaned sequences into the database\n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 4.541s / 291.3MB\n", | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/genetree-15'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / genetree.init / 0.351s / 108.9MB\n", | |
" Find alignments in database\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / genetree.genetrees / 0.353s / 109.1MB\n", | |
" Build gene trees from alignments\n", | |
"biolite.utils.safe_mkdir: creating directory 'alignments'\n", | |
"biolite.utils.safe_mkdir: creating directory 'trees'\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 2 / genetree.parse / 9.839s / 127.7MB\n", | |
" Parse the trees into the database. Check for jobs that timed out.\n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 9.841s / 127.8MB\n", | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/treeinform-16'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / treeinform.init / 0.351s / 123.5MB\n", | |
" Determine path to input trees\n", | |
"__main__.init: found genetree run 15\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / treeinform.identify_candidate_variants / 0.354s / 123.6MB\n", | |
" Identify candidates variants\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 2 / treeinform.reassign_genes / 0.370s / 123.6MB\n", | |
" Reassign candidate variants to the same gene\n", | |
"agalma.database.validate_genes: Validating model IDs:\n", | |
"\t\t unique model_id: 132\n", | |
"\t\t= all model_id: 132\n", | |
"agalma.database.validate_genes: Validating number of transcripts:\n", | |
"\t\t original assembly: 132\n", | |
"\t\t= revised assembly: 132\n", | |
"agalma.database.validate_genes: Validating number of genes:\n", | |
"\t\t original assembly: 62\n", | |
"\t\t- reassigned: 6\n", | |
"\t\t+ newly created: 3\n", | |
"\t\t= revised assembly: 59\n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 0.379s / 124.1MB\n", | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/homologize-17'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / homologize.init / 0.108s / 124.1MB\n", | |
" Determine the version of gene entries to use and lookup species data\n", | |
"agalma.database.latest_genes_version: using genes version 16\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / homologize.write_fasta / 0.112s / 124.6MB\n", | |
" Write sequences from the Agalma database to a FASTA file\n", | |
"biolite.utils.safe_mkdir: creating directory 'blastp'\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 2 / homologize.prepare_blast / 0.116s / 125.4MB\n", | |
" Prepare all-by-all BLAST database and command list\n", | |
"biolite.utils.safe_mkdir: directory '/home/ljcohen/homologize-17/blastp' already exists\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 3 / homologize.run_blast / 0.221s / 157.0MB\n", | |
" Run all-by-all BLAST\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 4 / homologize.parse_edges / 1.102s / 157.6MB\n", | |
" Parse BLAST hits into edges weighted by bitscore\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 5 / homologize.mcl_cluster / 1.148s / 157.9MB\n", | |
" Run mcl on all-by-all graph to form gene clusters\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 6 / homologize.load_mcl_cluster / 1.186s / 157.9MB\n", | |
" Load cluster file from mcl into homology database\n", | |
"__main__.load_mcl_cluster: histogram of gene cluster sizes:\n", | |
" 2\t:\t3\n", | |
" 3\t:\t3\n", | |
" 4\t:\t2\n", | |
" 5\t:\t1\n", | |
" 6\t:\t1\n", | |
" 8\t:\t1\n", | |
" 12\t:\t1\n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 1.191s / 158.2MB\n", | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/multalign-18'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / multalign.init / 0.149s / 115.6MB\n", | |
" Locate a previous homology or treeprune run\n", | |
"__main__.init: using previous 'homologize' run id 17\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / multalign.select_clusters / 0.151s / 115.7MB\n", | |
" \n", | |
"\tSelect a cluster for each homologize component that meets size, sequence\n", | |
"\tlength, and composition requirements\n", | |
"\t\n", | |
"biolite.utils.safe_mkdir: creating directory 'clusters'\n", | |
"agalma.database.select_homology_models: found the following taxa for homology id 17:\n", | |
" Hydra_magnipapillata (NCBI_HYDMAG)\n", | |
" Nematostella_vectensis (JGI_NEMVEC)\n", | |
" Agalma_elegans (HWI-ST625-73-C0JUVACXX-7)\n", | |
" Craseoa_lathetica (SRX288432)\n", | |
" Physalia_physalis (SRX288431)\n", | |
" Nanomia_bijuga (SRX288430)\n", | |
" Agalma_elegans (SRX288285)\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 2 / multalign.align_sequences / 0.157s / 116.5MB\n", | |
" Align sequences within each component\n", | |
"biolite.utils.safe_mkdir: creating directory 'alignments'\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 3 / multalign.cleanup_alignments / 4.011s / 288.1MB\n", | |
" Clean up aligned sequences with Gblocks\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 4 / multalign.parse_alignments / 4.458s / 288.1MB\n", | |
" Parse the cleaned sequences into the database\n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 4.470s / 288.5MB\n", | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/genetree-19'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / genetree.init / 0.105s / 108.9MB\n", | |
" Find alignments in database\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / genetree.genetrees / 0.109s / 109.1MB\n", | |
" Build gene trees from alignments\n", | |
"biolite.utils.safe_mkdir: creating directory 'alignments'\n", | |
"biolite.utils.safe_mkdir: creating directory 'trees'\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 2 / genetree.parse / 11.857s / 127.8MB\n", | |
" Parse the trees into the database. Check for jobs that timed out.\n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 11.859s / 127.9MB\n", | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/treeprune-20'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / treeprune.init / 0.476s / 123.5MB\n", | |
" Determine path to input trees\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / treeprune.prune_trees / 0.477s / 123.5MB\n", | |
" Prune each tree using monophyly masking and paralogy pruning\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 2 / treeprune.parse_trees / 0.503s / 124.1MB\n", | |
" Parse the tips of each tree to create a cluster in the database\n", | |
"__main__.parse_trees: histogram of gene cluster sizes:\n", | |
"4\t:\t3\n", | |
"5\t:\t2\n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 0.514s / 124.6MB\n", | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/multalign-21'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / multalign.init / 0.062s / 115.4MB\n", | |
" Locate a previous homology or treeprune run\n", | |
"__main__.init: using previous 'treeprune' run id 20\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / multalign.select_clusters / 0.066s / 115.5MB\n", | |
" \n", | |
"\tSelect a cluster for each homologize component that meets size, sequence\n", | |
"\tlength, and composition requirements\n", | |
"\t\n", | |
"biolite.utils.safe_mkdir: creating directory 'clusters'\n", | |
"agalma.database.select_homology_models: found the following taxa for homology id 20:\n", | |
" Agalma_elegans (SRX288285)\n", | |
" Nematostella_vectensis (JGI_NEMVEC)\n", | |
" Agalma_elegans (HWI-ST625-73-C0JUVACXX-7)\n", | |
" Craseoa_lathetica (SRX288432)\n", | |
" Hydra_magnipapillata (NCBI_HYDMAG)\n", | |
" Nanomia_bijuga (SRX288430)\n", | |
" Physalia_physalis (SRX288431)\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 2 / multalign.align_sequences / 0.074s / 116.2MB\n", | |
" Align sequences within each component\n", | |
"biolite.utils.safe_mkdir: creating directory 'alignments'\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"biolite.pipeline.run: \n", | |
" STAGE 3 / multalign.cleanup_alignments / 3.275s / 248.8MB\n", | |
" Clean up aligned sequences with Gblocks\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 4 / multalign.parse_alignments / 3.738s / 248.8MB\n", | |
" Parse the cleaned sequences into the database\n", | |
"__main__.parse_alignments: dropping sequence Physalia_physalis@64 in cluster 21\n", | |
"__main__.parse_alignments: dropping cluster 21\n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 3.753s / 249.1MB\n", | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/supermatrix-22'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / supermatrix.init / 0.127s / 127.5MB\n", | |
" Find alignments in database\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / supermatrix.supermatrix / 0.134s / 128.3MB\n", | |
" Concatenate multiple alignments into a supermatrix\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 2 / supermatrix.trim / 0.137s / 128.3MB\n", | |
" Trim the supermatrix to the specified proportion of occupancy\n", | |
"__main__.trim: no proportion specified... skipping\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 3 / supermatrix.parse / 0.139s / 128.3MB\n", | |
" Store the supermatrix in the database\n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 0.144s / 128.5MB\n", | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/speciestree-23'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / speciestree.init / 0.111s / 124.0MB\n", | |
" Find supermatrix in database\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / speciestree.speciestree / 0.113s / 124.3MB\n", | |
" Build species tree with bootstraps\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 2 / speciestree.parse / 2.081s / 155.0MB\n", | |
" Parse the tree into the database\n", | |
"__main__.parse: species tree:\n", | |
" /---------------------- Craseoa lathetica \n", | |
" /----------@ \n", | |
" | | /----------- Agalma elegans \n", | |
" /----------@ \\----------@ \n", | |
" | | \\----------- Nanomia bijuga \n", | |
"/----------@ | \n", | |
"| | \\--------------------------------- Physalia physalis \n", | |
"@ | \n", | |
"| \\-------------------------------------------- Hydra magnipapillata \n", | |
"| \n", | |
"\\------------------------------------------------------- Nematostella vectensis\n", | |
" \n", | |
" \n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 2.087s / 155.2MB\n" | |
] | |
} | |
], | |
"source": [ | |
"!agalma multalign --id PhylogenyTest\n", | |
"!agalma genetree --id PhylogenyTest\n", | |
"!agalma treeinform --id PhylogenyTest\n", | |
"!agalma homologize --id PhylogenyTest\n", | |
"!agalma multalign --id PhylogenyTest\n", | |
"!agalma genetree --id PhylogenyTest\n", | |
"!agalma treeprune --id PhylogenyTest\n", | |
"!agalma multalign --id PhylogenyTest\n", | |
"!agalma supermatrix --id PhylogenyTest\n", | |
"!agalma speciestree --id PhylogenyTest --outgroup Nematostella_vectensis" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 30, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/PhylogenyTest'\n", | |
"agalma.agalma_report.report_runs: no catalog entry found for id 'PhylogenyTest'\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/PhylogenyTest/css'\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/PhylogenyTest/img'\n", | |
"agalma.agalma_report.report_runs: 12 has pipelines: homologize\n", | |
"agalma.agalma_report.report_runs: added homologize report for 12\n", | |
"/home/ljcohen/miniconda2/envs/agalma/lib/python2.7/site-packages/matplotlib/axes/_axes.py:545: UserWarning: No labelled objects found. Use label='...' kwarg on individual plots.\n", | |
" warnings.warn(\"No labelled objects found. \"\n", | |
"agalma.agalma_report.report_runs: 13 has pipelines: homologize\n", | |
"agalma.agalma_report.report_runs: added homologize report for 13\n", | |
"agalma.agalma_report.report_runs: 14 has pipelines: multalign\n", | |
"agalma.agalma_report.report_runs: added multalign report for 14\n", | |
"agalma.agalma_report.report_runs: 15 has pipelines: genetree\n", | |
"agalma.agalma_report.report_runs: added genetree report for 15\n", | |
"agalma.agalma_report.report_runs: 16 has pipelines: treeinform\n", | |
"agalma.agalma_report.report_runs: 17 has pipelines: homologize\n", | |
"agalma.agalma_report.report_runs: added homologize report for 17\n", | |
"agalma.agalma_report.report_runs: 18 has pipelines: multalign\n", | |
"agalma.agalma_report.report_runs: added multalign report for 18\n", | |
"agalma.agalma_report.report_runs: 19 has pipelines: genetree\n", | |
"agalma.agalma_report.report_runs: added genetree report for 19\n", | |
"agalma.agalma_report.report_runs: 20 has pipelines: treeprune\n", | |
"agalma.agalma_report.report_runs: added treeprune report for 20\n", | |
"agalma.agalma_report.report_runs: 21 has pipelines: multalign\n", | |
"agalma.agalma_report.report_runs: added multalign report for 21\n", | |
"agalma.agalma_report.report_runs: 22 has pipelines: supermatrix\n", | |
"agalma.agalma_report.report_runs: added supermatrix report for 22\n", | |
"agalma.agalma_report.report_runs: 23 has pipelines: speciestree\n", | |
"agalma.agalma_report.report_runs: added speciestree report for 23\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/PhylogenyTest/js'\n", | |
"biolite.utils.safe_mkdir: directory '/home/ljcohen/reports/PhylogenyTest/js' already exists\n", | |
"biolite.utils.safe_mkdir: directory '/home/ljcohen/reports/PhylogenyTest/js' already exists\n", | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: directory 'reports/PhylogenyTest' already exists\n", | |
"/home/ljcohen/miniconda2/envs/agalma/lib/python2.7/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family [u'Arial'] not found. Falling back to DejaVu Sans\n", | |
" (prop.get_family(), self.defaultFamily[fontext]))\n", | |
"biolite.utils.safe_mkdir: directory '/home/ljcohen/reports/PhylogenyTest/css' already exists\n", | |
"biolite.utils.safe_mkdir: directory '/home/ljcohen/reports/PhylogenyTest/img' already exists\n", | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: directory 'reports/PhylogenyTest' already exists\n", | |
"Saved figure to '/home/ljcohen/reports/PhylogenyTest/PhylogenyTest.pdf'\n" | |
] | |
} | |
], | |
"source": [ | |
"!agalma report --id PhylogenyTest --outdir reports/PhylogenyTest\n", | |
"!agalma resources --id PhylogenyTest --outdir reports/PhylogenyTest\n", | |
"!agalma phylogeny_report --id PhylogenyTest --outdir reports/PhylogenyTest" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 31, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n", | |
"DONE RUN CATALOG_ID NAME HOSTNAME USERNAME TIMESTAMP HID\n", | |
"* 1 HWI-ST625-73-C0JUVACXX-7 qc js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:03:05.395168 \n", | |
"* 2 HWI-ST625-73-C0JUVACXX-7 transcriptome js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:03:18.557161 \n", | |
"* 3 SRX288285 transcriptome js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:18:02.276951 \n", | |
"* 4 SRX288430 transcriptome js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:23:35.449028 \n", | |
"* 5 SRX288431 transcriptome js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:26:14.802972 \n", | |
"* 6 SRX288432 transcriptome js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:28:08.248150 \n", | |
"* 7 JGI_NEMVEC import js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:33:35.147099 \n", | |
"* 8 JGI_NEMVEC translate js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:33:37.988784 \n", | |
"* 9 JGI_NEMVEC annotate js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:34:33.933559 \n", | |
"* 10 NCBI_HYDMAG import js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:35:17.724920 \n", | |
"* 11 NCBI_HYDMAG annotate js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:35:47.577219 \n", | |
"* 12 PhylogenyTest homologize js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:36:55.574199 \n", | |
"* 13 PhylogenyTest homologize js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:37:09.988285 \n", | |
"* 14 PhylogenyTest multalign js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:37:52.227894 \n", | |
"* 15 PhylogenyTest genetree js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:37:58.123674 \n", | |
"* 16 PhylogenyTest treeinform js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:38:09.355908 \n", | |
"* 17 PhylogenyTest homologize js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:38:11.042958 \n", | |
"* 18 PhylogenyTest multalign js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:38:13.203888 \n", | |
"* 19 PhylogenyTest genetree js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:38:18.624530 \n", | |
"* 20 PhylogenyTest treeprune js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:38:33.165721 \n", | |
"* 21 PhylogenyTest multalign js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:38:34.677996 \n", | |
"* 22 PhylogenyTest supermatrix js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:38:39.458883 \n", | |
"* 23 PhylogenyTest speciestree js-169-78.jetstream-cloud.org ljcohen 2018-03-05T16:38:40.716004 \n" | |
] | |
} | |
], | |
"source": [ | |
"!agalma diagnostics list" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 38, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n", | |
"SRX033366 [2018-03-05 21:58:41]\n", | |
"/home/ljcohen/SRX033366.fq (1.7 MB)\n", | |
" species: Nanomia bijuga\n", | |
" ncbi_id: 168759\n", | |
" itis_id: 51389\n", | |
" extraction_id: None\n", | |
" library_id: None\n", | |
" library_type: None\n", | |
" individual: specimen-1\n", | |
" treatment: gastrozooids\n", | |
" sequencer: None\n", | |
" seq_center: None\n", | |
" note: None\n", | |
" sample_prep: None\n", | |
"biolite.config.parse_env_resources: database=/home/ljcohen/agalma/data/agalma.sqlite\n", | |
"SRX036876 [2018-03-05 21:58:42]\n", | |
"/home/ljcohen/SRX036876.fq (1.9 MB)\n", | |
" species: Nanomia bijuga\n", | |
" ncbi_id: 168759\n", | |
" itis_id: 51389\n", | |
" extraction_id: None\n", | |
" library_id: None\n", | |
" library_type: None\n", | |
" individual: specimen-2\n", | |
" treatment: gastrozooids\n", | |
" sequencer: None\n", | |
" seq_center: None\n", | |
" note: None\n", | |
" sample_prep: None\n" | |
] | |
} | |
], | |
"source": [ | |
"!cd ~/agalma/data\n", | |
"!agalma catalog insert --id SRX033366 --paths SRX033366.fq --species \"Nanomia bijuga\" --ncbi_id 168759 --itis_id 51389 --treatment gastrozooids --individual specimen-1\n", | |
"!agalma catalog insert --id SRX036876 --paths SRX036876.fq --species \"Nanomia bijuga\" --ncbi_id 168759 --itis_id 51389 --treatment gastrozooids --individual specimen-2" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 39, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/qc-40'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / qc.setup_data / 0.146s / 108.7MB\n", | |
" Setup paths to the FASTQ input sequence data\n", | |
"biolite.pipeline.setup_data: reading data from paths in catalog\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / qc.fastqc / 0.148s / 108.7MB\n", | |
" Generate FastQC reports for each FASTQ file\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 2 / qc.parse / 3.690s / 385.1MB\n", | |
" Parse FastQC reports into the database\n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 3.700s / 386.9MB\n", | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/qc-41'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / qc.setup_data / 0.082s / 108.4MB\n", | |
" Setup paths to the FASTQ input sequence data\n", | |
"biolite.pipeline.setup_data: reading data from paths in catalog\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / qc.fastqc / 0.085s / 108.4MB\n", | |
" Generate FastQC reports for each FASTQ file\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 2 / qc.parse / 3.579s / 388.6MB\n", | |
" Parse FastQC reports into the database\n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 3.587s / 390.6MB\n", | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/expression-42'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / expression.setup_data / 0.059s / 123.3MB\n", | |
" Setup paths to the FASTQ input sequence data\n", | |
"biolite.pipeline.setup_data: reading data from paths in catalog\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / expression.setup_reference / 0.061s / 123.3MB\n", | |
" Locate reference sequences in the Agalma database\n", | |
"__main__.setup_reference: using previous 'transcriptome' run id 4\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 2 / expression.calculate / 0.068s / 124.2MB\n", | |
" Calculate gene and isoform expression with RSEM\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 3 / expression.parse_counts / 1.359s / 186.9MB\n", | |
" Parse gene-level counts into Agalma database\n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 1.363s / 187.1MB\n", | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/expression-43'\n", | |
"biolite.pipeline.run: Starting at stage 0\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 0 / expression.setup_data / 0.084s / 123.2MB\n", | |
" Setup paths to the FASTQ input sequence data\n", | |
"biolite.pipeline.setup_data: reading data from paths in catalog\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 1 / expression.setup_reference / 0.085s / 123.2MB\n", | |
" Locate reference sequences in the Agalma database\n", | |
"__main__.setup_reference: using previous 'transcriptome' run id 4\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 2 / expression.calculate / 0.093s / 124.1MB\n", | |
" Calculate gene and isoform expression with RSEM\n", | |
"biolite.pipeline.run: \n", | |
" STAGE 3 / expression.parse_counts / 1.179s / 186.8MB\n", | |
" Parse gene-level counts into Agalma database\n", | |
"biolite.pipeline.run: \n", | |
" FINISHED / 1.183s / 187.0MB\n" | |
] | |
} | |
], | |
"source": [ | |
"!cd ~/agalma/scratch\n", | |
"!agalma qc --id SRX033366\n", | |
"!agalma qc --id SRX036876\n", | |
"!agalma expression --id SRX033366 SRX288430\n", | |
"!agalma expression --id SRX036876 SRX288430" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 41, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/SRX033366'\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/SRX033366/css'\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/SRX033366/img'\n", | |
"agalma.agalma_report.report_runs: skipping unfinished run 24\n", | |
"agalma.agalma_report.report_runs: skipping unfinished run 26\n", | |
"agalma.agalma_report.report_runs: skipping unfinished run 28\n", | |
"agalma.agalma_report.report_runs: skipping unfinished run 30\n", | |
"agalma.agalma_report.report_runs: skipping unfinished run 32\n", | |
"agalma.agalma_report.report_runs: skipping unfinished run 34\n", | |
"agalma.agalma_report.report_runs: skipping unfinished run 36\n", | |
"agalma.agalma_report.report_runs: skipping unfinished run 38\n", | |
"agalma.agalma_report.report_runs: 40 has pipelines: qc\n", | |
"agalma.agalma_report.report_runs: added qc report for 40\n", | |
"agalma.agalma_report.report_runs: 42 has pipelines: expression\n", | |
"agalma.agalma_report.report_runs: added expression report for 42\n", | |
"biolite.config.parse_env_resources: threads=6\n", | |
"biolite.config.parse_env_resources: memory=14441M\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/SRX036876'\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/SRX036876/css'\n", | |
"biolite.utils.safe_mkdir: creating directory '/home/ljcohen/reports/SRX036876/img'\n", | |
"agalma.agalma_report.report_runs: skipping unfinished run 25\n", | |
"agalma.agalma_report.report_runs: skipping unfinished run 27\n", | |
"agalma.agalma_report.report_runs: skipping unfinished run 29\n", | |
"agalma.agalma_report.report_runs: skipping unfinished run 31\n", | |
"agalma.agalma_report.report_runs: skipping unfinished run 33\n", | |
"agalma.agalma_report.report_runs: skipping unfinished run 35\n", | |
"agalma.agalma_report.report_runs: skipping unfinished run 37\n", | |
"agalma.agalma_report.report_runs: skipping unfinished run 39\n", | |
"agalma.agalma_report.report_runs: 41 has pipelines: qc\n", | |
"agalma.agalma_report.report_runs: added qc report for 41\n", | |
"agalma.agalma_report.report_runs: 43 has pipelines: expression\n", | |
"agalma.agalma_report.report_runs: added expression report for 43\n" | |
] | |
} | |
], | |
"source": [ | |
"!agalma report --id SRX033366 --outdir reports/SRX033366\n", | |
"!agalma report --id SRX036876 --outdir reports/SRX036876" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# To download results to local computer:\n", | |
"\n", | |
"```\n", | |
"scp -r ljcohen@149.165.169.78:/home/ljcohen/reports/ ~/Documents/agalma_tutorial/\n", | |
"```" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 2", | |
"language": "python", | |
"name": "python2" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 2 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython2", | |
"version": "2.7.11" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
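The last few code cells of the notebook repeat the same `agalma catalog insert`, `agalma qc`, `agalma expression`, and `agalma report` calls for each of the two *Nanomia bijuga* samples. As a minimal sketch, assuming you would rather drive those steps from Python than retype them, the argument lists could be generated from a small table of sample metadata. The helper names `catalog_insert_cmd` and `sample_cmds` and the `SAMPLES` list are hypothetical and are not part of agalma; the only flags used are the ones that already appear in the notebook cells, and the commands are grouped per sample rather than per step, which is an illustrative choice.

```python
# Hypothetical helpers (not part of agalma): build the per-sample command
# lines used in the notebook, using only flags that appear in its cells.

def catalog_insert_cmd(sample):
    """Return the argument list for one `agalma catalog insert` call."""
    cmd = ["agalma", "catalog", "insert",
           "--id", sample["id"], "--paths", sample["path"]]
    # Optional metadata flags, in the order used in the notebook.
    for flag in ("species", "ncbi_id", "itis_id", "treatment", "individual"):
        if flag in sample:
            cmd += ["--" + flag, str(sample[flag])]
    return cmd


def sample_cmds(sample, reference_id):
    """Return the catalog, qc, expression, and report commands for a sample."""
    sid = sample["id"]
    return [
        catalog_insert_cmd(sample),
        ["agalma", "qc", "--id", sid],
        # The notebook passes the reference transcriptome ID after the sample ID.
        ["agalma", "expression", "--id", sid, reference_id],
        ["agalma", "report", "--id", sid, "--outdir", "reports/" + sid],
    ]


# Sample metadata copied from the `agalma catalog insert` cell.
SAMPLES = [
    {"id": "SRX033366", "path": "SRX033366.fq", "species": "Nanomia bijuga",
     "ncbi_id": 168759, "itis_id": 51389, "treatment": "gastrozooids",
     "individual": "specimen-1"},
    {"id": "SRX036876", "path": "SRX036876.fq", "species": "Nanomia bijuga",
     "ncbi_id": 168759, "itis_id": 51389, "treatment": "gastrozooids",
     "individual": "specimen-2"},
]
```

Each list returned by `sample_cmds(sample, "SRX288430")` can be passed to `subprocess.check_call`. Because every `!` line in a notebook runs in its own subshell, a `!cd` does not carry over to the lines that follow it; passing `cwd=` to `subprocess.check_call` is one way to pin the working directory for a command.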