Skip to content

Instantly share code, notes, and snippets.

@jennomics
Last active August 29, 2015 14:05
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save jennomics/c6fe5e113525c6aa8add to your computer and use it in GitHub Desktop.
Save jennomics/c6fe5e113525c6aa8add to your computer and use it in GitHub Desktop.
{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"This notebook is for the analysis of 15 16S PCR libraries that were produced from DNA extracted from meals, collected by blending a plate of food in the blender. The meals were prepared to be typical of three diet types (Average American, USDA-recommended, and Vegan)</p></p>\n",
"\n",
"Before launching this ipython notebook, I typed the macqiime command to configure the shell. I'm using macqiime 1.8.0\n",
"http://www.wernerlab.org/software/macqiime/macqiime-installation \n",
"\n"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"I copied the stuff below from a QIIME/iPython notebook tutorial: http://nbviewer.ipython.org/github/qiime/qiime/blob/1.8.0/examples/ipynb/illumina_overview_tutorial.ipynb</p></p>I'm not sure what all of it does...\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from os import chdir, mkdir\n",
"from os.path import join\n",
"#the following are only available in the current development branch of IPython\n",
"from IPython.display import FileLinks, FileLink"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Substitute the file paths below, and then there is no need to change any of the code in the rest of the notebook.</p></p>\n",
"The version of macqiime that I'm using does not install the greengenes 99% cutoff OTUs and taxonomy, so did that manually as per the instructions on the MacQIIME Installation site. I just substituted the gg_13_8_otus folder that has all of the otu cutoffs for the one included in macqiime/greengenes/ \n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"project_name = \"MicrobesWeEat\"\n",
"sequence_file = \"./MicrobesWeEat.fasta\"\n",
"non_chimeric_sequence_file = \"./non_chimeric_sequences.fasta\"\n",
"mapping_file = \"./MicrobesWeEat.txt\"\n",
"otu_base = \"/macqiime/greengenes/gg_13_8_otus/\"\n",
"reference_seqs_99 = join(otu_base,\"/macqiime/greengenes/gg_13_8_otus/rep_set/99_otus.fasta\")\n",
"reference_tree_99 = join(otu_base,\"/macqiime/greengenes/gg_13_8_otus/trees/99_otus.tree\")\n",
"reference_tax_99 = join(otu_base,\"/macqiime/greengenes/gg_13_8_otus/taxonomy/99_otu_taxonomy.txt\")\n",
"reference_seqs_97 = join(otu_base,\"/macqiime/greengenes/gg_13_8_otus/rep_set/97_otus.fasta\")\n",
"reference_tree_97 = join(otu_base,\"/macqiime/greengenes/gg_13_8_otus/trees/97_otus.tree\")\n",
"reference_tax_97 = join(otu_base,\"/macqiime/greengenes/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt\")\n"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Make sure mapping file is good to go. This also provides a quick check for macqiime."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!validate_mapping_file.py -m $mapping_file"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No errors or warnings were found in mapping file.\r\n"
]
}
],
"prompt_number": 3
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#!ls RawSequenceData/\n",
"#Listed are the raw sequence files."
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 43
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Demultiplex and join."
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Gotta unzip them to work with them."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#!gunzip RawSequenceData/*.gz\n",
"#!ls RawSequenceData/"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 44
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"After consulting with a QIIME developer via the forum, I know that there is an issue with the trailing *:N:0 in all of the fastq files.\n",
"\n",
"A fix before proceeding is below, but this should not be necessary in future releases."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!sed 's/ 1:N:0://g' RawSequenceData/HMSB_AZ_35_NoIndex_L001_R1_001.fastq > HMSB_AZ_35_NoIndex_L001_R1_001_fixed.fastq\n",
"!sed 's/ 4:N:0://g' RawSequenceData/HMSB_AZ_35_NoIndex_L001_R4_001.fastq > HMSB_AZ_35_NoIndex_L001_R4_001_fixed.fastq\n",
"!sed 's/ 1:N:0://g' RawSequenceData/HMSB_AZ_35_NoIndex_L001_R1_002.fastq > HMSB_AZ_35_NoIndex_L001_R1_002_fixed.fastq\n",
"!sed 's/ 4:N:0://g' RawSequenceData/HMSB_AZ_35_NoIndex_L001_R4_002.fastq > HMSB_AZ_35_NoIndex_L001_R4_002_fixed.fastq\n",
"!sed 's/ 1:N:0://g' RawSequenceData/HMSB_AZ_35_NoIndex_L001_R1_003.fastq > HMSB_AZ_35_NoIndex_L001_R1_003_fixed.fastq\n",
"!sed 's/ 4:N:0://g' RawSequenceData/HMSB_AZ_35_NoIndex_L001_R4_003.fastq > HMSB_AZ_35_NoIndex_L001_R4_003_fixed.fastq\n",
"!sed 's/ 1:N:0://g' RawSequenceData/HMSB_AZ_35_NoIndex_L001_R1_004.fastq > HMSB_AZ_35_NoIndex_L001_R1_004_fixed.fastq\n",
"!sed 's/ 4:N:0://g' RawSequenceData/HMSB_AZ_35_NoIndex_L001_R4_004.fastq > HMSB_AZ_35_NoIndex_L001_R4_004_fixed.fastq"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 28
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Create a new barcode file in which the forward and reverse barcodes are concatenated into a single barcode."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!extract_barcodes.py -r RawSequenceData/HMSB_AZ_35_NoIndex_L001_R2_001.fastq -f RawSequenceData/HMSB_AZ_35_NoIndex_L001_R3_001.fastq -c barcode_paired_end --bc1_len 8 --bc2_len 8 -o Lane1.barcodes\n",
"!extract_barcodes.py -r RawSequenceData/HMSB_AZ_35_NoIndex_L001_R2_002.fastq -f RawSequenceData/HMSB_AZ_35_NoIndex_L001_R3_002.fastq -c barcode_paired_end --bc1_len 8 --bc2_len 8 -o Lane2.barcodes\n",
"!extract_barcodes.py -r RawSequenceData/HMSB_AZ_35_NoIndex_L001_R2_003.fastq -f RawSequenceData/HMSB_AZ_35_NoIndex_L001_R3_003.fastq -c barcode_paired_end --bc1_len 8 --bc2_len 8 -o Lane3.barcodes\n",
"!extract_barcodes.py -r RawSequenceData/HMSB_AZ_35_NoIndex_L001_R2_004.fastq -f RawSequenceData/HMSB_AZ_35_NoIndex_L001_R3_004.fastq -c barcode_paired_end --bc1_len 8 --bc2_len 8 -o Lane4.barcodes"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 25
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Again remove the trailing *:N:0 in all of the barcode fastq files."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!sed 's/ 3:N:0://g' Lane1.barcodes/barcodes.fastq > Lane1.barcodes.fastq\n",
"!sed 's/ 3:N:0://g' Lane2.barcodes/barcodes.fastq > Lane2.barcodes.fastq\n",
"!sed 's/ 3:N:0://g' Lane3.barcodes/barcodes.fastq > Lane3.barcodes.fastq\n",
"!sed 's/ 3:N:0://g' Lane4.barcodes/barcodes.fastq > Lane4.barcodes.fastq"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 26
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Now, \"join\" the paired ends. This will create sequences that have the forward barcode at the beginning and the reverse barcode at the end."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!join_paired_ends.py -r HMSB_AZ_35_NoIndex_L001_R1_001_fixed.fastq -f HMSB_AZ_35_NoIndex_L001_R4_001_fixed.fastq -b Lane1.barcodes.fastq -o Lane1_joined\n",
"!join_paired_ends.py -r HMSB_AZ_35_NoIndex_L001_R1_002_fixed.fastq -f HMSB_AZ_35_NoIndex_L001_R4_002_fixed.fastq -b Lane2.barcodes.fastq -o Lane2_joined\n",
"!join_paired_ends.py -r HMSB_AZ_35_NoIndex_L001_R1_003_fixed.fastq -f HMSB_AZ_35_NoIndex_L001_R4_003_fixed.fastq -b Lane3.barcodes.fastq -o Lane3_joined\n",
"!join_paired_ends.py -r HMSB_AZ_35_NoIndex_L001_R1_004_fixed.fastq -f HMSB_AZ_35_NoIndex_L001_R4_004_fixed.fastq -b Lane4.barcodes.fastq -o Lane4_joined"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 33
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Now, time to demultiplex!"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!split_libraries_fastq.py -q 5 -r 9 -p 0.5 -i Lane1_joined/fastqjoin.join.fastq -o Demultiplexed_Lane1 -m MicrobesWeEat.txt -b Lane1_joined/fastqjoin.join_barcodes.fastq --retain_unassigned_reads --barcode_type 16 --max_barcode_errors 2\n",
"!split_libraries_fastq.py -q 5 -r 9 -p 0.5 -i Lane2_joined/fastqjoin.join.fastq -o Demultiplexed_Lane2 -m MicrobesWeEat.txt -b Lane2_joined/fastqjoin.join_barcodes.fastq --retain_unassigned_reads --barcode_type 16 --max_barcode_errors 2\n",
"!split_libraries_fastq.py -q 5 -r 9 -p 0.5 -i Lane3_joined/fastqjoin.join.fastq -o Demultiplexed_Lane3 -m MicrobesWeEat.txt -b Lane3_joined/fastqjoin.join_barcodes.fastq --retain_unassigned_reads --barcode_type 16 --max_barcode_errors 2\n",
"!split_libraries_fastq.py -q 5 -r 9 -p 0.5 -i Lane4_joined/fastqjoin.join.fastq -o Demultiplexed_Lane4 -m MicrobesWeEat.txt -b Lane4_joined/fastqjoin.join_barcodes.fastq --retain_unassigned_reads --barcode_type 16 --max_barcode_errors 2"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 111
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"I want to know how many reads per library I have."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#!cat Demultiplexed_Lane1/split_library_log.txt\n",
"#!cat Demultiplexed_Lane2/split_library_log.txt\n",
"#!cat Demultiplexed_Lane3/split_library_log.txt\n",
"#!cat Demultiplexed_Lane4/split_library_log.txt"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 42
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"At this point, I want to concatenate all of the lanes, but there will be multiple sequences with the same ID (e.g., Unassigned_0) and I'm assuming that will cause problems. So, I have 2 choices: concatenate before running split_libraries_fastq.py OR figure out if QIIME will take more than one input fasta file. I searched for a bit to find out if the second option will work, but I ran out of patience with the search before I found an answer. So, I'm going concatenate the output from join_paired_ends.py and run split_libraries_fastq.py again."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!cat Lane*_joined/fastqjoin.join.fastq > All_Lanes_joined.fastq\n",
"!cat Lane*_joined/fastqjoin.join_barcodes.fastq > All_barcodes_joined.fastq"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 118
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"The parameters that I'm using for split_libraries.py were optimized through trial and error. Either pretty much all sequences passed the quality filter or none of them did, so tuning the parameters was pretty straightforward. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!split_libraries_fastq.py -q 5 -r 9 -p 0.5 -i All_Lanes_joined.fastq -o Demultiplexed_All -m MicrobesWeEat.txt -b All_barcodes_joined.fastq --retain_unassigned_reads --barcode_type 16 --max_barcode_errors 2"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 119
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!cat Demultiplexed_All/split_library_log.txt"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"cat: Demultiplexed_All/split_library_log.txt: No such file or directory\r\n"
]
}
],
"prompt_number": 79
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#!ls Demultiplexed_All/"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 45
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Pick closed reference OTUs for PICRUSt"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"While there are good reasons NOT to use closed-reference otu-picking in general, because one of the things I'd like to do is to use PICRUSt to predict a metagenome, I will start by using the 99% cutoff greengenes reference files for closed-reference picking"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!pick_closed_reference_otus.py -p 99parameters.txt -o greengenes_99_otus -i Demultiplexed_All/seqs.fna -r $reference_seqs_99 -t $reference_tax_99 -a -O 2 -f"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 6
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!pick_closed_reference_otus.py -p 97parameters.txt -o greengenes_97_otus -i Demultiplexed_All/seqs.fna -r $reference_seqs_97 -t $reference_tax_97 -a -O 2 -f"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 9
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#!cat greengenes_99_otus/uclust_ref_picked_otus/seqs_otus.log"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 25
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#!biom summarize-table -i greengenes_99_otus/otu_table.biom -o 99_otu_table_summary.txt\n",
"#!cat 99_otu_table_summary.txt"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 64
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#!biom summarize-table -i greengenes_97_otus/otu_table.biom -o 97_otu_table_summary.txt\n",
"#!cat 97_otu_table_summary.txt"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"cat: 97_otu_table_summary.txt: No such file or directory\r\n"
]
}
],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#!cat greengenes_97_otus/uclust_ref_picked_otus/seqs_otus.log"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 6
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Now for chimera-checking (I suppose this could be done earlier, actually.) Next time, I would do it just fter split_libraries_fastq.py). The input must be unaligned fasta sequences in the same orientation as the reference database. That's why I'm using the output from split_libraries_fastq.py, as recommended."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#!identify_chimeric_seqs.py -i Demultiplexed_All/seqs.fna -m usearch61 -o 97_usearch61_chimera_detection/ -r $reference_seqs_97\n",
"#!identify_chimeric_seqs.py -i Demultiplexed_All/seqs.fna -m usearch61 -o 99_usearch61_chimera_detection/ -r $reference_seqs_99"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 21
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Remove the chloroplast and mitochondrial sequences. I'm also removing the \"Unassigned\" sequences. There are 795 of them in 39 OTUs. I manually blasted the representative sequences from each OTU, and they were all chloroplasts. I am also removing all of the remaining singletons."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!filter_taxa_from_otu_table.py -i greengenes_97_otus/otu_table.biom -o closed_ref_97_otu_table_no_euks.biom -n c__Chloroplast,f__mitochondria\n",
"!filter_taxa_from_otu_table.py -i closed_ref_97_otu_table_no_euks.biom -o closed_ref_97_otu_table_no_euks_no_unassigned.biom -n Unassigned\n",
"!filter_otus_from_otu_table.py -i closed_ref_97_otu_table_no_euks_no_unassigned.biom -o closed_ref_97_otu_table_filtered.biom -n 2"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!filter_taxa_from_otu_table.py -i greengenes_99_otus/otu_table.biom -o closed_ref_99_otu_table_no_euks.biom -n c__Chloroplast,f__mitochondria\n",
"!filter_taxa_from_otu_table.py -i closed_ref_99_otu_table_no_euks.biom -o closed_ref_99_otu_table_no_euks_no_unassigned.biom -n Unassigned\n",
"!filter_otus_from_otu_table.py -i closed_ref_99_otu_table_no_euks_no_unassigned.biom -o closed_ref_99_otu_table_filtered.biom -n 2"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 37
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#!biom summarize-table -i closed_ref_99_otu_table_filtered.biom -o closed_ref_99_otu_table_filtered.summary\n",
"#!cat closed_ref_99_otu_table_filtered.summary"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"This is a reminder to look at the biom summary before proceeding. In my case, I still have all of the PhiX and other unassigned reads in my data. So, I created a file called samples_to_keep.txt with a list of my actual sample IDs in it, and I will use this list to remove the Unassigned reads."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!filter_samples_from_otu_table.py -i closed_ref_97_otu_table_filtered.biom -o closed_ref_97_otu_table_final.biom --sample_id_fp samples_to_keep.txt"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 5
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!filter_samples_from_otu_table.py -i closed_ref_99_otu_table_filtered.biom -o closed_ref_99_otu_table_final.biom --sample_id_fp samples_to_keep.txt"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 68
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Run PICRUSt"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Normalize otu table by 16S copy number"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!/Applications/picrust-1.0.0/scripts/normalize_by_copy_number.py -i closed_ref_97_otu_table_final.biom -o normalized_closed_ref_97_otu_table.biom"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 6
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!/Applications/picrust-1.0.0/scripts/normalize_by_copy_number.py -i closed_ref_99_otu_table_final.biom -o normalized_closed_ref_99_otu_table.biom"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 69
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Predict the metagenome, note that I am using the -a flag to calculate NSTI."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!/Applications/picrust-1.0.0/scripts/predict_metagenomes.py -a normalized_closed_ref_97_NSTI.tab -i normalized_closed_ref_97_otu_table.biom -o metagenome_prediction_from_normalized_closed_ref_97_otu_table.biom"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 7
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!/Applications/picrust-1.0.0/scripts/predict_metagenomes.py -a normalized_closed_ref_99_NSTI.tab -i normalized_closed_ref_99_otu_table.biom -o metagenome_prediction_from_normalized_closed_ref_99_otu_table.biom"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 70
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#!more normalized_closed_ref_97_NSTI.tab"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#!more normalized_closed_ref_99_NSTI.tab"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 72
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Collapse the thousands of predicted functions into higher categories (e.g KOs into KEGG Pathways) and generate STAMP profiles"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!/Applications/picrust-1.0.0/scripts/categorize_by_function.py -f -i metagenome_prediction_from_normalized_closed_ref_97_otu_table.biom -c KEGG_Pathways -l 3 -o 97_MWE_predicted_metagenomes.L3.txt\n",
"!/Applications/picrust-1.0.0/scripts/categorize_by_function.py -f -i metagenome_prediction_from_normalized_closed_ref_97_otu_table.biom -c KEGG_Pathways -l 2 -o 97_MWE_predicted_metagenomes.L2.txt"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 8
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!/Applications/picrust-1.0.0/scripts/categorize_by_function.py -f -i metagenome_prediction_from_normalized_closed_ref_99_otu_table.biom -c KEGG_Pathways -l 3 -o 99_MWE_predicted_metagenomes.L3.txt\n",
"!/Applications/picrust-1.0.0/scripts/categorize_by_function.py -f -i metagenome_prediction_from_normalized_closed_ref_99_otu_table.biom -c KEGG_Pathways -l 2 -o 99_MWE_predicted_metagenomes.L2.txt"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 73
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!sed '1d' 97_MWE_predicted_metagenomes.L3.txt | rev | cut -f 2- | rev > 97_MWE_predicted_metagenome.L3.spf\n",
"!sed '1d' 97_MWE_predicted_metagenomes.L2.txt | rev | cut -f 2- | rev > 97_MWE_predicted_metagenome.L2.spf"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 9
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!sed '1d' 99_MWE_predicted_metagenomes.L3.txt | rev | cut -f 2- | rev > 99_MWE_predicted_metagenome.L3.spf\n",
"!sed '1d' 99_MWE_predicted_metagenomes.L2.txt | rev | cut -f 2- | rev > 99_MWE_predicted_metagenome.L2.spf"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 74
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"The most significant difference between Diet Types is \"Other Glycan Degradation\" (K00511) at L3 (see figure).</p></p>\n",
"There are no significant differences at L2."
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Identify which OTUs are contributing to K00511"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!/Applications/picrust-1.0.0/scripts/metagenome_contributions.py -l K00511 -i normalized_closed_ref_97_otu_table.biom -o 97_metagenome_contribution_K00511 -g 13_5"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Filtering the genome table to include only user-specified functions: ['K00511']\r\n"
]
}
],
"prompt_number": 13
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!/Applications/picrust-1.0.0/scripts/metagenome_contributions.py -l K00511 -i normalized_closed_ref_99_otu_table.biom -o 99_metagenome_contribution_K00511 -g 13_5"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Filtering the genome table to include only user-specified functions: ['K00511']\r\n"
]
}
],
"prompt_number": 14
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!cat 97_metagenome_contribution_K00511"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Gene\tSample\tOTU\tGeneCountPerGenome\tOTUAbundanceInSample\tCountContributedByOTU\tContributionPercentOfSample\tContributionPercentOfAllSamples\r\n",
"K00511\tUSDAdinner\t4366579\t1.0\t1.0\t1.0\t1.0\t1.0"
]
}
],
"prompt_number": 15
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!cat 99_metagenome_contribution_K00511"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Gene\tSample\tOTU\tGeneCountPerGenome\tOTUAbundanceInSample\tCountContributedByOTU\tContributionPercentOfSample\tContributionPercentOfAllSamples\r\n",
"K00511\tUSDAdinner\t106978\t1.0\t1.0\t1.0\t1.0\t1.0"
]
}
],
"prompt_number": 16
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"So, OTU 106978, which is Corallococcus exiguus, is abundant in the USDA dinner, and is the contributor to the \"Other Glycan Degradation\" in this metagenome"
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Pick OTUs"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"The QIIME developers prefer the open reference OTU-picking approach, and I have no reason not to go with the default 97% cutoff, so that's what I'll do here. The pick_open_reference_otus.py script will cluster all of the sequences, assign taxonomy to the OTUs (when possible, a greengenes ID will be assigned,) choose a representative sequence from each OTU (rep_set), and align and build a phylogenetic tree from the representative sequences."
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"This time I'll remember to get rid of the Unassigned reads before I pick the OTUs!"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!filter_fasta.py -f MicrobesWeEat.fasta -o MicrobesWeEat_NoUnassigned.fasta --sample_id_fp samples_to_keep.txt"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 81
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!pick_open_reference_otus.py -p 97parameters.txt -r $reference_seqs_97 -i MicrobesWeEat_NoUnassigned.fasta -o 97_open_reference_otus -f"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 96
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Sanity check for the otu table."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#!biom summarize-table -i 97_open_reference_otus/otu_table_mc2_w_tax_no_pynast_failures.biom -o otu_table_summary_before_cleanup.txt\n",
"#!cat otu_table_summary_before_cleanup.txt"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 8
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Filter out singletons, mitochondria, and chloroplasts."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!filter_taxa_from_otu_table.py -i 97_open_reference_otus/otu_table_mc2_w_tax_no_pynast_failures.biom -o open_ref_97_otu_table_no_euks.biom -n c__Chloroplast,f__mitochondria\n",
"!filter_taxa_from_otu_table.py -i open_ref_97_otu_table_no_euks.biom -o open_ref_97_otu_table_no_euks_no_unassigned.biom -n Unassigned\n",
"!filter_otus_from_otu_table.py -i open_ref_97_otu_table_no_euks_no_unassigned.biom -o open_ref_97_otu_table_no_euks_no_unassigned_no_singletons.biom -n 2"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 88
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#sanity check\n",
"#!biom summarize-table -i open_ref_97_otu_table_no_euks_no_unassigned_no_singletons.biom -o open_ref_97_otu_table_no_euks_no_unassigned_no_singletons.summary\n",
"#!cat open_ref_97_otu_table_no_euks_no_unassigned_no_singletons.summary"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 173
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#!identify_chimeric_seqs.py -m ChimeraSlayer -i rep_set_aligned.fasta -a core_set_aligned.fasta.imputed -o chimeric_seqs_open_97.txt"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 104
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"I couldn't get ChimeraSlayer to run. QIIME support thinks that the problem might be that I do not have enough memory because he was able to run it on my file. He was nice enough to send me the output, so I can proceed."
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Now that I have my list of chimeric OTUs, I need to remove them from the OTU table, remove them from my aligned representative sequences, build a new phylogenetic tree from the new rep set alignment (I think, but all of this may not be necessary.)"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#remove chimeras from aligned rep_set \n",
"!filter_fasta.py -f 97_open_reference_otus/pynast_aligned_seqs/rep_set_aligned.fasta -o non_chimeric_rep_set_aligned.fasta -s chimeric_seqs_open_97.txt -n"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 170
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!make_phylogeny.py -i non_chimeric_rep_set_aligned.fasta"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 108
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!filter_otus_from_otu_table.py -i open_ref_97_otu_table_no_euks_no_unassigned_no_singletons.biom -o open_ref_97_otu_table_no_euks_no_unassigned_no_singletons_no_chimeras.biom -e chimeric_seqs_open_97.txt"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 172
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#sanity check\n",
"#!biom summarize-table -i open_ref_97_otu_table_no_euks_no_unassigned_no_singletons_no_chimeras.biom -o open_ref_97_otu_table_no_euks_no_unassigned_no_singletons_no_chimeras.summary\n",
"#!cat open_ref_97_otu_table_no_euks_no_unassigned_no_singletons_no_chimeras.summary"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 175
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"The following steps won't work without the metadata added to the biom file, so I'll do that first. I had to dig the observation metadata mapping file out of here: uclust_assigned_taxonomy/rep_set_tax_assignments.txt AND I had to add a header row to make it look like the example files in http://biom-format.org/documentation/adding_metadata.html"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!rm 97_open_otu_table_wmd.biom\n",
"!biom add-metadata -i open_ref_97_otu_table_no_euks_no_unassigned_no_singletons_no_chimeras.biom -o 97_open_otu_table_wmd.biom --sample-metadata-fp MicrobesWeEat.txt --observation-metadata-fp rep_set_tax_assignments_header.txt "
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 26
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Core Diversity Analyses"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Running core_diversity_analyses.py with libraries subsampled to 893 sequences and categorized by DietType. Phylogeny-based analyses will use the new tree with the chimeric sequences removed."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#sanity check\n",
"#!biom summarize-table -i 97_open_otu_table_wmd.biom -o 97_open_otu_table_wmd.summary\n",
"#!cat 97_open_otu_table_wmd.summary"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 35
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#unique counts instead of totals\n",
"#!biom summarize-table -i 97_open_otu_table_wmd.biom --qualitative -o 97_open_otu_table_wmd.qualitative.summary\n",
"#!cat 97_open_otu_table_wmd.qualitative.summary"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 37
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#unique counts instead of totals, but with rarefied OTU table\n",
"!biom summarize-table -i core_diversity_analyses_open_ref_97/table_even771.biom --qualitative -o 97_open_otu_table_771.qualitative.summary\n",
"!cat 97_open_otu_table_771.qualitative.summary"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Num samples: 15\r\n",
"Num observations: 729\r\n",
"Table md5 (unzipped): 0f5f8fd0fc772b57d5af6c39cacea84b\r\n",
"\r\n",
"Observations/sample summary:\r\n",
" Min: 13\r\n",
" Max: 203\r\n",
" Median: 116.000\r\n",
" Mean: 111.067\r\n",
" Std. dev.: 60.793\r\n",
" Sample Metadata Categories: Iron_bin; Carotene_beta; phinchID; Fermented; Sodium; Carotene_beta_bin; Protein_bin; Dairy; Forward; Sodium_bin; Sugars_bin; Fiber; Description; BarcodeSequence; PC1_bin; Fruit; Energy_bin; Vitamin_A_bin; Vitamin_A; Fatty_acids; Carbohydrate_bin; Fiber_bin; Cholesterol_bin; DietType; Reverse; Calcium_bin; Cooked; LinkerPrimerSequence; Iron; Cholesterol; HFCS; Meal; Total_lipid_bin; Total_lipid; Calcium; Vitamin_C; Energy; Sugars; Carbohydrate; PC1; Protein; Fatty_Acids_bin; Vitamin_C_bin\r\n",
" Observation Metadata Categories: taxonomy; confidence\r\n",
"\r\n",
"Observations/sample detail:\r\n",
" USDAsnack1: 13\r\n",
" USDAsnack2: 26\r\n",
" AMERICANsnack: 41\r\n",
" AMERICANlunch: 42\r\n",
" VEGANsnack1: 58\r\n",
" VEGANlunch: 94\r\n",
" VEGANbreakfast: 106\r\n",
" USDAlunch: 116\r\n",
" AMERICANbreakfast: 123\r\n",
" USDAdinner: 154\r\n",
" VEGANdinner: 167\r\n",
" USDAbreakfast: 168\r\n",
" AMERICANdinner: 170\r\n",
" VEGANsnack3: 185\r\n",
" VEGANsnack2: 203\r\n"
]
}
],
"prompt_number": 38
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#!rm -r DietType_core_diversity_analyses_open_ref_97\n",
"#!rm -r core_diversity_analyses_open_ref_97"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 181
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!core_diversity_analyses.py -p 97parameters.txt -i 97_open_otu_table_wmd.biom -o DietType_core_diversity_analyses_open_ref_97 -m MicrobesWeEat.txt -e 771 -c DietType -t non_chimeric_rep_set_aligned.tre"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 18
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!core_diversity_analyses.py -p 97parameters.txt -i 97_open_otu_table_wmd.biom -o core_diversity_analyses_open_ref_97 -m MicrobesWeEat.txt -e 771 -t non_chimeric_rep_set_aligned.tre"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 17
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"The first time I ran this, it did not work. The issue is that my taxonomy metadata in the biom file is included as strings rather than as lists. I'm not sure why this is the case, but I tracked down the fix, which is to add this line to my parameters file:</p></p>summarize_taxa:md_as_string True"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!nmds.py -i core_diversity_analyses_open_ref_97/bdiv_even771/weighted_unifrac_dm.txt -o NMDS_output"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 19
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!cat NMDS_output"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"samples\tNMDS1\tNMDS2\tNMDS3\r\n",
"AMERICANsnack\t-1.17027261939\t0.160180891398\t0.117604120355\r\n",
"VEGANsnack1\t0.728025005467\t1.73753280116\t-0.0163669886395\r\n",
"AMERICANbreakfast\t0.933966186973\t-0.193126733697\t-0.971073635433\r\n",
"VEGANsnack2\t0.626862241432\t-0.247210344627\t0.486249440075\r\n",
"AMERICANlunch\t-1.10856720972\t0.055161785597\t-0.261550124104\r\n",
"VEGANsnack3\t0.694201915455\t-0.0584857699766\t0.392490895575\r\n",
"VEGANdinner\t0.791638204242\t-0.318713695544\t0.16089612955\r\n",
"VEGANlunch\t-0.549011849737\t-0.571419325572\t0.499809089166\r\n",
"USDAdinner\t0.692697963831\t-0.347314151775\t-0.222069120523\r\n",
"AMERICANdinner\t0.737339108169\t-0.211959856119\t0.021761404915\r\n",
"USDAlunch\t-0.337601140052\t0.372012620336\t0.557853344909\r\n",
"VEGANbreakfast\t-0.0863078410135\t-0.177500463084\t-0.459990243303\r\n",
"USDAbreakfast\t0.549193082536\t-0.253528600159\t0.133269140252\r\n",
"USDAsnack2\t-1.32177734003\t-0.0617229557549\t-0.222565931975\r\n",
"USDAsnack1\t-1.18038570816\t0.116093797813\t-0.216317520821\r\n",
"\r\n",
"stress\t0.0260933547181\t0\t0\r\n",
"% variation explained\t0\t0\t0\t"
]
}
],
"prompt_number": 20
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Statistical Analyses"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"want to look at colelctor's curves without rarefaction"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!alpha_rarefaction.py -i 97_open_otu_table_wmd.biom -m $mapping_file -o 97_alpha_uneven -t 97_open_reference_otus/rep_set.tre -f"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 21
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"test for the significance of dietary pattern on the overall microbial community composition"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!rm -r permanova_DietTypes\n",
"!compare_categories.py --method permanova -m $mapping_file -c DietType -i core_diversity_analyses_open_ref_97/bdiv_even771/weighted_unifrac_dm.txt -o permanova_DietTypes"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 24
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"test for significant differences in taxonomic richness across dietary patterns"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!compare_alpha_diversity.py -i core_diversity_analyses_open_ref_97/arare_max771/alpha_div_collated/PD_whole_tree.txt -m $mapping_file -c DietType -o compare_alpha_div_DietType_PD"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 25
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"test for the significant variation in frequency of individual OTUs across dietary patterns"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!biom add-metadata -i core_diversity_analyses_open_ref_97/taxa_plots/table_mc771_sorted_L3.biom -o core_diversity_analyses_open_ref_97/taxa_plots/table_mc771_sorted_L2_wmd.biom --sample-metadata-fp MicrobesWeEat.txt #--observation-metadata-fp rep_set_tax_assignments_header.txt \n",
"!biom add-metadata -i core_diversity_analyses_open_ref_97/taxa_plots/table_mc771_sorted_L3.biom -o core_diversity_analyses_open_ref_97/taxa_plots/table_mc771_sorted_L3_wmd.biom --sample-metadata-fp MicrobesWeEat.txt #--observation-metadata-fp rep_set_tax_assignments_header.txt\n",
"!biom add-metadata -i core_diversity_analyses_open_ref_97/taxa_plots/table_mc771_sorted_L4.biom -o core_diversity_analyses_open_ref_97/taxa_plots/table_mc771_sorted_L4_wmd.biom --sample-metadata-fp MicrobesWeEat.txt #--observation-metadata-fp rep_set_tax_assignments_header.txt \n",
"!biom add-metadata -i core_diversity_analyses_open_ref_97/taxa_plots/table_mc771_sorted_L5.biom -o core_diversity_analyses_open_ref_97/taxa_plots/table_mc771_sorted_L5_wmd.biom --sample-metadata-fp MicrobesWeEat.txt #--observation-metadata-fp rep_set_tax_assignments_header.txt "
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 28
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!group_significance.py -i core_diversity_analyses_open_ref_97/taxa_plots/table_mc771_sorted_L2_wmd.biom -m $mapping_file -c DietType -o group_significance_L2\n",
"!group_significance.py -i core_diversity_analyses_open_ref_97/taxa_plots/table_mc771_sorted_L3_wmd.biom -m $mapping_file -c DietType -o group_significance_L3\n",
"!group_significance.py -i core_diversity_analyses_open_ref_97/taxa_plots/table_mc771_sorted_L4_wmd.biom -m $mapping_file -c DietType -o group_significance_L4\n",
"!group_significance.py -i core_diversity_analyses_open_ref_97/taxa_plots/table_mc771_sorted_L5_wmd.biom -m $mapping_file -c DietType -o group_significance_L5"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No metadata in biom table.\r\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No metadata in biom table.\r\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No metadata in biom table.\r\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"No metadata in biom table.\r\n"
]
}
],
"prompt_number": 29
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!make_otu_heatmap.py -i 97_open_otu_table_wmd.biom"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 30
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!make_otu_heatmap.py -i core_diversity_analyses_open_ref_97/table_even771.biom"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 39
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!per_library_stats.py -i 97_open_otu_table_wmd.biom-m $mapping_file -o 97_open_otu_table_wmd.perlibstats"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"/bin/sh: per_library_stats.py: command not found\r\n"
]
}
],
"prompt_number": 34
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Export for Use with Phinch"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!biom add-metadata -i otu_tables/open_ref_97_otu_table_no_euks_no_singletons.biom -o otu_tables/open_ref_97_otu_table_no_euks_no_singletons_with_metadata.biom -m MicrobesWeEat.txt"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment