Skip to content

Instantly share code, notes, and snippets.

@gregcaporaso
Created April 5, 2016 15:46
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save gregcaporaso/d4b0dd71d72e9c864989a1ff4545991d to your computer and use it in GitHub Desktop.
Save gregcaporaso/d4b0dd71d72e9c864989a1ff4545991d to your computer and use it in GitHub Desktop.
Final assignment for BIO/CS 499/599 (Spring 2016)
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this assignment, you'll be using QIIME to analyze two data sets. First, you'll work through a tiny test data set, which contains about 100 sequences from a few different human body sites collected over a few days. This will familiarize you with running QIIME through the IPython Notebook, and working with QIIME output. You'll then work through an analysis of a real-world data set, where human-associated microbial communities were shown to have forensic potential, potentially allowing investigators to determine who touched an object based on the \"microbial fingerprint\" they leave behind. \n",
"\n",
"Before starting this assignment, you'll work through some reading that will provide you with the necessary background to complete this assignment.\n",
"\n",
"You should begin by reading [Fierer et al (2010)](http://www.pnas.org/content/early/2010/03/01/1000162107.full.pdf), where this forensic study was initially published. You should then review the QIIME [Illumina Overview Tutorial](http://nbviewer.ipython.org/github/qiime/qiime/blob/1.8.0/examples/ipynb/illumina_overview_tutorial.ipynb), which will cover some of the commands that you'll be using here. Finally, as you work through the assignment, any time you use a QIIME command you should look up that command in [QIIME script index](http://qiime.org/documentation/script_index.html) to understand what it does and how to use it. \n",
"\n",
"**IMPORTANT**: This assignment includes some steps that will take some time to run (possibly up to 20 minutes). You should avoid starting this assignment close to the deadline as the server will be backed up at that time if a lot of people are running their analyses at the same time. You are responsible for starting this in a timely manner! If your assignment is late because it did not complete in time, that is your fault, not ours!\n",
"\n",
"**IMPORTANT**: If you create a file that later needs to be removed (e.g., because it was created incorrectly, and you get an error when trying to re-run the command because the file already exists), you can remove it using ``!rm`` or ``!rmdir`` (where the ``!`` tells IPython that you're issuing a shell/bash command, rather than a python command). For example, if I needed to remove a directory call ``misc_junk``, I could run:\n",
"\n",
"```\n",
"!rmdir misc_junk\n",
"```\n",
"\n",
"If I needed to remove a file called ``junk.txt``, I could run:\n",
"\n",
"```\n",
"!rm junk.txt\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting started: notebook configuration"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**IMPORTANT**: If you start your assignment, and then leave the notebook server, you will need to re-run the cells in this section to configure your IPython Notebook session."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from os import chdir, mkdir, rename\n",
"from os.path import exists, join, expandvars\n",
"from qiime.test import write_test_data\n",
"from IPython.display import FileLinks, FileLink\n",
"\n",
"raw_data_base = \"/home/jgc53/public_html/qiime_assignment_materials\"\n",
"reference_seqs = join(raw_data_base, \"88_otus.fasta\")\n",
"reference_tree = join(raw_data_base, \"88_otus.tree\")\n",
"reference_tax = join(raw_data_base, \"88_otu_taxonomy.txt\")\n",
"\n",
"forensic_seqs = join(raw_data_base, \"forensic-seqs.fna\")\n",
"forensic_map = join(raw_data_base, \"forensic-map.txt\")\n",
"\n",
"personal_dir = './'\n",
"\n",
"tiny_test_dir = \"tiny-test\"\n",
"tiny_test_path = join(personal_dir, tiny_test_dir)\n",
"forensic_dir = \"forensic\"\n",
"forensic_path = join(personal_dir, forensic_dir)\n",
"\n",
"if not exists(tiny_test_path):\n",
" mkdir(tiny_test_path)\n",
" \n",
"if not exists(forensic_path):\n",
" mkdir(forensic_path)\n",
" \n",
"write_test_data(tiny_test_path)\n",
"rename(join(tiny_test_path, \"map\"), join(tiny_test_path, \"map.txt\"))\n",
"rename(join(tiny_test_path, \"seqs\"), join(tiny_test_path, \"seqs.fna\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 1: Running example commands with the QIIME \"Tiny Test\" data set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can always see what files are present, and view or download them, using ``FileLinks`` (or ``FileLink``, if providing a single file path). Execute this cell and open the `map.txt` file to view the sample metadata associated with this study."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"FileLinks(tiny_test_dir)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first step in a QIIME analysis is to prepare the mapping file and validate it. In this case, the mapping file has been prepared for you, so you just need to validate it. We do that with [`validate_mapping_file.py`](http://qiime.org/scripts/validate_mapping_file.html). You can run that as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"input_map_fp = join(tiny_test_dir, 'map.txt')\n",
"output_dir = join(tiny_test_dir, 'vmf_out')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"!validate_mapping_file.py -m $input_map_fp -o $output_dir"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that there were errors and/or warnings generated during mapping file validation. Open and view the `map.html` file in the vmf output directory. We see that the only issues were with barcodes, and we are not using the barcodes in this example, so we can continue on."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"FileLinks(output_dir)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that there were errors and/or warnings generated during mapping file validation. Open and view the `map.html` file in the vmf output directory. We see that the only issues were with barcodes, and we are not using the barcodes in this example, so we can continue on."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we're going to run open reference OTU picking at 88% identity using [pick_open_reference_otus.py](http://qiime.org/scripts/pick_open_reference_otus.html). You should read [*OTU picking strategies in QIIME*](http://qiime.org/tutorials/otu_picking.html) to understand what this process does, and why we prefer this to other strategies. This will take a few minutes to run."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"input_seqs_fp = join(tiny_test_dir, 'seqs.fna')\n",
"output_dir = join(tiny_test_dir, 'or_otus')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"!pick_open_reference_otus.py -i $input_seqs_fp -o $output_dir -r $reference_seqs -s 0.88"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can now review the output generated by this command. Much of these are intermediary data files, so not useful to view directly, though do open the log file (listed at the top as ``log...txt``) to get an idea of what information is stored in the log file (e.g., how can you compute the run time from the information in the log file?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"or_index = join(output_dir, \"index.html\")\n",
"FileLink(or_index)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll create variables pointing to several of the files that we're interested in, and generate a summary of the output."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"otu_table_fp = join(tiny_test_dir, 'or_otus', 'otu_table_mc2_w_tax_no_pynast_failures.biom')\n",
"output_fp = join(tiny_test_dir, 'or_otus', 'otu_table_mc2_w_tax_no_pynast_failures_summ.txt')\n",
"tree_fp = join(tiny_test_dir, 'or_otus', 'rep_set.tre')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"!biom summarize-table -i $otu_table_fp -o $output_fp"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can look at that summary (via the link below) to see how many sequences were obtained per sample. One thing we need to do here is choose an even sampling depth for our diversity analyses. Any samples with fewer than that number of sequences will be discarded, and any samples with more than that number of sequences will be randomly subsampled (without replacement, i.e., *rarefied*) to contain that number of sequences. In this case, I'll choose 17, but that number is very study specific. **You do not want to use a sampling depth of 17 for the forensic study.** "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"FileLink(output_fp)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll next run [`core_diversity_analyses.py`](http://qiime.org/scripts/core_diversity_analyses.html), which runs several different diversity analysis commands. The primary parameters that will vary for this command are the even sampling depth (mentioned above) and the categories (or mapping file headers). Both of these are study specific. **Avoid passing categories that have more than about six different values in the mapping file, as these will take a very, very long time to run.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"even_sampling_depth = 17\n",
"output_dir = join(tiny_test_dir, \"cd_even%d\" % even_sampling_depth)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"!core_diversity_analyses.py -i $otu_table_fp -o $output_dir -t $tree_fp -e $even_sampling_depth -c \"SampleType,days_since_epoch\" -m $input_map_fp"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now view the output of our diversity analyses. The file of interest here is ``index.html``, so we'll link directly to that. Review the taxonomic summaries, the alpha diversity results, and the beta diversity results. For the forensic data set, you'll turn in the *Rarified BIOM table*, which is linked from the ``index.html``."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"cd_index = join(output_dir, \"index.html\")\n",
"FileLink(cd_index)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 2: Microbiome-based forensics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, you'll adapt the commands above to perform a more interesting analysis and answer some questions. The commands will be largely the same as above, but instead you'll use the following variables:\n",
"\n",
"```\n",
"forensic_dir\n",
"forensic_seqs\n",
"forensic_map\n",
"```\n",
"\n",
"For example, here is the ``validate_mapping_file.py`` command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"input_map_fp = forensic_map\n",
"output_dir = join(forensic_dir, 'vmf_out')\n",
"!validate_mapping_file.py -m $input_map_fp -o $output_dir"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"FileLinks(output_dir)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, run ``pick_open_reference_otus.py`` (hint: the reference sequences are the same as for the previous data set)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, run ``biom summarize-table``, and choose an even sampling depth (hint: it should probably be over 100)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, run ``core_diversity_analyses.py``, and be sure to set the ``even_sampling_depth`` variable before you execute the command. This may take about 20 minutes. **DO NOT pass mapping file headers that contain more than about six values in the column, or this will take much, much longer.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"even_sampling_depth = None\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 2: Questions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Answer all of the questions below in the notebook, and turn in a copy of your notebook with your rarefied BIOM table (linked from the ``core_diversity_analyses.py`` ``index.html`` file that you created (right click the link, and *Save as...*). All of these questions refer to the forensic data set, not the tiny test data set!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Question 1: What was the minimum number of sequences per sample? What was the maximum number of sequences per sample? What even sampling depth did you choose, and why?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Answer Question 1 in this cell.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Question 2: How long did ``pick_open_reference_otus.py`` take to run (to the second). Review the log file, and compute the run time from information in that file. How long did ``core_diversity_analyses.py`` take to run (again, to the second)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Answer Question 2 in this cell.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Question 3: The focus of the *Fierer 2010* paper was to show that it is possible to match an individual to the objects they touch based on the microbial communities that the individual leaves behind. The Unweighted UniFrac emperor plots (linked from the ``core_diversity_analyses.py`` ``index.html`` file) will allow you to figure out which subject (`M2`, `M3`, or `M9`) touched which keyboard (`K1`, `K2`, or `K3`). Match the individuals to the keyboard they touched, and explain how you came to this answer. There is one right answer to this question."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Answer Question 3 in this cell.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To turn in:\n",
"\n",
"* Gzipped *Rarified BIOM table*, available in the ``core_diversity_analyses.py`` output.\n",
"* The IPython Notebook containing containing the full list of commands that you ran to generate the above data, including answers to the questions in Part 2, noting any problems that you ran into along the way"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 3: Interpreting and reporting microbiome results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This section will focus on working with the results generated by the ``core_diversity_analyses.py`` command that you ran above.\n",
"\n",
"For this section, you will write a 2.5 to 3 page paper describing your analysis. **Your paper should not be any shorter than 2.5 pages nor any longer than 3 pages. It must have 1.5 line spacing, 1.25\" margins, and be written in 12 point Times New Roman font. Figures and tables are included in the page count, though the total space taken by these should be a maximum of one page.**\n",
"\n",
"Write this as if you’re submitting to a journal, so your paper should contain:\n",
"\n",
"* A title and list of authors (just you, in this case). This should be in no more than 14 point font, and follow the margin/spacing requirements listed above.\n",
"* An *Introduction* section describing the goals of the study (you can refer to [Fierer et al (2010)](http://www.pnas.org/content/early/2010/03/01/1000162107.full.pdf));\n",
"* A *Methods* section containing a brief description of your bioinformatics methods (e.g., what version of QIIME, what type of OTU picking was used) and how the data was generated (e.g., sequencing platform);\n",
"* And a *Results and Conclusions* section describing the results of your analysis. \n",
"\n",
"You should address several specific questions in your *Results and Conclusions*:\n",
"\n",
"* Which keyboard belonged to each of the human subjects? Back this up with an Unweighted UniFrac PCoA figure that supports your answer, as well as the beta diversity boxplots that support your answer. Describe what each of these plots tells you that supports your answer. See *Beta diversity results* in the ``core_diversity_analyses.py`` output.\n",
"* Which bacterial phyla appear to be most useful for matching each individual to their keyboard? See *Taxonomic summary results* in the ``core_diversity_analyses.py`` output.\n",
"* Which ten OTUs were most significantly different across the subjects (see *Group significance results* in the ``core_diversity_analyses.py`` output). Present a table containing these ten OTUs and their taxonomy. \n",
"* Which subjects had the most OTUs, on average? Was the number of OTUs across subjects significantly different? See *Alpha diversity results* in the ``core_diversity_analyses.py`` output. Include a plot that illustrates the number of OTUs observed for each subject."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.5"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment