Skip to content

Instantly share code, notes, and snippets.

@gregcaporaso
Last active August 29, 2015 14:11
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save gregcaporaso/24defc02cfbc9f9411c5 to your computer and use it in GitHub Desktop.
Save gregcaporaso/24defc02cfbc9f9411c5 to your computer and use it in GitHub Desktop.
Comparison of using GG 13_8 versus Greengenes core set as the default reference for PyNAST
Display the source blob
Display the rendered blob
Raw
{
"metadata": {
"name": "",
"signature": "sha256:c22a2a0abde493467a7776a38e441147887f691d16cc5430d2be4353a1612930"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "code",
"collapsed": false,
"input": [
"from qiime.test import write_test_data\n",
"\n",
"# A random 1% of the representative sequences from a de novo OTU\n",
"# picking run on the Moving Pictures of the Human Microbiome dataset\n",
"# (this is biased toward human microbiome OTUs)\n",
"input_seqs_fp = \"moving-picture-rep-seqs.01.fna\"\n",
"\n",
"# sequences derived from my taxonomic assigner comparison:\n",
"# https://github.com/gregcaporaso/short-read-tax-assignment\n",
"# this is a randomly selected subset of GG 13_8 97% OTUs, sliced to the \n",
"# 515F/806R amplicon region. (this is limited in practical applicability \n",
"# because they are known and trusted reference sequences)\n",
"input_seqs_fp = \"/Users/caporaso/code/short-read-tax-assignment/data/simulated-community/B1/rep_set.fna\"\n",
"!rm -r core-set-aligned gg-aligned\n",
"!cp $input_seqs_fp seqs\n"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"rm: core-set-aligned: No such file or directory\r\n",
"rm: gg-aligned: No such file or directory\r\n"
]
},
{
"output_type": "stream",
"stream": "stderr",
"text": [
"/Users/caporaso/.virtualenvs/qiime/lib/python2.7/site-packages/matplotlib/__init__.py:1155: UserWarning: This call to matplotlib.use() has no effect\n",
"because the backend has already been chosen;\n",
"matplotlib.use() must be called *before* pylab, matplotlib.pyplot,\n",
"or matplotlib.backends is imported for the first time.\n",
"\n",
" warnings.warn(_use_error_msg)\n"
]
}
],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run PyNAST, first using the core set and then using the Greengenes 13_8 85% reference alignment as the template. I picked 85% as it is most similar in number of sequences to the core set, and it doesn't perform much differently than 82% or 88% (which I tested with this same workflow). "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!align_seqs.py -t /Users/caporaso/data/greengenes_core_sets/core_set_aligned.fasta.imputed -m pynast -o core-set-aligned -i seqs"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!align_seqs.py -t /Users/caporaso/data/gg_13_8_otus/rep_set_aligned/85_otus.fasta -o gg-aligned -i seqs"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compute the md5sum for each of the resulting alignments. This indicates that there are differences in the resulting alignments."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!md5 core-set-aligned/seqs_aligned.fasta gg-aligned/seqs_aligned.fasta"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"MD5 (core-set-aligned/seqs_aligned.fasta) = a41f90e7faa0805c4402ba25147db587\r\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"MD5 (gg-aligned/seqs_aligned.fasta) = 2a5baa9e0f17e1e0527ee289e51767cb\r\n"
]
}
],
"prompt_number": 4
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from skbio import Alignment\n",
"a1 = Alignment.read('core-set-aligned/seqs_aligned.fasta')\n",
"a2 = Alignment.read('gg-aligned/seqs_aligned.fasta')\n",
"\n",
"a1_ids = set(a1.ids())\n",
"a2_ids = set(a2.ids())\n",
"common_ids = a1_ids & a2_ids\n",
"\n",
"print \"Number of sequences retained in the core-set-aligned data, but dropped in the GG-aligned data:\"\n",
"print len(a1_ids - a2_ids), a1_ids - a2_ids\n",
"print \"Number of sequences retained in the GG-aligned data, but dropped in the core-set-aligneddata:\"\n",
"print len(a2_ids - a1_ids), a2_ids - a1_ids\n",
"print \"Number of sequences that are in both alignments:\"\n",
"print len(common_ids)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Number of sequences retained in the core-set-aligned data, but dropped in the GG-aligned data:\n",
"114 set(['1138415', '620458', '787185', '4422781', '1133875', '1139993', '70116', '1137286', '805909', '580539', '637375', '4334501', '316473', '2376256', '241454', '4466525', '231442', '1123821', '294168', '1931229', '4351546', '112266', '1128568', '1147143', '1123941', '144181', '4395261', '1123374', '4419449', '1141371', '320667', '163153', '542199', '104706', '726766', '1123693', '4390557', '4413535', '575851', '262538', '4352288', '4434861', '185692', '146335', '3190394', '165845', '250116', '32422', '1138080', '252674', '181083', '168712', '834883', '104391', '328080', '140938', '161026', '4433573', '231374', '4370558', '1132594', '4341657', '805753', '1136495', '224123', '1139850', '3732401', '535932', '1148214', '612302', '4420556', '4431104', '4461578', '1140108', '526468', '4442702', '4409907', '835896', '170412', '3136117', '330263', '4299531', '4311115', '532740', '557526', '1137157', '1121115', '4423597', '4354640', '4385539', '812471', '3269889', '524865', '1134236', '322278', '4314389', '2772147', '93366', '4314942', '4459226', '4471196', '168172', '70586', '587064', '2209962', '560746', '563367', '571277', '4434859', '533634', '1141904', '620479', '1147974', '1113318'])\n",
"Number of sequences retained in the GG-aligned data, but dropped in the core-set-aligneddata:\n",
"8 set(['4478591', '3836963', '951205', '669760', '3770699', '1638796', '1742076', '3383796'])\n",
"Number of sequences that are in both alignments:\n",
"9780\n"
]
}
],
"prompt_number": 5
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"a1 = a1.subalignment(seqs_to_keep=common_ids)\n",
"a2 = a2.subalignment(seqs_to_keep=common_ids)\n",
"a1.write('core-set-aligned/seqs_aligned.fasta')\n",
"a2.write('gg-aligned/seqs_aligned.fasta')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Apply the Lane mask (i.e., static entropy) filtering and gap filtering to the alignment."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!filter_alignment.py -i core-set-aligned/seqs_aligned.fasta -o core-set-aligned/"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 7
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!filter_alignment.py -i gg-aligned/seqs_aligned.fasta -o gg-aligned/"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compute the md5sum for each of the resulting filtered alignments. This indicates that there are still differences in the resulting alignments."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!md5 core-set-aligned/seqs_aligned_pfiltered.fasta gg-aligned/seqs_aligned_pfiltered.fasta "
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"MD5 (core-set-aligned/seqs_aligned_pfiltered.fasta) = 1c261ab57bdfefaa1a99fce00a365972\r\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"MD5 (gg-aligned/seqs_aligned_pfiltered.fasta) = bda31860b7b4f391b302ea06e871152d\r\n"
]
}
],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Build phylogenetic trees with FastTree from the filtered alignments."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!make_phylogeny.py -i core-set-aligned/seqs_aligned_pfiltered.fasta -o core-set-aligned/seqs_aligned_pfiltered.tre"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 10
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!make_phylogeny.py -i gg-aligned/seqs_aligned_pfiltered.fasta -o gg-aligned/seqs_aligned_pfiltered.tre"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the trees into skbio.TreeNode objects, compute DistanceMatrices of tip-to-tip distances for each, and test whether those are correlated with a Mantel test. The fairly high Mantel test statistic, and significant p-value indicate that while the alignments may be different, they result in the similar phylogenetic trees. (Note that we don't know which of these trees is better based on this test). "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from skbio import TreeNode\n",
"from skbio.stats.distance import mantel\n",
"\n",
"t1 = TreeNode.read('core-set-aligned/seqs_aligned_pfiltered.tre')\n",
"t2 = TreeNode.read('gg-aligned/seqs_aligned_pfiltered.tre')\n",
"\n",
"mantel(t1.tip_tip_distances(), t2.tip_tip_distances(), strict=False)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 12,
"text": [
"(0.9121502058163361, 0.001, 9780)"
]
}
],
"prompt_number": 12
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 12
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 12
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment