gregcaporaso/pynast-reference-comparison.ipynb

## pynast-reference-comparison.ipynb
{
 "metadata": {
  "name": "",
  "signature": "sha256:c22a2a0abde493467a7776a38e441147887f691d16cc5430d2be4353a1612930"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from qiime.test import write_test_data\n",
      "\n",
      "# A random 1% of the representative sequences from a de novo OTU\n",
      "# picking run on the Moving Pictures of the Human Microbiome dataset\n",
      "# (this is biased toward human microbiome OTUs)\n",
      "input_seqs_fp = \"moving-picture-rep-seqs.01.fna\"\n",
      "\n",
      "# sequences derived from my taxonomic assigner comparison:\n",
      "# https://github.com/gregcaporaso/short-read-tax-assignment\n",
      "# this is a randomly selected subset of GG 13_8 97% OTUs, sliced to the \n",
      "# 515F/806R amplicon region. (this is limited in practical applicability \n",
      "# because they are known and trusted reference sequences)\n",
      "input_seqs_fp = \"/Users/caporaso/code/short-read-tax-assignment/data/simulated-community/B1/rep_set.fna\"\n",
      "!rm -r core-set-aligned gg-aligned\n",
      "!cp $input_seqs_fp seqs\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "rm: core-set-aligned: No such file or directory\r\n",
        "rm: gg-aligned: No such file or directory\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stderr",
       "text": [
        "/Users/caporaso/.virtualenvs/qiime/lib/python2.7/site-packages/matplotlib/__init__.py:1155: UserWarning:  This call to matplotlib.use() has no effect\n",
        "because the backend has already been chosen;\n",
        "matplotlib.use() must be called *before* pylab, matplotlib.pyplot,\n",
        "or matplotlib.backends is imported for the first time.\n",
        "\n",
        "  warnings.warn(_use_error_msg)\n"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Run PyNAST, first using the core set and then using the Greengenes 13_8 85% reference alignment as the template. I picked 85% as it is most similar in number of sequences to the core set, and it doesn't perform much differently than 82% or 88% (which I tested with this same workflow). "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!align_seqs.py -t /Users/caporaso/data/greengenes_core_sets/core_set_aligned.fasta.imputed -m pynast -o core-set-aligned -i seqs"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!align_seqs.py -t /Users/caporaso/data/gg_13_8_otus/rep_set_aligned/85_otus.fasta -o gg-aligned -i seqs"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Compute the md5sum for each of the resulting alignments. This indicates that there are differences in the resulting alignments."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!md5 core-set-aligned/seqs_aligned.fasta gg-aligned/seqs_aligned.fasta"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "MD5 (core-set-aligned/seqs_aligned.fasta) = a41f90e7faa0805c4402ba25147db587\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "MD5 (gg-aligned/seqs_aligned.fasta) = 2a5baa9e0f17e1e0527ee289e51767cb\r\n"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from skbio import Alignment\n",
      "a1 = Alignment.read('core-set-aligned/seqs_aligned.fasta')\n",
      "a2 = Alignment.read('gg-aligned/seqs_aligned.fasta')\n",
      "\n",
      "a1_ids = set(a1.ids())\n",
      "a2_ids = set(a2.ids())\n",
      "common_ids = a1_ids & a2_ids\n",
      "\n",
      "print \"Number of sequences retained in the core-set-aligned data, but dropped in the GG-aligned data:\"\n",
      "print len(a1_ids - a2_ids), a1_ids - a2_ids\n",
      "print \"Number of sequences retained in the GG-aligned data, but dropped in the core-set-aligneddata:\"\n",
      "print len(a2_ids - a1_ids), a2_ids - a1_ids\n",
      "print \"Number of sequences that are in both alignments:\"\n",
      "print len(common_ids)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Number of sequences retained in the core-set-aligned data, but dropped in the GG-aligned data:\n",
        "114 set(['1138415', '620458', '787185', '4422781', '1133875', '1139993', '70116', '1137286', '805909', '580539', '637375', '4334501', '316473', '2376256', '241454', '4466525', '231442', '1123821', '294168', '1931229', '4351546', '112266', '1128568', '1147143', '1123941', '144181', '4395261', '1123374', '4419449', '1141371', '320667', '163153', '542199', '104706', '726766', '1123693', '4390557', '4413535', '575851', '262538', '4352288', '4434861', '185692', '146335', '3190394', '165845', '250116', '32422', '1138080', '252674', '181083', '168712', '834883', '104391', '328080', '140938', '161026', '4433573', '231374', '4370558', '1132594', '4341657', '805753', '1136495', '224123', '1139850', '3732401', '535932', '1148214', '612302', '4420556', '4431104', '4461578', '1140108', '526468', '4442702', '4409907', '835896', '170412', '3136117', '330263', '4299531', '4311115', '532740', '557526', '1137157', '1121115', '4423597', '4354640', '4385539', '812471', '3269889', '524865', '1134236', '322278', '4314389', '2772147', '93366', '4314942', '4459226', '4471196', '168172', '70586', '587064', '2209962', '560746', '563367', '571277', '4434859', '533634', '1141904', '620479', '1147974', '1113318'])\n",
        "Number of sequences retained in the GG-aligned data, but dropped in the core-set-aligneddata:\n",
        "8 set(['4478591', '3836963', '951205', '669760', '3770699', '1638796', '1742076', '3383796'])\n",
        "Number of sequences that are in both alignments:\n",
        "9780\n"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "a1 = a1.subalignment(seqs_to_keep=common_ids)\n",
      "a2 = a2.subalignment(seqs_to_keep=common_ids)\n",
      "a1.write('core-set-aligned/seqs_aligned.fasta')\n",
      "a2.write('gg-aligned/seqs_aligned.fasta')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Apply the Lane mask (i.e., static entropy) filtering and gap filtering to the alignment."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!filter_alignment.py -i core-set-aligned/seqs_aligned.fasta -o core-set-aligned/"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 7
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!filter_alignment.py -i gg-aligned/seqs_aligned.fasta -o gg-aligned/"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Compute the md5sum for each of the resulting filtered alignments. This indicates that there are still differences in the resulting alignments."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!md5 core-set-aligned/seqs_aligned_pfiltered.fasta  gg-aligned/seqs_aligned_pfiltered.fasta "
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "MD5 (core-set-aligned/seqs_aligned_pfiltered.fasta) = 1c261ab57bdfefaa1a99fce00a365972\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "MD5 (gg-aligned/seqs_aligned_pfiltered.fasta) = bda31860b7b4f391b302ea06e871152d\r\n"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Build phylogenetic trees with FastTree from the filtered alignments."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!make_phylogeny.py -i core-set-aligned/seqs_aligned_pfiltered.fasta -o core-set-aligned/seqs_aligned_pfiltered.tre"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 10
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!make_phylogeny.py -i gg-aligned/seqs_aligned_pfiltered.fasta -o gg-aligned/seqs_aligned_pfiltered.tre"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Load the trees into skbio.TreeNode objects, compute DistanceMatrices of tip-to-tip distances for each, and test whether those are correlated with a Mantel test. The fairly high Mantel test statistic, and significant p-value indicate that while the alignments may be different, they result in the similar phylogenetic trees. (Note that we don't know which of these trees is better based on this test). "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from skbio import TreeNode\n",
      "from skbio.stats.distance import mantel\n",
      "\n",
      "t1 = TreeNode.read('core-set-aligned/seqs_aligned_pfiltered.tre')\n",
      "t2 = TreeNode.read('gg-aligned/seqs_aligned_pfiltered.tre')\n",
      "\n",
      "mantel(t1.tip_tip_distances(), t2.tip_tip_distances(), strict=False)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 12,
       "text": [
        "(0.9121502058163361, 0.001, 9780)"
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 12
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 12
    }
   ],
   "metadata": {}
  }
 ]
}
	{
	"metadata": {
	"name": "",
	"signature": "sha256:c22a2a0abde493467a7776a38e441147887f691d16cc5430d2be4353a1612930"
	},
	"nbformat": 3,
	"nbformat_minor": 0,
	"worksheets": [
	{
	"cells": [
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"from qiime.test import write_test_data\n",
	"\n",
	"# A random 1% of the representative sequences from a de novo OTU\n",
	"# picking run on the Moving Pictures of the Human Microbiome dataset\n",
	"# (this is biased toward human microbiome OTUs)\n",
	"input_seqs_fp = \"moving-picture-rep-seqs.01.fna\"\n",
	"\n",
	"# sequences derived from my taxonomic assigner comparison:\n",
	"# https://github.com/gregcaporaso/short-read-tax-assignment\n",
	"# this is a randomly selected subset of GG 13_8 97% OTUs, sliced to the \n",
	"# 515F/806R amplicon region. (this is limited in practical applicability \n",
	"# because they are known and trusted reference sequences)\n",
	"input_seqs_fp = \"/Users/caporaso/code/short-read-tax-assignment/data/simulated-community/B1/rep_set.fna\"\n",
	"!rm -r core-set-aligned gg-aligned\n",
	"!cp $input_seqs_fp seqs\n"
	],
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"rm: core-set-aligned: No such file or directory\r\n",
	"rm: gg-aligned: No such file or directory\r\n"
	]
	},
	{
	"output_type": "stream",
	"stream": "stderr",
	"text": [
	"/Users/caporaso/.virtualenvs/qiime/lib/python2.7/site-packages/matplotlib/__init__.py:1155: UserWarning: This call to matplotlib.use() has no effect\n",
	"because the backend has already been chosen;\n",
	"matplotlib.use() must be called before pylab, matplotlib.pyplot,\n",
	"or matplotlib.backends is imported for the first time.\n",
	"\n",
	" warnings.warn(_use_error_msg)\n"
	]
	}
	],
	"prompt_number": 1
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Run PyNAST, first using the core set and then using the Greengenes 13_8 85% reference alignment as the template. I picked 85% as it is most similar in number of sequences to the core set, and it doesn't perform much differently than 82% or 88% (which I tested with this same workflow). "
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"!align_seqs.py -t /Users/caporaso/data/greengenes_core_sets/core_set_aligned.fasta.imputed -m pynast -o core-set-aligned -i seqs"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 2
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"!align_seqs.py -t /Users/caporaso/data/gg_13_8_otus/rep_set_aligned/85_otus.fasta -o gg-aligned -i seqs"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 3
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Compute the md5sum for each of the resulting alignments. This indicates that there are differences in the resulting alignments."
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"!md5 core-set-aligned/seqs_aligned.fasta gg-aligned/seqs_aligned.fasta"
	],
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"MD5 (core-set-aligned/seqs_aligned.fasta) = a41f90e7faa0805c4402ba25147db587\r\n"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"MD5 (gg-aligned/seqs_aligned.fasta) = 2a5baa9e0f17e1e0527ee289e51767cb\r\n"
	]
	}
	],
	"prompt_number": 4
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"from skbio import Alignment\n",
	"a1 = Alignment.read('core-set-aligned/seqs_aligned.fasta')\n",
	"a2 = Alignment.read('gg-aligned/seqs_aligned.fasta')\n",
	"\n",
	"a1_ids = set(a1.ids())\n",
	"a2_ids = set(a2.ids())\n",
	"common_ids = a1_ids & a2_ids\n",
	"\n",
	"print \"Number of sequences retained in the core-set-aligned data, but dropped in the GG-aligned data:\"\n",
	"print len(a1_ids - a2_ids), a1_ids - a2_ids\n",
	"print \"Number of sequences retained in the GG-aligned data, but dropped in the core-set-aligneddata:\"\n",
	"print len(a2_ids - a1_ids), a2_ids - a1_ids\n",
	"print \"Number of sequences that are in both alignments:\"\n",
	"print len(common_ids)"
	],
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"Number of sequences retained in the core-set-aligned data, but dropped in the GG-aligned data:\n",
	"114 set(['1138415', '620458', '787185', '4422781', '1133875', '1139993', '70116', '1137286', '805909', '580539', '637375', '4334501', '316473', '2376256', '241454', '4466525', '231442', '1123821', '294168', '1931229', '4351546', '112266', '1128568', '1147143', '1123941', '144181', '4395261', '1123374', '4419449', '1141371', '320667', '163153', '542199', '104706', '726766', '1123693', '4390557', '4413535', '575851', '262538', '4352288', '4434861', '185692', '146335', '3190394', '165845', '250116', '32422', '1138080', '252674', '181083', '168712', '834883', '104391', '328080', '140938', '161026', '4433573', '231374', '4370558', '1132594', '4341657', '805753', '1136495', '224123', '1139850', '3732401', '535932', '1148214', '612302', '4420556', '4431104', '4461578', '1140108', '526468', '4442702', '4409907', '835896', '170412', '3136117', '330263', '4299531', '4311115', '532740', '557526', '1137157', '1121115', '4423597', '4354640', '4385539', '812471', '3269889', '524865', '1134236', '322278', '4314389', '2772147', '93366', '4314942', '4459226', '4471196', '168172', '70586', '587064', '2209962', '560746', '563367', '571277', '4434859', '533634', '1141904', '620479', '1147974', '1113318'])\n",
	"Number of sequences retained in the GG-aligned data, but dropped in the core-set-aligneddata:\n",
	"8 set(['4478591', '3836963', '951205', '669760', '3770699', '1638796', '1742076', '3383796'])\n",
	"Number of sequences that are in both alignments:\n",
	"9780\n"
	]
	}
	],
	"prompt_number": 5
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"a1 = a1.subalignment(seqs_to_keep=common_ids)\n",
	"a2 = a2.subalignment(seqs_to_keep=common_ids)\n",
	"a1.write('core-set-aligned/seqs_aligned.fasta')\n",
	"a2.write('gg-aligned/seqs_aligned.fasta')"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 6
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Apply the Lane mask (i.e., static entropy) filtering and gap filtering to the alignment."
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"!filter_alignment.py -i core-set-aligned/seqs_aligned.fasta -o core-set-aligned/"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 7
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"!filter_alignment.py -i gg-aligned/seqs_aligned.fasta -o gg-aligned/"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 8
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Compute the md5sum for each of the resulting filtered alignments. This indicates that there are still differences in the resulting alignments."
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"!md5 core-set-aligned/seqs_aligned_pfiltered.fasta gg-aligned/seqs_aligned_pfiltered.fasta "
	],
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"MD5 (core-set-aligned/seqs_aligned_pfiltered.fasta) = 1c261ab57bdfefaa1a99fce00a365972\r\n"
	]
	},
	{
	"output_type": "stream",
	"stream": "stdout",
	"text": [
	"MD5 (gg-aligned/seqs_aligned_pfiltered.fasta) = bda31860b7b4f391b302ea06e871152d\r\n"
	]
	}
	],
	"prompt_number": 9
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Build phylogenetic trees with FastTree from the filtered alignments."
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"!make_phylogeny.py -i core-set-aligned/seqs_aligned_pfiltered.fasta -o core-set-aligned/seqs_aligned_pfiltered.tre"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 10
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"!make_phylogeny.py -i gg-aligned/seqs_aligned_pfiltered.fasta -o gg-aligned/seqs_aligned_pfiltered.tre"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 11
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Load the trees into skbio.TreeNode objects, compute DistanceMatrices of tip-to-tip distances for each, and test whether those are correlated with a Mantel test. The fairly high Mantel test statistic, and significant p-value indicate that while the alignments may be different, they result in the similar phylogenetic trees. (Note that we don't know which of these trees is better based on this test). "
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"from skbio import TreeNode\n",
	"from skbio.stats.distance import mantel\n",
	"\n",
	"t1 = TreeNode.read('core-set-aligned/seqs_aligned_pfiltered.tre')\n",
	"t2 = TreeNode.read('gg-aligned/seqs_aligned_pfiltered.tre')\n",
	"\n",
	"mantel(t1.tip_tip_distances(), t2.tip_tip_distances(), strict=False)"
	],
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"metadata": {},
	"output_type": "pyout",
	"prompt_number": 12,
	"text": [
	"(0.9121502058163361, 0.001, 9780)"
	]
	}
	],
	"prompt_number": 12
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 12
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 12
	}
	],
	"metadata": {}
	}
	]
	}