@BenKaehler
Created September 20, 2021 07:55
Build a SILVA tree using FastTree
{
"cells": [
{
"cell_type": "markdown",
"id": "1195465f",
"metadata": {},
"source": [
"# Build a SILVA tree using FastTree\n",
"\n",
"This script can\n",
"\n",
"1. Download the SILVA 138 NR99 alignment.\n",
"1. Filter columns to reduce the number of gaps.\n",
"1. Build a tree using the resulting alignment.\n",
"\n",
"The goal is to build a tree that matches the SILVA 138 QIIME 2 reference data sets by closely following the procedure used to build the [SILVA 128 tree for SEPP](https://github.com/smirarab/sepp-refs/blob/master/silva/README.md).\n",
"\n",
"So start by downloading the aligned SILVA 138 SSU NR99 sequences (a 1.1 GB file)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e350e321",
"metadata": {},
"outputs": [],
"source": [
"!wget https://www.arb-silva.de/fileadmin/silva_databases/release_138/Exports/SILVA_138_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz"
]
},
{
"cell_type": "markdown",
"id": "3f232779",
"metadata": {},
"source": [
"Also grab the QIIME 2 SILVA 138 reference sequences to check that the labels match, at least."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a777f704",
"metadata": {},
"outputs": [],
"source": [
"!wget https://data.qiime2.org/2021.4/common/silva-138-99-seqs.qza"
]
},
{
"cell_type": "markdown",
"id": "532efac4",
"metadata": {},
"source": [
"We're going to do this in Python because the other tools I used on my laptop were too slow."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "46ea3361",
"metadata": {},
"outputs": [],
"source": [
"from qiime2 import Artifact, __version__\n",
"import skbio\n",
"from q2_types.feature_data import DNAIterator\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "f97c5367",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Using QIIME 2 2021.4.0\n"
]
}
],
"source": [
"print(f'Using QIIME 2 {__version__}')"
]
},
{
"cell_type": "markdown",
"id": "08e35c81",
"metadata": {},
"source": [
"## Import the SILVA alignment\n",
"\n",
"Figure out what labels are present in the QIIME 2 reference."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a8868e7a",
"metadata": {},
"outputs": [],
"source": [
"qiime_seqs = Artifact.load('silva-138-99-seqs.qza')\n",
"qiime_labels = {s.metadata['id'] for s in qiime_seqs.view(DNAIterator)}"
]
},
{
"cell_type": "markdown",
"id": "cf8e5f1e",
"metadata": {},
"source": [
"Convert the SILVA sequences from RNA to DNA and import them into a QIIME 2 Artifact, keeping only the sequences that appear in the QIIME 2 reference set."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "6513dac1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'silva-138-99-aln.qza'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"silva_aln = skbio.io.read('SILVA_138_SSURef_NR99_tax_silva_full_align_trunc.fasta.gz', format='fasta')\n",
"silva_labels = []\n",
"def converter():\n",
"    for seq in silva_aln:\n",
"        # SILVA distributes RNA; reverse transcribe to DNA\n",
"        seq = skbio.RNA(seq).reverse_transcribe()\n",
"        # keep only sequences that appear in the QIIME 2 reference set\n",
"        if seq.metadata['id'] not in qiime_labels:\n",
"            continue\n",
"        silva_labels.append(seq.metadata['id'])\n",
"        del seq.metadata['description']\n",
"        yield seq\n",
"silva_aln = skbio.alignment.TabularMSA(converter())\n",
"silva_aln = Artifact.import_data('FeatureData[AlignedSequence]', silva_aln)\n",
"silva_aln.save('silva-138-99-aln.qza')"
]
},
{
"cell_type": "markdown",
"id": "d8c8a590",
"metadata": {},
"source": [
"Check that the SILVA and QIIME SILVA labels are the same, at least."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "84ec44ea",
"metadata": {},
"outputs": [],
"source": [
"assert set(silva_labels) == qiime_labels, \"QIIME SILVA and SILVA labels don't match\""
]
},
{
"cell_type": "markdown",
"id": "7db18bca",
"metadata": {},
"source": [
"Memory is full; restart the kernel."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b6f70162",
"metadata": {},
"outputs": [],
"source": [
"# thanks https://stackoverflow.com/questions/37751120/restart-ipython-kernel-with-a-command-from-a-cell\n",
"import os\n",
"os._exit(00)"
]
},
{
"cell_type": "markdown",
"id": "a61c50b3",
"metadata": {},
"source": [
"## Mask the columns that are very gappy"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "f69157ab",
"metadata": {},
"outputs": [],
"source": [
"from qiime2 import Artifact\n",
"import skbio\n",
"from q2_types.feature_data import DNAIterator\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"id": "08a74572",
"metadata": {},
"source": [
"Read the aligned sequences into a NumPy array for fast access by column."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "867410c4",
"metadata": {},
"outputs": [],
"source": [
"silva_aln = Artifact.load('silva-138-99-aln.qza')\n",
"seq_array = []\n",
"for seq in silva_aln.view(DNAIterator):\n",
"    seq_array.append(seq.values)\n",
"seq_array = np.array(seq_array)"
]
},
{
"cell_type": "markdown",
"id": "34a3c14c",
"metadata": {},
"source": [
"Count the gaps in each column."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "13ce675a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done 0 columns\n",
"Done 10000 columns\n",
"Done 20000 columns\n",
"Done 30000 columns\n",
"Done 40000 columns\n"
]
}
],
"source": [
"num_gaps = np.empty(seq_array.shape[1])\n",
"for j in range(seq_array.shape[1]):\n",
"    num_gaps[j] = (seq_array[:,j] == b'.').sum()\n",
"    num_gaps[j] += (seq_array[:,j] == b'-').sum()\n",
"    if j % 10000 == 0:\n",
"        print(f'Done {j} columns')"
]
},
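{
"cell_type": "markdown",
"id": "0f1e2d3c",
"metadata": {},
"source": [
"As an aside, the column loop above can be vectorised. This is a sketch of an equivalent one-liner (assuming `seq_array` holds single-byte characters, as above); it was not used for the run recorded here:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1a2b3c4d",
"metadata": {},
"outputs": [],
"source": [
"# Count '.' and '-' gap characters per column in a single pass;\n",
"# equivalent to the explicit loop above (a sketch, not the recorded run).\n",
"num_gaps = np.isin(seq_array, [b'.', b'-']).sum(axis=0)"
]
},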
{
"cell_type": "markdown",
"id": "7f6b2067",
"metadata": {},
"source": [
"Find the less gappy columns: keep each column that has at least `num_ok` non-gap characters."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "720e05b2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Keeping 2349 columns that are less than 99.56% gaps\n"
]
}
],
"source": [
"num_ok = 1907\n",
"ix = num_gaps <= seq_array.shape[0] - num_ok\n",
"print(f'Keeping {ix.sum()} columns that are less than {100 - num_ok/seq_array.shape[0]*100:.2f}% gaps')"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "9b50cf84",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'silva-138-99-masked-aln.qza'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"masked = skbio.alignment.TabularMSA(s[ix] for s in silva_aln.view(DNAIterator))\n",
"masked = Artifact.import_data('FeatureData[AlignedSequence]', masked)\n",
"masked.save('silva-138-99-masked-aln.qza')"
]
},
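{
"cell_type": "markdown",
"id": "2b3c4d5e",
"metadata": {},
"source": [
"As a quick sanity check (a sketch; `ix` and `masked` are still in memory at this point), confirm that the masked alignment has the expected number of columns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3c4d5e6f",
"metadata": {},
"outputs": [],
"source": [
"# The masked alignment width should equal the number of kept columns.\n",
"first_seq = next(iter(masked.view(DNAIterator)))\n",
"assert len(first_seq) == ix.sum(), 'masked alignment width mismatch'"
]
},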
{
"cell_type": "markdown",
"id": "5b0ae923",
"metadata": {},
"source": [
"Memory is full; restart the kernel, again."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1ae3f27d",
"metadata": {},
"outputs": [],
"source": [
"# thanks https://stackoverflow.com/questions/37751120/restart-ipython-kernel-with-a-command-from-a-cell\n",
"import os\n",
"os._exit(00)"
]
},
{
"cell_type": "markdown",
"id": "36a8ca9e",
"metadata": {},
"source": [
"## Build a tree"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "02519a33",
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Running external command line application. This may print messages to stdout and/or stderr.\n",
"The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.\n",
"\n",
"Command: FastTreeMP -quote -nt /tmp/qiime2-archive-jwdjc355/53b48244-84d0-42ce-8db8-41fc6e44c4d6/data/aligned-dna-sequences.fasta\n",
"\n",
"FastTree Version 2.1.10 Double precision (No SSE3), OpenMP (11 threads)\n",
"Alignment: /tmp/qiime2-archive-jwdjc355/53b48244-84d0-42ce-8db8-41fc6e44c4d6/data/aligned-dna-sequences.fasta\n",
"Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000\n",
"Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1\n",
"TopHits: 1.00*sqrtN close=default refresh=0.80\n",
"ML Model: Jukes-Cantor, CAT approximation with 20 rate categories\n",
"Warning! Found \".\" character(s). These are treated as gaps\n",
"Ignored unknown character B (seen 116 times)\n",
"Ignored unknown character D (seen 70 times)\n",
"Ignored unknown character H (seen 104 times)\n",
"Ignored unknown character K (seen 2391 times)\n",
"Ignored unknown character M (seen 1824 times)\n",
"Ignored unknown character R (seen 5381 times)\n",
"Ignored unknown character S (seen 3113 times)\n",
"Ignored unknown character V (seen 90 times)\n",
"Ignored unknown character W (seen 1856 times)\n",
"Ignored unknown character X (seen 41223 times)\n",
"Ignored unknown character Y (seen 4899 times)\n",
"Initial topology in 16323.01 seconds\n",
"Refining topology: 75 rounds ME-NNIs, 2 rounds ME-SPRs, 37 rounds ML-NNIs\n",
"Total branch-length 8065.091 after 21601.61 sec\n",
"ML-NNI round 1: LogLk = -81445405.515 NNIs 112700 max delta 130.87 Time 23343.51\n",
"Switched to using 20 rate categories (CAT approximation)\n",
"Rate categories were divided by 2.018 so that average rate = 1.0\n",
"CAT-based log-likelihoods may not be comparable across runs\n",
"Use -gamma for approximate but comparable Gamma(20) log-likelihoods\n",
"ML-NNI round 2: LogLk = -73371346.033 NNIs 66381 max delta 116.22 Time 25367.09\n",
"ML-NNI round 3: LogLk = -73294484.312 NNIs 38116 max delta 71.17 Time 26389.31\n",
"ML-NNI round 4: LogLk = -73258517.258 NNIs 20876 max delta 91.24 Time 27130.19\n",
"ML-NNI round 5: LogLk = -73240131.639 NNIs 12375 max delta 120.25 Time 27601.05\n",
"ML-NNI round 6: LogLk = -73229939.999 NNIs 7989 max delta 75.15 Time 27907.18\n",
"ML-NNI round 7: LogLk = -73224066.579 NNIs 5287 max delta 101.87 Time 28110.66\n",
"ML-NNI round 8: LogLk = -73220425.117 NNIs 3512 max delta 66.91 Time 28251.52\n",
"ML-NNI round 9: LogLk = -73218153.284 NNIs 2364 max delta 91.36 Time 28352.08\n",
"ML-NNI round 10: LogLk = -73216388.792 NNIs 1622 max delta 136.10 Time 28428.05\n",
"ML-NNI round 11: LogLk = -73215346.756 NNIs 1175 max delta 91.78 Time 28485.85\n",
"ML-NNI round 12: LogLk = -73214458.588 NNIs 798 max delta 53.89 Time 28533.24\n",
"ML-NNI round 13: LogLk = -73213612.017 NNIs 671 max delta 132.09 Time 28574.27\n",
"ML-NNI round 14: LogLk = -73212960.148 NNIs 516 max delta 65.38 Time 28610.28\n",
"ML-NNI round 15: LogLk = -73212765.953 NNIs 396 max delta 27.35 Time 28642.19\n",
"ML-NNI round 16: LogLk = -73212534.279 NNIs 302 max delta 32.43 Time 28671.03\n",
"ML-NNI round 17: LogLk = -73212369.363 NNIs 299 max delta 21.13 Time 28698.11\n",
"ML-NNI round 18: LogLk = -73212261.401 NNIs 179 max delta 19.88 Time 28723.01\n",
"ML-NNI round 19: LogLk = -73212221.437 NNIs 166 max delta 12.08 Time 28747.46\n",
"ML-NNI round 20: LogLk = -73212146.780 NNIs 122 max delta 12.69 Time 28770.78\n",
"ML-NNI round 21: LogLk = -73212007.816 NNIs 92 max delta 43.22 Time 28793.34\n",
"ML-NNI round 22: LogLk = -73211953.871 NNIs 70 max delta 17.07 Time 28815.83\n",
"ML-NNI round 23: LogLk = -73211932.039 NNIs 71 max delta 5.84 Time 28838.10\n",
"ML-NNI round 24: LogLk = -73211916.012 NNIs 58 max delta 3.99 Time 28860.19\n",
"ML-NNI round 25: LogLk = -73211888.911 NNIs 49 max delta 7.92 Time 28882.06\n",
"ML-NNI round 26: LogLk = -73211858.272 NNIs 40 max delta 18.86 Time 28903.79\n",
"ML-NNI round 27: LogLk = -73211842.138 NNIs 41 max delta 4.75 Time 28925.55\n",
"ML-NNI round 28: LogLk = -73211815.615 NNIs 42 max delta 17.33 Time 28947.37\n",
"ML-NNI round 29: LogLk = -73211801.065 NNIs 37 max delta 9.52 Time 28969.24\n",
"ML-NNI round 30: LogLk = -73211783.503 NNIs 31 max delta 15.21 Time 28990.92\n",
"ML-NNI round 31: LogLk = -73211773.813 NNIs 10 max delta 9.01 Time 29012.13\n",
"ML-NNI round 32: LogLk = -73211754.508 NNIs 41 max delta 18.22 Time 29033.33\n",
"ML-NNI round 33: LogLk = -73211754.454 NNIs 13 max delta 0.00 Time 29054.49\n",
"Turning off heuristics for final round of ML NNIs (converged)\n",
"ML-NNI round 34: LogLk = -73203675.246 NNIs 6455 max delta 30.97 Time 30298.05 (final)\n",
"Optimize all lengths: LogLk = -73203091.650 Time 30693.62\n",
"Total time: 33239.79 seconds Unique: 430272/436680 Bad splits: 684/430269 Worst delta-LogLk 61.314\n",
"\u001b[32mSaved Phylogeny[Unrooted] to: silva-138-99-tree.qza\u001b[0m\n"
]
}
],
"source": [
"!qiime phylogeny fasttree --i-alignment silva-138-99-masked-aln.qza --o-tree silva-138-99-tree.qza --p-n-threads 11 --verbose"
]
},
{
"cell_type": "markdown",
"id": "771e30d1",
"metadata": {},
"source": [
"For completeness, create a midpoint-rooted version."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "06266927",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[32mSaved Phylogeny[Rooted] to: silva-138-99-rooted-tree.qza\u001b[0m\r\n"
]
}
],
"source": [
"!qiime phylogeny midpoint-root --i-tree silva-138-99-tree.qza --o-rooted-tree silva-138-99-rooted-tree.qza"
]
},
{
"cell_type": "markdown",
"id": "169a348b",
"metadata": {},
"source": [
"## Downloads\n",
"\n",
"Unrooted tree [here](https://github.com/BenKaehler/silva-trees/blob/main/silva-138-99-tree.qza) and rooted tree [here](https://github.com/BenKaehler/silva-trees/blob/main/silva-138-99-rooted-tree.qza).\n",
"\n",
"## Did it work?\n",
"If the objective was to create a tree, it worked. Subsample 1,000 tips and load the result into https://itol.embl.de/ for a visual check."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "9135fa88",
"metadata": {},
"outputs": [],
"source": [
"from skbio import TreeNode\n",
"from qiime2 import Artifact\n",
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "3c28b8c5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'silva-138-99-subtree-629.qza'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tree = Artifact.load('silva-138-99-rooted-tree.qza').view(TreeNode)\n",
"names = [n.name for n in tree.tips()]\n",
"keepers = set(np.random.choice(names, size=1000, replace=False))\n",
"while set(n.name for n in tree.tips()) != keepers:\n",
"    tree.remove_deleted(lambda n: n.is_tip() and n.name not in keepers)\n",
"    tree.prune()\n",
"tree = Artifact.import_data('Phylogeny[Rooted]', tree)\n",
"subnum = np.random.randint(1000)\n",
"tree.save(f'silva-138-99-subtree-{subnum}.qza')"
]
},
{
"cell_type": "markdown",
"id": "e3f46d36",
"metadata": {},
"source": [
"![](subtree-661.svg)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
@chiehchangchen

It would be helpful for building a SILVA SEPP reference!
However, I got an error at the step "Convert the SILVA sequences from RNA to DNA and import them into a QIIME Artifact".
I think the QIIME 2 pretrained classifier changed and that caused the error. Would you update the notebook to fix the error for the new QIIME 2 classifier?

File "/home/ccadmin/miniconda3/envs/qiime2-2023.7/lib/python3.8/site-packages/qiime2/core/transform.py", line 58, in make_transformation
raise Exception("No transformation from %r to %r" %
Exception: No transformation from <class 'q2_feature_classifier._taxonomic_classifier.TaxonomicClassiferTemporaryPickleDirFmt'> to <class 'q2_types.feature_data._transformer.DNAIterator'>

@BenKaehler
Author

Hi @chiehchangchen, that step works for me, sorry. I checked with QIIME 2 2023.7.0.

This script does not create or load a taxonomic classifier, so I would guess that you have accidentally downloaded the wrong file or have a classifier lying around from something else you're working on.

@chiehchangchen

@BenKaehler, thanks for your great instructions. I've followed your instructions and built the SILVA 138.1 tree and the masked aligned FASTA file successfully. However, I failed to build the SEPP reference.
I've tried both my SILVA 138.1 tree and your SILVA 138 tree with the corresponding masked-aln FASTA files, as described at https://github.com/smirarab/sepp-refs/blob/master/silva/README.md. I got the error "ERROR: Bad base (.) at site 2347 of sequence 1" when trying to resolve polytomies. Since I am a newbie, do you have any idea how to solve the problem?
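
The FastTree log above warned that it found "." characters and treated them as gaps, so one possible cause is that the downstream SEPP/RAxML tooling rejects "." as a gap character. A minimal, untested sketch of one way to rewrite "." gaps as "-" before retrying the sepp-refs procedure (the filenames in the usage comment are assumptions):

```python
def dots_to_dashes(fasta_lines):
    """Replace SILVA-style '.' gaps with '-' in sequence lines,
    leaving '>' header lines untouched."""
    for line in fasta_lines:
        yield line if line.startswith('>') else line.replace('.', '-')

# Usage sketch (filenames are assumptions):
# with open('masked-aln.fasta') as fin, open('masked-aln-dashes.fasta', 'w') as fout:
#     fout.writelines(dots_to_dashes(fin))
```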
