dwhswenson/what_is_nucleic.ipynb

## what_is_nucleic.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Which residues count as \"nucleic\"?\n",
    "\n",
    "It turns out this this question is a little more complicated that it would appear at first glance. The canonical PDB names for DNA and RNA residues are easy enough. But after that, we have two kinds of problems:\n",
    "\n",
    "* There are sometimes-used residue names for nucleoside variants that are not the official 3-letter names from the [Chemical Components Dictionary (CCD)](http://www.wwpdb.org/data/ccd). For example, `DA5` for adenine at the 5' end of DNA; this name is officially assigned to [another molecule](https://www.ebi.ac.uk/pdbe-srv/pdbechem/chemicalCompound/show/DA5).\n",
    "* There are residues that can play the role of nucleic acid linkers (i.e., can be part of the nucleic acid polymer) but are not the canonical bases (for example, methylated versions of bases). These can be identified by the correct listings in the CCD, which are not often included in definitions of an `is_nucleic` method.\n",
    "\n",
    "Below, I will describe ways to obtain all the relevant residues, and to provide mappings of them to the standard base abbreviations (either 3-letter or 1-letter).\n",
    "\n",
    "Most the the things here will be saved as `dict`s. The format I'm using maps the \"parent\" residue (standard DNA/RNA name, such as `DA` or `A`) to a list of its variants. This format is relatively easy to create and understand. I've added code to (a branch of) MDTraj to use this format to create the frozenset of residues, as well as the mapping of full name to the 1-letter codes. It also provides some flexibility to customize this mapping."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# these are the official PDB residues\n",
    "_DNA_RESIDUES = ['DA', 'DT', 'DC', 'DG', 'DI']\n",
    "_RNA_RESIDUES = ['A', 'U', 'C', 'G', 'I']\n",
    "_OFFICIAL_VARIANTS = {key: [key] for key in _DNA_RESIDUES + _RNA_RESIDUES}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we have several residue names that aren't officially correct. I know of two main varieties here: things that use internal Amber names for the bare (`xxN`), 5' terminal (`xx5`), and 3' terminal (`xx3`) residues, and things that are supposed to represent protonated residues. Some of these may conflict with names in the CCD."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "_AMBER_VARIANTS = {\n",
    "    'DA': ['DAN', 'DA5', 'DA3'],\n",
    "    'DT': ['DTN', 'DT5', 'DT3'],\n",
    "    'DC': ['DCN', 'DC5', 'DC3'],\n",
    "    'DG': ['DGN', 'DG5', 'DG3'],\n",
    "    'A': ['AN', 'A5', 'A3', 'RAN', 'RA5', 'RA3'],\n",
    "    'U': ['UN', 'U5', 'U3', 'RUN', 'RU5', 'RU3'],\n",
    "    'C': ['CN', 'C5', 'C3', 'RCN', 'RC5', 'RC3'],\n",
    "    'G': ['GN', 'G5', 'G3', 'RGN', 'RG5', 'RG3'],\n",
    "}\n",
    "\n",
    "# protonated forms originally listed in ParmEd\n",
    "_PROTONATED_VARIANTS = {\n",
    "    'DA': ['DAP'],\n",
    "    'DC': ['DCP']\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we dive into the big problem, which is that there are many molecules in the CCD that can be used in nucleic acid polymers, but aren't the standard nucleosides. The most important of these are probably the methylated variants that play an important role in epigenetic regulation. However, there are hundreds of such molecules.\n",
    "\n",
    "To run the following, you should download the `components.cif` file from the [CCD](http://www.wwpdb.org/data/ccd) and put it in the same directory as this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "awk_script = \"\"\"\n",
    "BEGIN { one_letter = \"\"; parent = \"\"; is_nucleic = 0 }\n",
    "/^data_/ { if (is_nucleic) { print one_letter, parent, last_data };\n",
    "    last_data = $0;\n",
    "    is_nucleic = 0;\n",
    "}\n",
    "/_chem_comp\\.mon_nstd_parent_comp_id/ { parent = $2 }\n",
    "/_chem_comp\\.one_letter_code/ { one_letter = $2 }\n",
    "/_chem_comp\\.type.*[RD]NA LINKING/ { is_nucleic = 1 }\n",
    "END { if (is_nucleic) { print one_letter, parent, last_data } }\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "quoted_awk = \"'\" + awk_script + \"'\"\n",
    "ccd_data = !awk {quoted_awk} components.cif"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import collections\n",
    "\n",
    "def get_ccd_aliases(ccd_data, disallow=[\"?\"]):\n",
    "    # default disallow ignores unknown parents\n",
    "    results = collections.defaultdict(list)\n",
    "    for line in ccd_data:\n",
    "        one_letter, parent, data_id = line.split()\n",
    "        if parent not in disallow:\n",
    "            results[parent].append(data_id[5:])\n",
    "    return results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "_CCD_VARIANTS = get_ccd_aliases(ccd_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At this point we have all the main data. The `*_VARIANTS` variables map canonical names to a list of variants. We can also create dicts that map the alias name to either its canonical (parent) name, or to its one letter abbreviation. These dictionaries are created by the `make_parents` and `make_one_letter` methods."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One important question is, how many variants do we find in the CCD? Here they are, sorted by the canonical (parent) residue name:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " DA  41\n",
      " DT  59\n",
      " DC  55\n",
      " DG  66\n",
      " DI   2\n",
      "  A  40\n",
      "  U  45\n",
      "  C  26\n",
      "  G  33\n",
      "  I   0\n"
     ]
    }
   ],
   "source": [
    "for parent in _DNA_RESIDUES + _RNA_RESIDUES:\n",
    "    print(\"{:>3} {:>3}\".format(parent, len(_CCD_VARIANTS[parent])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Modifying MDTraj's `is_nucleic`\n",
    "\n",
    "This currently requires my branch \"nucleic\" of MDTraj, which is the basis of the PR [mdtraj/mdtraj#1398](https://github.com/mdtraj/mdtraj/pull/1398). Currently, the `is_nucleic` functionality is not implemented at all in the core code.\n",
    "\n",
    "In addition to checking out that branch, you'll also need to download the file for PDB entry [486d](https://www.rcsb.org/structure/486d) and put that in the same directory as this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "import mdtraj as md\n",
    "from mdtraj.core.residue_names import residues_and_codes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "346\n"
     ]
    }
   ],
   "source": [
    "traj = md.load(\"486d.pdb\")\n",
    "print(traj.topology.n_residues)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "313\n"
     ]
    }
   ],
   "source": [
    "# count how many residues count as nucleic\n",
    "print(sum([res.is_nucleic for res in traj.topology.residues]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['UCCGUGAUAGUUAAGGCAGAAUCCAGAUGAAAUUGGAGAUGGGGCAAUUCCCCGUCGCGGAGCC',\n",
       " 'UUU',\n",
       " 'GCGGAUUUACUCAGGGGAGAGCCCCGCUGUCCGUGGGAGUCUGUGCGUCCACAGAAUUCGC',\n",
       " 'GUC',\n",
       " 'UCCGUGAUAGUUAAGGCAGAAUGGGCGCUGUCCGUGCCAGAUGGGGCAAUUCCCCGUCGCGGAG',\n",
       " 'CGUCACACCAUGGAAGGGGGUUGCAAAGAAAGGGUGGCCCCCUUCGGGGCGCCUAACUUUGUGAUUCCUAACUGGGGUGAAGUCG',\n",
       " 'CUGGGGAGUACGGCCGCAAGGCUGAAACUCAAA',\n",
       " '']"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "traj.topology.to_fasta()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The default behavior in this branch does not include the linkers defined in the CCD. So let's monkey-patch MDTraj to include the extra residues we want to identify as nucleic. The two variables we need to change are `_NUCLEIC_RESIDUES`, which is a list of the residues that count as nucleic, and `_NUCLEIC_ACID_CODES`, which maps the residue to its 1-letter abbreviation (allowing it to be used in FASTA representation.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "residues, codes = residues_and_codes([_OFFICIAL_VARIANTS,\n",
    "                                      _AMBER_VARIANTS,\n",
    "                                      _PROTONATED_VARIANTS,\n",
    "                                      _CCD_VARIANTS])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# monkey patching MDTraj to use other codes/residues\n",
    "md.core.topology._NUCLEIC_RESIDUES = residues\n",
    "md.core.topology._NUCLEIC_ACID_CODES = codes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I was a little surprised to see that GDP is included in the nucleic linking residues in the CCD. Neither GTP nor ADP are included."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'GDP' in residues"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'GTP' in residues"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'ADP' in residues"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's count up how many residues from this PDB entry are considered nucleic:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "342"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sum([res.is_nucleic for res in traj.topology.residues])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All but 4 residues are nucleic. Which ones are still not nucleic?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "IR1\n",
      "IR2\n",
      "IR3\n",
      "IR4\n"
     ]
    }
   ],
   "source": [
    "for res in traj.topology.residues:\n",
    "    if not res.is_nucleic:\n",
    "        print(res)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The FASTA strings are also changed for the chains that contained these residues. You can tell that some strings are longer, although it takes close inspection to see "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['UCCGUGAUAGUUUAAUGGUCAGAAUGCCAGACUGAAGAUCUGGAGAUCGGGGUUCAAUUCCCCGUCGCGGAGCC',\n",
       " 'UUU',\n",
       " 'GCGGAUUUAGCUCAGUUGGGAGAGCGCCCGCUUGUCGCGUGGGAGGUCCUGUGUUCGAUCCACAGAAUUCGC',\n",
       " 'GUC',\n",
       " 'UCCGUGAUAGUUUAAUGGUCAGAAUGGGCGCUUGUCGCGUGCCAGAUCGGGGUUCAAUUCCCCGUCGCGGAG',\n",
       " 'CGUCACACCAUGGAAGGGGGUUGCAAAGAAAGGGUGGCCCCCUUCGGGGCGCCUAACUUUGUGAUUCCUAACUGGGGUGAAGUCG',\n",
       " 'CUGGGGAGUACGGCCGCAAGGCUGAAACUCAAA',\n",
       " '']"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "traj.topology.to_fasta()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's say that we don't consider the residue [YYG](https://www.ebi.ac.uk/pdbe-srv/pdbechem/chemicalCompound/show/YYG) to be similar enough to guanosine to count for the letter `G` in the FASTA code. Instead, we'll use the generic `N` for nucleic acid."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['UCCGUGAUAGUUUAAUGGUCAGAAUGCCAGACUGAANAUCUGGAGAUCGGGGUUCAAUUCCCCGUCGCGGAGCC',\n",
       " 'UUU',\n",
       " 'GCGGAUUUAGCUCAGUUGGGAGAGCGCCCGCUUGUCGCGUGGGAGGUCCUGUGUUCGAUCCACAGAAUUCGC',\n",
       " 'GUC',\n",
       " 'UCCGUGAUAGUUUAAUGGUCAGAAUGGGCGCUUGUCGCGUGCCAGAUCGGGGUUCAAUUCCCCGUCGCGGAG',\n",
       " 'CGUCACACCAUGGAAGGGGGUUGCAAAGAAAGGGUGGCCCCCUUCGGGGCGCCUAACUUUGUGAUUCCUAACUGGGGUGAAGUCG',\n",
       " 'CUGGGGAGUACGGCCGCAAGGCUGAAACUCAAA',\n",
       " '']"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "residues, codes = residues_and_codes([_OFFICIAL_VARIANTS,\n",
    "                                      _AMBER_VARIANTS,\n",
    "                                      _PROTONATED_VARIANTS,\n",
    "                                      _CCD_VARIANTS],\n",
    "                                     code_dict={'YYG': 'N'})\n",
    "md.core.topology._NUCLEIC_RESIDUES = residues\n",
    "md.core.topology._NUCLEIC_ACID_CODES = codes\n",
    "traj.topology.to_fasta()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that sections sequence which was previously `GAAG` in the first chain is now `GAAN`. What if we wanted all the extra entries from the CCD to list as `N`?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['UCCGUGAUAGUUNAANGGNCAGAAUNCCAGANUGAANAUNUGGAGAUNGGGGNNCAAUUCCCCGUCGCGGAGCC',\n",
       " 'UUU',\n",
       " 'GCGGAUUUANCUCAGNNGGGAGAGCNCCCGCNUGUCNCGUGGGAGNUCNUGUGNNCGNUCCACAGAAUUCGC',\n",
       " 'GUC',\n",
       " 'UCCGUGAUAGUUNAANGGNCAGAAUGGGCGCNUGUCNCGUGCCAGAUNGGGGNNCAAUUCCCCGUCGCGGAG',\n",
       " 'CGUCACACCAUGGAAGGGGGUUGCAAAGAAAGGGUGGCCCCCUUCGGGGCGCCUAACUUUGUGAUUCCUAACUGGGGUGAAGUCG',\n",
       " 'CUGGGGAGUACGGCCGCAAGGCUGAAACUCAAA',\n",
       " '']"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_ccd_as_N = {alias: 'N'\n",
    "                for aliases in _CCD_VARIANTS.values()\n",
    "                for alias in aliases}\n",
    "residues, codes = residues_and_codes([_OFFICIAL_VARIANTS,\n",
    "                                      _AMBER_VARIANTS,\n",
    "                                      _PROTONATED_VARIANTS,\n",
    "                                      _CCD_VARIANTS],\n",
    "                                     code_dict=all_ccd_as_N)\n",
    "md.core.topology._NUCLEIC_RESIDUES = residues\n",
    "md.core.topology._NUCLEIC_ACID_CODES = codes\n",
    "traj.topology.to_fasta()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Which residues count as \"nucleic\"?\n",
	"\n",
	"It turns out this this question is a little more complicated that it would appear at first glance. The canonical PDB names for DNA and RNA residues are easy enough. But after that, we have two kinds of problems:\n",
	"\n",
	"* There are sometimes-used residue names for nucleoside variants that are not the official 3-letter names from the [Chemical Components Dictionary (CCD)](http://www.wwpdb.org/data/ccd). For example, `DA5` for adenine at the 5' end of DNA; this name is officially assigned to [another molecule](https://www.ebi.ac.uk/pdbe-srv/pdbechem/chemicalCompound/show/DA5).\n",
	"* There are residues that can play the role of nucleic acid linkers (i.e., can be part of the nucleic acid polymer) but are not the canonical bases (for example, methylated versions of bases). These can be identified by the correct listings in the CCD, which are not often included in definitions of an `is_nucleic` method.\n",
	"\n",
	"Below, I will describe ways to obtain all the relevant residues, and to provide mappings of them to the standard base abbreviations (either 3-letter or 1-letter).\n",
	"\n",
	"Most the the things here will be saved as `dict`s. The format I'm using maps the \"parent\" residue (standard DNA/RNA name, such as `DA` or `A`) to a list of its variants. This format is relatively easy to create and understand. I've added code to (a branch of) MDTraj to use this format to create the frozenset of residues, as well as the mapping of full name to the 1-letter codes. It also provides some flexibility to customize this mapping."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 1,
	"metadata": {},
	"outputs": [],
	"source": [
	"# these are the official PDB residues\n",
	"_DNA_RESIDUES = ['DA', 'DT', 'DC', 'DG', 'DI']\n",
	"_RNA_RESIDUES = ['A', 'U', 'C', 'G', 'I']\n",
	"_OFFICIAL_VARIANTS = {key: [key] for key in _DNA_RESIDUES + _RNA_RESIDUES}"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Next we have several residue names that aren't officially correct. I know of two main varieties here: things that use internal Amber names for the bare (`xxN`), 5' terminal (`xx5`), and 3' terminal (`xx3`) residues, and things that are supposed to represent protonated residues. Some of these may conflict with names in the CCD."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 2,
	"metadata": {},
	"outputs": [],
	"source": [
	"_AMBER_VARIANTS = {\n",
	" 'DA': ['DAN', 'DA5', 'DA3'],\n",
	" 'DT': ['DTN', 'DT5', 'DT3'],\n",
	" 'DC': ['DCN', 'DC5', 'DC3'],\n",
	" 'DG': ['DGN', 'DG5', 'DG3'],\n",
	" 'A': ['AN', 'A5', 'A3', 'RAN', 'RA5', 'RA3'],\n",
	" 'U': ['UN', 'U5', 'U3', 'RUN', 'RU5', 'RU3'],\n",
	" 'C': ['CN', 'C5', 'C3', 'RCN', 'RC5', 'RC3'],\n",
	" 'G': ['GN', 'G5', 'G3', 'RGN', 'RG5', 'RG3'],\n",
	"}\n",
	"\n",
	"# protonated forms originally listed in ParmEd\n",
	"_PROTONATED_VARIANTS = {\n",
	" 'DA': ['DAP'],\n",
	" 'DC': ['DCP']\n",
	"}"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Now we dive into the big problem, which is that there are many molecules in the CCD that can be used in nucleic acid polymers, but aren't the standard nucleosides. The most important of these are probably the methylated variants that play an important role in epigenetic regulation. However, there are hundreds of such molecules.\n",
	"\n",
	"To run the following, you should download the `components.cif` file from the [CCD](http://www.wwpdb.org/data/ccd) and put it in the same directory as this notebook."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 3,
	"metadata": {},
	"outputs": [],
	"source": [
	"awk_script = \"\"\"\n",
	"BEGIN { one_letter = \"\"; parent = \"\"; is_nucleic = 0 }\n",
	"/^data_/ { if (is_nucleic) { print one_letter, parent, last_data };\n",
	" last_data = $0;\n",
	" is_nucleic = 0;\n",
	"}\n",
	"/_chem_comp\\.mon_nstd_parent_comp_id/ { parent = $2 }\n",
	"/_chem_comp\\.one_letter_code/ { one_letter = $2 }\n",
	"/_chem_comp\\.type.*[RD]NA LINKING/ { is_nucleic = 1 }\n",
	"END { if (is_nucleic) { print one_letter, parent, last_data } }\n",
	"\"\"\""
	]
	},
	{
	"cell_type": "code",
	"execution_count": 4,
	"metadata": {},
	"outputs": [],
	"source": [
	"quoted_awk = \"'\" + awk_script + \"'\"\n",
	"ccd_data = !awk {quoted_awk} components.cif"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 5,
	"metadata": {},
	"outputs": [],
	"source": [
	"import collections\n",
	"\n",
	"def get_ccd_aliases(ccd_data, disallow=[\"?\"]):\n",
	" # default disallow ignores unknown parents\n",
	" results = collections.defaultdict(list)\n",
	" for line in ccd_data:\n",
	" one_letter, parent, data_id = line.split()\n",
	" if parent not in disallow:\n",
	" results[parent].append(data_id[5:])\n",
	" return results"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 6,
	"metadata": {},
	"outputs": [],
	"source": [
	"_CCD_VARIANTS = get_ccd_aliases(ccd_data)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"At this point we have all the main data. The `*_VARIANTS` variables map canonical names to a list of variants. We can also create dicts that map the alias name to either its canonical (parent) name, or to its one letter abbreviation. These dictionaries are created by the `make_parents` and `make_one_letter` methods."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"One important question is, how many variants do we find in the CCD? Here they are, sorted by the canonical (parent) residue name:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 7,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	" DA 41\n",
	" DT 59\n",
	" DC 55\n",
	" DG 66\n",
	" DI 2\n",
	" A 40\n",
	" U 45\n",
	" C 26\n",
	" G 33\n",
	" I 0\n"
	]
	}
	],
	"source": [
	"for parent in _DNA_RESIDUES + _RNA_RESIDUES:\n",
	" print(\"{:>3} {:>3}\".format(parent, len(_CCD_VARIANTS[parent])))"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Modifying MDTraj's `is_nucleic`\n",
	"\n",
	"This currently requires my branch \"nucleic\" of MDTraj, which is the basis of the PR [mdtraj/mdtraj#1398](https://github.com/mdtraj/mdtraj/pull/1398). Currently, the `is_nucleic` functionality is not implemented at all in the core code.\n",
	"\n",
	"In addition to checking out that branch, you'll also need to download the file for PDB entry [486d](https://www.rcsb.org/structure/486d) and put that in the same directory as this notebook."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 8,
	"metadata": {},
	"outputs": [],
	"source": [
	"import mdtraj as md\n",
	"from mdtraj.core.residue_names import residues_and_codes"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 9,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"346\n"
	]
	}
	],
	"source": [
	"traj = md.load(\"486d.pdb\")\n",
	"print(traj.topology.n_residues)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 10,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"313\n"
	]
	}
	],
	"source": [
	"# count how many residues count as nucleic\n",
	"print(sum([res.is_nucleic for res in traj.topology.residues]))"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 11,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"['UCCGUGAUAGUUAAGGCAGAAUCCAGAUGAAAUUGGAGAUGGGGCAAUUCCCCGUCGCGGAGCC',\n",
	" 'UUU',\n",
	" 'GCGGAUUUACUCAGGGGAGAGCCCCGCUGUCCGUGGGAGUCUGUGCGUCCACAGAAUUCGC',\n",
	" 'GUC',\n",
	" 'UCCGUGAUAGUUAAGGCAGAAUGGGCGCUGUCCGUGCCAGAUGGGGCAAUUCCCCGUCGCGGAG',\n",
	" 'CGUCACACCAUGGAAGGGGGUUGCAAAGAAAGGGUGGCCCCCUUCGGGGCGCCUAACUUUGUGAUUCCUAACUGGGGUGAAGUCG',\n",
	" 'CUGGGGAGUACGGCCGCAAGGCUGAAACUCAAA',\n",
	" '']"
	]
	},
	"execution_count": 11,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"traj.topology.to_fasta()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"The default behavior in this branch does not include the linkers defined in the CCD. So let's monkey-patch MDTraj to include the extra residues we want to identify as nucleic. The two variables we need to change are `_NUCLEIC_RESIDUES`, which is a list of the residues that count as nucleic, and `_NUCLEIC_ACID_CODES`, which maps the residue to its 1-letter abbreviation (allowing it to be used in FASTA representation.)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 12,
	"metadata": {},
	"outputs": [],
	"source": [
	"residues, codes = residues_and_codes([_OFFICIAL_VARIANTS,\n",
	" _AMBER_VARIANTS,\n",
	" _PROTONATED_VARIANTS,\n",
	" _CCD_VARIANTS])"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 13,
	"metadata": {},
	"outputs": [],
	"source": [
	"# monkey patching MDTraj to use other codes/residues\n",
	"md.core.topology._NUCLEIC_RESIDUES = residues\n",
	"md.core.topology._NUCLEIC_ACID_CODES = codes"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"I was a little surprised to see that GDP is included in the nucleic linking residues in the CCD. Neither GTP nor ADP are included."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 14,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"True"
	]
	},
	"execution_count": 14,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"'GDP' in residues"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 15,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"False"
	]
	},
	"execution_count": 15,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"'GTP' in residues"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 16,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"False"
	]
	},
	"execution_count": 16,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"'ADP' in residues"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Now let's count up how many residues from this PDB entry are considered nucleic:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 17,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"342"
	]
	},
	"execution_count": 17,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"sum([res.is_nucleic for res in traj.topology.residues])"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"All but 4 residues are nucleic. Which ones are still not nucleic?"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 18,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"IR1\n",
	"IR2\n",
	"IR3\n",
	"IR4\n"
	]
	}
	],
	"source": [
	"for res in traj.topology.residues:\n",
	" if not res.is_nucleic:\n",
	" print(res)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"The FASTA strings are also changed for the chains that contained these residues. You can tell that some strings are longer, although it takes close inspection to see "
	]
	},
	{
	"cell_type": "code",
	"execution_count": 19,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"['UCCGUGAUAGUUUAAUGGUCAGAAUGCCAGACUGAAGAUCUGGAGAUCGGGGUUCAAUUCCCCGUCGCGGAGCC',\n",
	" 'UUU',\n",
	" 'GCGGAUUUAGCUCAGUUGGGAGAGCGCCCGCUUGUCGCGUGGGAGGUCCUGUGUUCGAUCCACAGAAUUCGC',\n",
	" 'GUC',\n",
	" 'UCCGUGAUAGUUUAAUGGUCAGAAUGGGCGCUUGUCGCGUGCCAGAUCGGGGUUCAAUUCCCCGUCGCGGAG',\n",
	" 'CGUCACACCAUGGAAGGGGGUUGCAAAGAAAGGGUGGCCCCCUUCGGGGCGCCUAACUUUGUGAUUCCUAACUGGGGUGAAGUCG',\n",
	" 'CUGGGGAGUACGGCCGCAAGGCUGAAACUCAAA',\n",
	" '']"
	]
	},
	"execution_count": 19,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"traj.topology.to_fasta()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Let's say that we don't consider the residue [YYG](https://www.ebi.ac.uk/pdbe-srv/pdbechem/chemicalCompound/show/YYG) to be similar enough to guanosine to count for the letter `G` in the FASTA code. Instead, we'll use the generic `N` for nucleic acid."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 20,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"['UCCGUGAUAGUUUAAUGGUCAGAAUGCCAGACUGAANAUCUGGAGAUCGGGGUUCAAUUCCCCGUCGCGGAGCC',\n",
	" 'UUU',\n",
	" 'GCGGAUUUAGCUCAGUUGGGAGAGCGCCCGCUUGUCGCGUGGGAGGUCCUGUGUUCGAUCCACAGAAUUCGC',\n",
	" 'GUC',\n",
	" 'UCCGUGAUAGUUUAAUGGUCAGAAUGGGCGCUUGUCGCGUGCCAGAUCGGGGUUCAAUUCCCCGUCGCGGAG',\n",
	" 'CGUCACACCAUGGAAGGGGGUUGCAAAGAAAGGGUGGCCCCCUUCGGGGCGCCUAACUUUGUGAUUCCUAACUGGGGUGAAGUCG',\n",
	" 'CUGGGGAGUACGGCCGCAAGGCUGAAACUCAAA',\n",
	" '']"
	]
	},
	"execution_count": 20,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"residues, codes = residues_and_codes([_OFFICIAL_VARIANTS,\n",
	" _AMBER_VARIANTS,\n",
	" _PROTONATED_VARIANTS,\n",
	" _CCD_VARIANTS],\n",
	" code_dict={'YYG': 'N'})\n",
	"md.core.topology._NUCLEIC_RESIDUES = residues\n",
	"md.core.topology._NUCLEIC_ACID_CODES = codes\n",
	"traj.topology.to_fasta()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Note that sections sequence which was previously `GAAG` in the first chain is now `GAAN`. What if we wanted all the extra entries from the CCD to list as `N`?"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 21,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"['UCCGUGAUAGUUNAANGGNCAGAAUNCCAGANUGAANAUNUGGAGAUNGGGGNNCAAUUCCCCGUCGCGGAGCC',\n",
	" 'UUU',\n",
	" 'GCGGAUUUANCUCAGNNGGGAGAGCNCCCGCNUGUCNCGUGGGAGNUCNUGUGNNCGNUCCACAGAAUUCGC',\n",
	" 'GUC',\n",
	" 'UCCGUGAUAGUUNAANGGNCAGAAUGGGCGCNUGUCNCGUGCCAGAUNGGGGNNCAAUUCCCCGUCGCGGAG',\n",
	" 'CGUCACACCAUGGAAGGGGGUUGCAAAGAAAGGGUGGCCCCCUUCGGGGCGCCUAACUUUGUGAUUCCUAACUGGGGUGAAGUCG',\n",
	" 'CUGGGGAGUACGGCCGCAAGGCUGAAACUCAAA',\n",
	" '']"
	]
	},
	"execution_count": 21,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"all_ccd_as_N = {alias: 'N'\n",
	" for aliases in _CCD_VARIANTS.values()\n",
	" for alias in aliases}\n",
	"residues, codes = residues_and_codes([_OFFICIAL_VARIANTS,\n",
	" _AMBER_VARIANTS,\n",
	" _PROTONATED_VARIANTS,\n",
	" _CCD_VARIANTS],\n",
	" code_dict=all_ccd_as_N)\n",
	"md.core.topology._NUCLEIC_RESIDUES = residues\n",
	"md.core.topology._NUCLEIC_ACID_CODES = codes\n",
	"traj.topology.to_fasta()"
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.7.2"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 2
	}