@ptosco
Last active January 18, 2024 01:12
UniquifyMatches
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from rdkit import Chem"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"pattern='C~C~C(~C)~C'"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"smiles='O[C@H]1C[C@H]2C([C@@]1(C)CC2)(C)C'"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"pat = Chem.MolFromSmiles(pattern)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"mol = Chem.MolFromSmiles(smiles)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<rdkit.Chem.rdchem.Mol at 0x7fd483507f80>"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pat"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<rdkit.Chem.rdchem.Mol at 0x7fd4835140d0>"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mol"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"matches_uniquified = mol.GetSubstructMatches(pat, uniquify=True)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"35"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(matches_uniquified)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((1, 2, 3, 4, 8),\n",
" (1, 5, 4, 3, 9),\n",
" (1, 5, 4, 3, 10),\n",
" (1, 5, 4, 9, 10),\n",
" (2, 1, 5, 4, 6))"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"matches_uniquified[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"No two of the uniquified tuples contain the same set of atom indices. In fact, the set of sorted tuples has the same length as the original tuple of tuples:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len({tuple(sorted(m)) for m in matches_uniquified}) == len(matches_uniquified)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"matches = mol.GetSubstructMatches(pat, uniquify=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With `uniquify=False` we get twice as many matches:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"70"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(matches)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, some of them actually contain the same indices, only as a different permutation:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((1, 2, 3, 4, 8),\n",
" (1, 2, 3, 8, 4),\n",
" (1, 5, 4, 3, 9),\n",
" (1, 5, 4, 3, 10),\n",
" (1, 5, 4, 9, 3))"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"matches[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we define a uniquifying function that removes tuples containing the same indices in a different order and apply it to the non-uniquified matches, we get the same result as when we called `GetSubstructMatches(uniquify=True)`:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"def uniquify(matches):\n",
"    res = []\n",
"    seen = set()\n",
"    for m in matches:\n",
"        # sort the tuple so that permutations of the same atoms compare equal\n",
"        s = tuple(sorted(m))\n",
"        # have we already seen this sorted tuple before?\n",
"        # If so, skip it; otherwise add the match to the\n",
"        # uniquified result\n",
"        if s in seen:\n",
"            continue\n",
"        res.append(m)\n",
"        seen.add(s)\n",
"    return tuple(res)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our uniquifying function returns the same result as `GetSubstructMatches(uniquify=True)`:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"uniquify(matches) == matches_uniquified"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
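
The deduplication idea in the notebook can also be tried without RDKit. The following is a minimal sketch, not part of the original notebook: the function name `uniquify_by_atom_set` is hypothetical, and the sample data are the first five non-uniquified matches copied from the notebook output. It keys each match on a `frozenset` of its atom indices instead of a sorted tuple, which is equivalent here because the atom indices within a single match are distinct.

```python
def uniquify_by_atom_set(matches):
    """Keep the first match seen for each distinct set of atom indices.

    Hypothetical helper for illustration; it is not an RDKit function.
    """
    first_seen = {}
    for m in matches:
        # A frozenset ignores ordering, so permutations of the same atoms
        # map to the same key; setdefault keeps only the first one seen.
        first_seen.setdefault(frozenset(m), m)
    # Dicts preserve insertion order (guaranteed from Python 3.7, and an
    # implementation detail of CPython 3.6), so output order follows input.
    return tuple(first_seen.values())


# The first five matches obtained with uniquify=False in the notebook.
sample = (
    (1, 2, 3, 4, 8),
    (1, 2, 3, 8, 4),
    (1, 5, 4, 3, 9),
    (1, 5, 4, 3, 10),
    (1, 5, 4, 9, 3),
)

deduplicated = uniquify_by_atom_set(sample)
print(deduplicated)
```

On this sample, `(1, 2, 3, 8, 4)` and `(1, 5, 4, 9, 3)` are dropped because each is a permutation of an earlier match. Like the notebook's `uniquify`, this sketch keeps the first permutation encountered rather than a canonical one.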