@ptosco
Created June 24, 2021 16:55
Use a custom normalization reaction list with the MolStandardizer
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2021.03.1\n"
]
}
],
"source": [
"import rdkit\n",
"print(rdkit.__version__)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import tempfile\n",
"from rdkit import Chem\n",
"from rdkit.Chem.Draw import MolsToGridImage\n",
"from rdkit.Chem.MolStandardize import rdMolStandardize"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can pass the MolStandardizer a custom list of normalization reactions.<br/>\n",
"Here I copied the standard RDKit list and tweaked the `Pyridine oxide to n+O-` rule to make it more specific (the pattern in the list below matches only a neutral aromatic nitrogen with no attached hydrogen), so that it is no longer triggered by molecules that are not actually N-oxides (such as yours):"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"custom_normalizations = \"\"\"// Opposite of #2.1 in InChI technical manual? Covered by RDKit\n",
"// Sanitization.\n",
"Nitro to N+(O-)=O\t[N,P,As,Sb;X3:1](=[O,S,Se,Te:2])=[O,S,Se,Te:3]>>[*+1:1]([*-1:2])=[*:3]\n",
"Sulfone to S(=O)(=O)\t[S+2:1]([O-:2])([O-:3])>>[S+0:1](=[O-0:2])(=[O-0:3])\n",
"Pyridine oxide to n+O-\t[nH0+0:1]=[O:2]>>[n+:1][O-:2]\n",
"Azide to N=N+=N-\t[*:1][N:2]=[N:3]#[N:4]>>[*:1][N:2]=[N+:3]=[N-:4]\n",
"Diazo/azo to =N+=N-\t[*:1]=[N:2]#[N:3]>>[*:1]=[N+:2]=[N-:3]\n",
"Sulfoxide to -S+(O-)-\t[!O:1][S+0;X3:2](=[O:3])[!O:4]>>[*:1][S+1:2]([O-:3])[*:4]\n",
"// Equivalent to #1.5 in InChI technical manual\n",
"Phosphate to P(O-)=O\t[O,S,Se,Te;-1:1][P+;D4:2][O,S,Se,Te;-1:3]>>[*+0:1]=[P+0;D5:2][*-1:3]\n",
"// Equivalent to #1.8 in InChI technical manual\n",
"C/S+N to C/S=N+\t[C,S&!$([S+]-[O-]);X3+1:1]([NX3:2])[NX3!H0:3]>>[*+0:1]([N:2])=[N+:3]\n",
"// Equivalent to #1.8 in InChI technical manual\n",
"P+N to P=N+\t[P;X4+1:1]([NX3:2])[NX3!H0:3]>>[*+0:1]([N:2])=[N+:3]\n",
"Normalize hydrazine-diazonium\t[CX4:1][NX3H:2]-[NX3H:3][CX4:4][NX2+:5]#[NX1:6]>>[CX4:1][NH0:2]=[NH+:3][C:4][N+0:5]=[NH:6]\n",
"// Equivalent to #1.3 in InChI technical manual\n",
"Recombine 1,3-separated charges\t[N,P,As,Sb,O,S,Se,Te;-1:1]-[A+0:2]=[N,P,As,Sb,O,S,Se,Te;+1:3]>>[*-0:1]=[*:2]-[*+0:3]\n",
"Recombine 1,3-separated charges\t[n,o,p,s;-1:1]:[a:2]=[N,O,P,S;+1:3]>>[*-0:1]:[*:2]-[*+0:3]\n",
"Recombine 1,3-separated charges\t[N,O,P,S;-1:1]-[a:2]:[n,o,p,s;+1:3]>>[*-0:1]=[*:2]:[*+0:3]\n",
"Recombine 1,5-separated charges\t[N,P,As,Sb,O,S,Se,Te;-1:1]-[A+0:2]=[A:3]-[A:4]=[N,P,As,Sb,O,S,Se,Te;+1:5]>>[*-0:1]=[*:2]-[*:3]=[*:4]-[*+0:5]\n",
"Recombine 1,5-separated charges\t[n,o,p,s;-1:1]:[a:2]:[a:3]:[c:4]=[N,O,P,S;+1:5]>>[*-0:1]:[*:2]:[*:3]:[c:4]-[*+0:5]\n",
"Recombine 1,5-separated charges\t[N,O,P,S;-1:1]-[c:2]:[a:3]:[a:4]:[n,o,p,s;+1:5]>>[*-0:1]=[c:2]:[*:3]:[*:4]:[*+0:5]\n",
"// Conjugated cation rules taken from Francis Atkinson's standardiser. Those\n",
"// that can reduce aromaticity aren't included\n",
"Normalize 1,3 conjugated cation\t[N,O;+0!H0:1]-[A:2]=[N!$(*[O-]),O;+1H0:3]>>[*+1:1]=[*:2]-[*+0:3]\n",
"Normalize 1,3 conjugated cation\t[n;+0!H0:1]:[c:2]=[N!$(*[O-]),O;+1H0:3]>>[*+1:1]:[*:2]-[*+0:3]\n",
"Normalize 1,5 conjugated cation\t[N,O;+0!H0:1]-[A:2]=[A:3]-[A:4]=[N!$(*[O-]),O;+1H0:5]>>[*+1:1]=[*:2]-[*:3]=[*:4]-[*+0:5]\n",
"Normalize 1,5 conjugated cation\t[n;+0!H0:1]:[a:2]:[a:3]:[c:4]=[N!$(*[O-]),O;+1H0:5]>>[n+1:1]:[*:2]:[*:3]:[*:4]-[*+0:5]\n",
"// Equivalent to #1.6 in InChI technical manual. RDKit Sanitization handles\n",
"// this for perchlorate.\n",
"Charge normalization\t[F,Cl,Br,I,At;-1:1]=[O:2]>>[*-0:1][O-:2]\n",
"Charge recombination\t[N,P,As,Sb;-1:1]=[C+;v3:2]>>[*+0:1]#[C+0:2]\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"m = Chem.MolFromSmiles('Cn1c(=O)c2nc[nH][n+](=O)c2n(C)c1=O')"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"params = rdMolStandardize.CleanupParameters()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a one-off cleanup job you can use `rdMolStandardize.Cleanup()`:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"with tempfile.NamedTemporaryFile() as hnd:\n",
"    hnd.write(custom_normalizations.encode(\"utf-8\"))\n",
"    hnd.flush()\n",
"    params.normalizationsFile = hnd.name\n",
"    clean_mol = rdMolStandardize.Cleanup(m, params)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<rdkit.Chem.rdchem.Mol at 0x7fbf06e2a0d0>"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clean_mol"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Parsing the same file again for every molecule is inefficient, so if you need to clean up many molecules you might prefer to create the standardizer objects once..."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"params = rdMolStandardize.CleanupParameters()\n",
"metal_disconnector = rdMolStandardize.MetalDisconnector()\n",
"normalizer = rdMolStandardize.NormalizerFromData(custom_normalizations, params)\n",
"reionizer = rdMolStandardize.Reionizer()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"...and then re-use them (e.g., in a loop) for all the molecules you need to standardize; these are the operations that `rdMolStandardize.Cleanup()` carries out in sequence:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"clean_mol = metal_disconnector.Disconnect(m)\n",
"clean_mol = normalizer.normalize(clean_mol)\n",
"clean_mol = reionizer.reionize(clean_mol)\n",
"Chem.AssignStereochemistry(clean_mol)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In either case, the amended pattern for pyridine _N_-oxides no longer causes trouble, and you can enumerate the other tautomers of that molecule correctly:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"taut = rdMolStandardize.TautomerEnumerator()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<IPython.core.display.Image object>"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"MolsToGridImage(list(taut.Enumerate(clean_mol)))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.8"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
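
The custom list in the notebook uses a simple line format: lines starting with `//` are comments, and each rule is a name and a reaction SMARTS separated by a tab. As a minimal sketch (plain Python, not RDKit's own parser), a hypothetical helper `parse_normalizations` can list the rules and catch a missing tab or a missing `>>` before the text is handed to `NormalizerFromData`. The helper name and its error message are assumptions made for illustration only.

```python
def parse_normalizations(text):
    """Split a normalization list into (name, reaction SMARTS) pairs.

    Hypothetical helper for sanity-checking a custom list; it only looks at
    the line layout and does not validate the SMARTS themselves.
    """
    rules = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines and '//' comment lines.
        if not stripped or stripped.startswith("//"):
            continue
        # Each rule is 'name<TAB>reactant>>product'.
        name, sep, smarts = stripped.partition("\t")
        if not sep or ">>" not in smarts:
            raise ValueError(
                f"line {line_no}: expected 'name<TAB>reactant>>product'"
            )
        rules.append((name.strip(), smarts.strip()))
    return rules


# Two rules copied from the custom list, plus a comment line.
sample = (
    "// Comment lines start with two slashes\n"
    "Pyridine oxide to n+O-\t[nH0+0:1]=[O:2]>>[n+:1][O-:2]\n"
    "Charge normalization\t[F,Cl,Br,I,At;-1:1]=[O:2]>>[*-0:1][O-:2]\n"
)

rules = parse_normalizations(sample)
for name, smarts in rules:
    print(f"{name}: {smarts}")
```

Running the same helper on `custom_normalizations` from the notebook gives a quick overview of which rules are active, which is handy when an edited list silently loses a tab.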