@greglandrum
Created May 19, 2021 07:05
SMILES atom regex.ipynb
{
"cells": [
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import re",
"execution_count": 1,
"outputs": []
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "atom_finder = re.compile(r'\\[[^\\]]+\\]|[A-Z][a-z]?|[a-z]')\nsmiles = 'C[C@@H](Cl)C(=O)c1c[13cH]ccc1'\nprint(atom_finder.findall(smiles))\n",
"execution_count": 11,
"outputs": [
{
"output_type": "stream",
"text": "['C', '[C@@H]', 'Cl', 'C', 'O', 'c', 'c', '[13cH]', 'c', 'c', 'c']\n",
"name": "stdout"
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "ms = [x for x in atom_finder.finditer(smiles)]\n[(x.start(),x.end()) for x in ms]",
"execution_count": 16,
"outputs": [
{
"output_type": "execute_result",
"execution_count": 16,
"data": {
"text/plain": "[(0, 1),\n (1, 7),\n (8, 10),\n (11, 12),\n (14, 15),\n (16, 17),\n (18, 19),\n (19, 25),\n (25, 26),\n (26, 27),\n (27, 28)]"
},
"metadata": {}
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
}
],
"metadata": {
"kernelspec": {
"name": "python38264bitrdkitblogconda8e387449f04349d3a22f66dc2550acf5",
"display_name": "Python 3.8.2 64-bit ('rdkit_blog': conda)",
"language": "python"
},
"toc": {
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"base_numbering": 1,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
},
"language_info": {
"name": "python",
"version": "3.9.4",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"gist": {
"id": "",
"data": {
"description": "SMILES atom regex.ipynb",
"public": true
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}
@adalke

adalke commented Oct 15, 2021

This regex doesn't handle '*' terms, and it interprets 'Cc' as a single atom term rather than as the two atom terms 'C' and 'c'.
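
Both problems are easy to reproduce with the pattern from the notebook. This is only a minimal sketch: the test strings 'Cc1ccccc1' (toluene) and '*CC' are illustrative choices and do not appear in the gist, and the variable names `toluene_atoms` and `wildcard_atoms` are made up for the example.

```python
import re

# Pattern copied from the notebook cell in the gist.
atom_finder = re.compile(r'\[[^\]]+\]|[A-Z][a-z]?|[a-z]')

# Toluene: an aliphatic C followed directly by an aromatic c.
# [A-Z][a-z]? greedily swallows the 'c', so 'Cc' comes back as one token.
toluene_atoms = atom_finder.findall('Cc1ccccc1')
print(toluene_atoms)

# A wildcard atom followed by two carbons.
# No alternative in the pattern matches '*', so it is silently skipped.
wildcard_atoms = atom_finder.findall('*CC')
print(wildcard_atoms)
```

Toluene has seven heavy atoms, but the first call returns only six tokens, and the second call returns two tokens for a three-atom string.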

Here's an alternative version which handles both these cases, written using re's "verbose" notation:

atom_finder = re.compile(r"""
(
 Cl? |             # C or Cl; Cl and Br are part of the organic subset
 Br? |             # B or Br
 [NOSPFIbcnosp*] | # as are these single-letter elements, plus the * wildcard
 \[[^]]+\]         # everything else must be in []s
)
""", re.X)

@greglandrum
Author

Thanks Andrew!
