@PatWalters
Last active November 15, 2023 01:26
{
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"id": "e3126b59",
"metadata": {},
"outputs": [],
"source": [
"import chemfp\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "783d5d82",
"metadata": {},
"source": [
"Let's assume we have the set of compounds in **set_A.smi.gz** and we want to purchase a diverse set of compounds from **set_B.smi.gz** that are not similar to those in **set_A.smi.gz**. "
]
},
{
"cell_type": "markdown",
"id": "2dc8270a",
"metadata": {},
"source": [
"1. Convert SMILES to RDKit fingerprints and read into arenas"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a73183a1",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "960047c7a4d74624ac117c57e743887d",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"set_A.smi.gz: 0%| …"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"arena_A = chemfp.rdkit2fps(\"set_A.smi.gz\",\"set_A.fpb\", overwrite=False).load_output()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "a69a12bc",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "926c2c5d024a4ec996865439556d9df8",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"set_B.smi.gz: 0%| …"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"arena_B = chemfp.rdkit2fps(\"set_B.smi.gz\",\"set_B.fpb\", overwrite=False).load_output()"
]
},
{
"cell_type": "markdown",
"id": "87d5cb54",
"metadata": {},
"source": [
"2. Calculate the similarity of molecules in **set_B** to molecules in **set_A**"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "bf24b022",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "e9be523f439542be9f36b278121d0d2c",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"queries: 0%| …"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"result = chemfp.simsearch(queries=arena_B, targets=arena_A, k=1)"
]
},
{
"cell_type": "markdown",
"id": "b7d95548",
"metadata": {},
"source": [
"3. Get the indicies of the molecules in **set_B** with similarity less than the cutoff value"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "f5f1d1fc",
"metadata": {},
"outputs": [],
"source": [
"similarity_cutoff = 0.25\n",
"not_similar_df = result.to_pandas().query(\"score < @similarity_cutoff\")\n",
"not_similar_idx = [arena_B.get_index_by_id(x) for x in not_similar_df.query_id]"
]
},
{
"cell_type": "markdown",
"id": "652f93cc",
"metadata": {},
"source": [
"4. Create an arena with **not_similar_index**"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "ec516ba9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"38651"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"not_similar_arena = arena_B.copy(indices=not_similar_idx)\n",
"len(not_similar_arena)"
]
},
{
"cell_type": "markdown",
"id": "8075d488",
"metadata": {},
"source": [
"5. Cluster the compounds in **not_similar_df**"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "9b92c718",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "145a3c73278d481b9d14b3e0803402f0",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"NxN: 5%|5 | 2000/38651 [00:00<00:11, 3109.89/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"clusters = chemfp.butina(fingerprints=not_similar_arena,NxN_threshold=0.35)"
]
},
{
"cell_type": "markdown",
"id": "457275cc",
"metadata": {},
"source": [
"6. Get the cluster centers"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "3c2de564",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"856"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"center_df = clusters.to_pandas().query(\"type=='CENTER'\")\n",
"len(center_df)"
]
},
{
"cell_type": "markdown",
"id": "f8e6f499",
"metadata": {},
"source": [
"7. Read the SMILES for set_B so we can merge this with **center_df**. Note that we need to change the datatype of the \"Name\" column to **str** to be consistent with the dataframe returned by ChemFP. "
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "ecf56df7",
"metadata": {},
"outputs": [],
"source": [
"df_B = pd.read_csv(\"set_B.smi.gz\",sep=\" \",names=[\"SMILES\",\"Name\"])\n",
"df_B.Name = df_B.Name.astype(str)"
]
},
{
"cell_type": "markdown",
"id": "6520ecf5",
"metadata": {},
"source": [
"8. Merge the datasets"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "8940ef52",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>SMILES</th>\n",
" <th>Name</th>\n",
" <th>cluster</th>\n",
" <th>id</th>\n",
" <th>type</th>\n",
" <th>score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>CCCCCCCCc1ccc(-c2ccc(C3=CN(/N=C/c4ccc(OCC)cc4)...</td>\n",
" <td>504767809</td>\n",
" <td>469</td>\n",
" <td>504767809</td>\n",
" <td>CENTER</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Cc1ccc(S(=O)(=O)c2ccc(N/N=C/c3s/c(=N/S(=O)(=O)...</td>\n",
" <td>97980914</td>\n",
" <td>847</td>\n",
" <td>97980914</td>\n",
" <td>CENTER</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>CC1(C)CC(=O)C2=C(C1)N(c1nnc(SCc3ccccc3Cl)s1)C(...</td>\n",
" <td>98006314</td>\n",
" <td>170</td>\n",
" <td>98006314</td>\n",
" <td>CENTER</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Clc1ccc(COc2cccc(/C=N/Nc3nc(-c4ccc(Br)cc4)cs3)...</td>\n",
" <td>409074051</td>\n",
" <td>112</td>\n",
" <td>409074051</td>\n",
" <td>CENTER</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>CCCCCCCCOc1ccc(-c2nnc(-c3ccc(-n4c5cc(Br)ccc5c5...</td>\n",
" <td>97971317</td>\n",
" <td>430</td>\n",
" <td>97971317</td>\n",
" <td>CENTER</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>851</th>\n",
" <td>COc1ccc(N2C(=S)N(C(=O)Oc3ccccc3)/C(=N/c3c(C)cc...</td>\n",
" <td>33405087</td>\n",
" <td>635</td>\n",
" <td>33405087</td>\n",
" <td>CENTER</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>852</th>\n",
" <td>C=CCOC(=O)c1cc(-c2ccc(/C=C3\\C(=O)N(c4ccc(Cl)c(...</td>\n",
" <td>100147482</td>\n",
" <td>261</td>\n",
" <td>100147482</td>\n",
" <td>CENTER</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>853</th>\n",
" <td>CCc1ccc(-c2cs/c(=N\\C)n2/N=C/c2ccc(OS(=O)(=O)c3...</td>\n",
" <td>102960331</td>\n",
" <td>571</td>\n",
" <td>102960331</td>\n",
" <td>CENTER</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>854</th>\n",
" <td>Cc1ccc(/N=C2\\S/C(=C/c3cc([N+](=O)[O-])ccc3N3CC...</td>\n",
" <td>408914622</td>\n",
" <td>84</td>\n",
" <td>408914622</td>\n",
" <td>CENTER</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>855</th>\n",
" <td>COC(=O)CC[C@H](C)[C@H]1CC[C@H]2[C@@H]3CC[C@H]4...</td>\n",
" <td>118924972</td>\n",
" <td>187</td>\n",
" <td>118924972</td>\n",
" <td>CENTER</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>856 rows × 6 columns</p>\n",
"</div>"
],
"text/plain": [
" SMILES Name cluster \\\n",
"0 CCCCCCCCc1ccc(-c2ccc(C3=CN(/N=C/c4ccc(OCC)cc4)... 504767809 469 \n",
"1 Cc1ccc(S(=O)(=O)c2ccc(N/N=C/c3s/c(=N/S(=O)(=O)... 97980914 847 \n",
"2 CC1(C)CC(=O)C2=C(C1)N(c1nnc(SCc3ccccc3Cl)s1)C(... 98006314 170 \n",
"3 Clc1ccc(COc2cccc(/C=N/Nc3nc(-c4ccc(Br)cc4)cs3)... 409074051 112 \n",
"4 CCCCCCCCOc1ccc(-c2nnc(-c3ccc(-n4c5cc(Br)ccc5c5... 97971317 430 \n",
".. ... ... ... \n",
"851 COc1ccc(N2C(=S)N(C(=O)Oc3ccccc3)/C(=N/c3c(C)cc... 33405087 635 \n",
"852 C=CCOC(=O)c1cc(-c2ccc(/C=C3\\C(=O)N(c4ccc(Cl)c(... 100147482 261 \n",
"853 CCc1ccc(-c2cs/c(=N\\C)n2/N=C/c2ccc(OS(=O)(=O)c3... 102960331 571 \n",
"854 Cc1ccc(/N=C2\\S/C(=C/c3cc([N+](=O)[O-])ccc3N3CC... 408914622 84 \n",
"855 COC(=O)CC[C@H](C)[C@H]1CC[C@H]2[C@@H]3CC[C@H]4... 118924972 187 \n",
"\n",
" id type score \n",
"0 504767809 CENTER 1.0 \n",
"1 97980914 CENTER 1.0 \n",
"2 98006314 CENTER 1.0 \n",
"3 409074051 CENTER 1.0 \n",
"4 97971317 CENTER 1.0 \n",
".. ... ... ... \n",
"851 33405087 CENTER 1.0 \n",
"852 100147482 CENTER 1.0 \n",
"853 102960331 CENTER 1.0 \n",
"854 408914622 CENTER 1.0 \n",
"855 118924972 CENTER 1.0 \n",
"\n",
"[856 rows x 6 columns]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"merge_df = df_B.merge(center_df,left_on='Name',right_on='id')\n",
"merge_df"
]
},
{
"cell_type": "markdown",
"id": "b0aaf008",
"metadata": {},
"source": [
"9. Save the merged dataframe"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "7c430a1e",
"metadata": {},
"outputs": [],
"source": [
"merge_df.to_csv(\"cmpds_to_purchase.csv\",index=False)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
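
For readers without chemfp installed, the logic of steps 2 through 6 (drop the set_B compounds that are too close to set_A, then cluster what is left and keep the cluster centers) can be sketched in plain Python. This is only an illustration under stated assumptions, not chemfp's implementation: fingerprints are modeled as frozensets of on-bit positions, the helper names `tanimoto`, `filter_dissimilar`, and `butina_centers` are hypothetical, the toy fingerprints are made up, and the clustering threshold is assumed to be a minimum similarity, as `NxN_threshold` appears to be in the notebook.

```python
from typing import Dict, FrozenSet, List

Fingerprint = FrozenSet[int]  # assumption: a fingerprint is its set of on-bits


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity: shared on-bits divided by total distinct on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def filter_dissimilar(queries: Dict[str, Fingerprint],
                      targets: Dict[str, Fingerprint],
                      cutoff: float) -> List[str]:
    """Return the query ids whose nearest target scores below the cutoff."""
    keep = []
    for qid, qfp in queries.items():
        nearest = max((tanimoto(qfp, tfp) for tfp in targets.values()),
                      default=0.0)
        if nearest < cutoff:
            keep.append(qid)
    return keep


def butina_centers(fps: Dict[str, Fingerprint], threshold: float) -> List[str]:
    """Greedy Taylor-Butina sketch: repeatedly pick the unassigned item with
    the most unassigned neighbors (ties go to the earlier item) as a center."""
    ids = list(fps)
    neighbors = {
        i: {j for j in ids if j != i and tanimoto(fps[i], fps[j]) >= threshold}
        for i in ids
    }
    unassigned = set(ids)
    centers = []
    while unassigned:
        center = max((i for i in ids if i in unassigned),
                     key=lambda i: len(neighbors[i] & unassigned))
        centers.append(center)
        # The center and all of its still-unassigned neighbors form one cluster
        unassigned -= neighbors[center] | {center}
    return centers


# Toy data (made up for illustration)
set_A = {"a1": frozenset({1, 2, 3, 4})}
set_B = {
    "b1": frozenset({1, 2, 3, 5}),      # close to a1, so it is filtered out
    "b2": frozenset({10, 11, 12, 13}),
    "b3": frozenset({10, 11, 12, 14}),
    "b4": frozenset({10, 11, 13, 14}),
    "b5": frozenset({20, 21, 22}),      # unlike everything else
}

kept = filter_dissimilar(set_B, set_A, cutoff=0.25)
centers = butina_centers({i: set_B[i] for i in kept}, threshold=0.35)
print(kept)
print(centers)
```

The all-pairs comparison here is quadratic in the number of compounds, which is fine for a toy example but is the reason the notebook hands the real 38,651-compound job to chemfp.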