PatWalters/screening_database_enhancement.ipynb

## screening_database_enhancement.ipynb
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "e3126b59",
   "metadata": {},
   "outputs": [],
   "source": [
    "import chemfp\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "783d5d82",
   "metadata": {},
   "source": [
    "Let's assume we have a set of compounds in **set_A.smi.gz** and want to purchase a diverse set of compounds from **set_B.smi.gz** that are not similar to any compound in **set_A.smi.gz**."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2dc8270a",
   "metadata": {},
   "source": [
    "1. Convert SMILES to RDKit fingerprints and read into arenas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "a73183a1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "960047c7a4d74624ac117c57e743887d",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "set_A.smi.gz:   0%|                                                                                           …"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "arena_A = chemfp.rdkit2fps(\"set_A.smi.gz\",\"set_A.fpb\", overwrite=False).load_output()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "a69a12bc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "926c2c5d024a4ec996865439556d9df8",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "set_B.smi.gz:   0%|                                                                                           …"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "arena_B = chemfp.rdkit2fps(\"set_B.smi.gz\",\"set_B.fpb\", overwrite=False).load_output()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "87d5cb54",
   "metadata": {},
   "source": [
    "2. For each molecule in **set_B**, find its most similar molecule in **set_A** (k=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "bf24b022",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "e9be523f439542be9f36b278121d0d2c",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "queries:   0%|                                                                                                …"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "result = chemfp.simsearch(queries=arena_B, targets=arena_A, k=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b7d95548",
   "metadata": {},
   "source": [
    "3. Get the indices of the molecules in **set_B** whose nearest neighbor in **set_A** has a similarity below the cutoff value"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "f5f1d1fc",
   "metadata": {},
   "outputs": [],
   "source": [
    "similarity_cutoff = 0.25\n",
    "not_similar_df = result.to_pandas().query(\"score < @similarity_cutoff\")\n",
    "not_similar_idx = [arena_B.get_index_by_id(x) for x in not_similar_df.query_id]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "652f93cc",
   "metadata": {},
   "source": [
    "4. Create an arena from **not_similar_idx**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "ec516ba9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "38651"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "not_similar_arena = arena_B.copy(indices=not_similar_idx)\n",
    "len(not_similar_arena)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8075d488",
   "metadata": {},
   "source": [
    "5. Cluster the compounds in **not_similar_arena**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "9b92c718",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "145a3c73278d481b9d14b3e0803402f0",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "NxN:   5%|5         | 2000/38651 [00:00<00:11, 3109.89/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "clusters = chemfp.butina(fingerprints=not_similar_arena,NxN_threshold=0.35)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "457275cc",
   "metadata": {},
   "source": [
    "6. Get the cluster centers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "3c2de564",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "856"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "center_df = clusters.to_pandas().query(\"type=='CENTER'\")\n",
    "len(center_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f8e6f499",
   "metadata": {},
   "source": [
    "7. Read the SMILES for **set_B** so we can merge them with **center_df**. Note that we need to change the datatype of the \"Name\" column to **str** to be consistent with the dataframe returned by chemfp."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "ecf56df7",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_B = pd.read_csv(\"set_B.smi.gz\",sep=\" \",names=[\"SMILES\",\"Name\"])\n",
    "df_B.Name = df_B.Name.astype(str)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6520ecf5",
   "metadata": {},
   "source": [
    "8. Merge the datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "8940ef52",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>SMILES</th>\n",
       "      <th>Name</th>\n",
       "      <th>cluster</th>\n",
       "      <th>id</th>\n",
       "      <th>type</th>\n",
       "      <th>score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CCCCCCCCc1ccc(-c2ccc(C3=CN(/N=C/c4ccc(OCC)cc4)...</td>\n",
       "      <td>504767809</td>\n",
       "      <td>469</td>\n",
       "      <td>504767809</td>\n",
       "      <td>CENTER</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Cc1ccc(S(=O)(=O)c2ccc(N/N=C/c3s/c(=N/S(=O)(=O)...</td>\n",
       "      <td>97980914</td>\n",
       "      <td>847</td>\n",
       "      <td>97980914</td>\n",
       "      <td>CENTER</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CC1(C)CC(=O)C2=C(C1)N(c1nnc(SCc3ccccc3Cl)s1)C(...</td>\n",
       "      <td>98006314</td>\n",
       "      <td>170</td>\n",
       "      <td>98006314</td>\n",
       "      <td>CENTER</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Clc1ccc(COc2cccc(/C=N/Nc3nc(-c4ccc(Br)cc4)cs3)...</td>\n",
       "      <td>409074051</td>\n",
       "      <td>112</td>\n",
       "      <td>409074051</td>\n",
       "      <td>CENTER</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>CCCCCCCCOc1ccc(-c2nnc(-c3ccc(-n4c5cc(Br)ccc5c5...</td>\n",
       "      <td>97971317</td>\n",
       "      <td>430</td>\n",
       "      <td>97971317</td>\n",
       "      <td>CENTER</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>851</th>\n",
       "      <td>COc1ccc(N2C(=S)N(C(=O)Oc3ccccc3)/C(=N/c3c(C)cc...</td>\n",
       "      <td>33405087</td>\n",
       "      <td>635</td>\n",
       "      <td>33405087</td>\n",
       "      <td>CENTER</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>852</th>\n",
       "      <td>C=CCOC(=O)c1cc(-c2ccc(/C=C3\\C(=O)N(c4ccc(Cl)c(...</td>\n",
       "      <td>100147482</td>\n",
       "      <td>261</td>\n",
       "      <td>100147482</td>\n",
       "      <td>CENTER</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>853</th>\n",
       "      <td>CCc1ccc(-c2cs/c(=N\\C)n2/N=C/c2ccc(OS(=O)(=O)c3...</td>\n",
       "      <td>102960331</td>\n",
       "      <td>571</td>\n",
       "      <td>102960331</td>\n",
       "      <td>CENTER</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>854</th>\n",
       "      <td>Cc1ccc(/N=C2\\S/C(=C/c3cc([N+](=O)[O-])ccc3N3CC...</td>\n",
       "      <td>408914622</td>\n",
       "      <td>84</td>\n",
       "      <td>408914622</td>\n",
       "      <td>CENTER</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>855</th>\n",
       "      <td>COC(=O)CC[C@H](C)[C@H]1CC[C@H]2[C@@H]3CC[C@H]4...</td>\n",
       "      <td>118924972</td>\n",
       "      <td>187</td>\n",
       "      <td>118924972</td>\n",
       "      <td>CENTER</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>856 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                SMILES       Name  cluster  \\\n",
       "0    CCCCCCCCc1ccc(-c2ccc(C3=CN(/N=C/c4ccc(OCC)cc4)...  504767809      469   \n",
       "1    Cc1ccc(S(=O)(=O)c2ccc(N/N=C/c3s/c(=N/S(=O)(=O)...   97980914      847   \n",
       "2    CC1(C)CC(=O)C2=C(C1)N(c1nnc(SCc3ccccc3Cl)s1)C(...   98006314      170   \n",
       "3    Clc1ccc(COc2cccc(/C=N/Nc3nc(-c4ccc(Br)cc4)cs3)...  409074051      112   \n",
       "4    CCCCCCCCOc1ccc(-c2nnc(-c3ccc(-n4c5cc(Br)ccc5c5...   97971317      430   \n",
       "..                                                 ...        ...      ...   \n",
       "851  COc1ccc(N2C(=S)N(C(=O)Oc3ccccc3)/C(=N/c3c(C)cc...   33405087      635   \n",
       "852  C=CCOC(=O)c1cc(-c2ccc(/C=C3\\C(=O)N(c4ccc(Cl)c(...  100147482      261   \n",
       "853  CCc1ccc(-c2cs/c(=N\\C)n2/N=C/c2ccc(OS(=O)(=O)c3...  102960331      571   \n",
       "854  Cc1ccc(/N=C2\\S/C(=C/c3cc([N+](=O)[O-])ccc3N3CC...  408914622       84   \n",
       "855  COC(=O)CC[C@H](C)[C@H]1CC[C@H]2[C@@H]3CC[C@H]4...  118924972      187   \n",
       "\n",
       "            id    type  score  \n",
       "0    504767809  CENTER    1.0  \n",
       "1     97980914  CENTER    1.0  \n",
       "2     98006314  CENTER    1.0  \n",
       "3    409074051  CENTER    1.0  \n",
       "4     97971317  CENTER    1.0  \n",
       "..         ...     ...    ...  \n",
       "851   33405087  CENTER    1.0  \n",
       "852  100147482  CENTER    1.0  \n",
       "853  102960331  CENTER    1.0  \n",
       "854  408914622  CENTER    1.0  \n",
       "855  118924972  CENTER    1.0  \n",
       "\n",
       "[856 rows x 6 columns]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "merge_df = df_B.merge(center_df,left_on='Name',right_on='id')\n",
    "merge_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b0aaf008",
   "metadata": {},
   "source": [
    "9. Save the merged dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "7c430a1e",
   "metadata": {},
   "outputs": [],
   "source": [
    "merge_df.to_csv(\"cmpds_to_purchase.csv\",index=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
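
The filtering and clustering logic in steps 2 to 6 of the notebook can be illustrated on toy data without chemfp or RDKit. The following is a minimal sketch, not chemfp's implementation: fingerprints are assumed to be plain Python sets of on-bit positions, and the helper names `tanimoto`, `novel_ids`, and `butina_centers` are hypothetical. The clustering is a simple Taylor-Butina style variant that recounts unassigned neighbors on each pass, which may differ in detail from what `chemfp.butina` does.

```python
# Toy illustration only: fingerprints are sets of "on" bit positions,
# not real RDKit fingerprints, and none of this is chemfp's own code.

def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def novel_ids(queries, targets, cutoff):
    """Keep query ids whose nearest target scores below the cutoff (steps 2-3)."""
    keep = []
    for qid, qfp in queries.items():
        nearest = max((tanimoto(qfp, tfp) for tfp in targets.values()), default=0.0)
        if nearest < cutoff:
            keep.append(qid)
    return keep

def butina_centers(fps, threshold):
    """Taylor-Butina style clustering; returns the cluster centers (steps 5-6)."""
    ids = list(fps)
    # Neighbor lists: every other fingerprint at or above the threshold
    neighbors = {
        i: {j for j in ids if j != i and tanimoto(fps[i], fps[j]) >= threshold}
        for i in ids
    }
    unassigned = set(ids)
    centers = []
    while unassigned:
        # Pick the unassigned fingerprint with the most unassigned neighbors;
        # ties are broken by input order
        center = max(
            (i for i in ids if i in unassigned),
            key=lambda i: len(neighbors[i] & unassigned),
        )
        centers.append(center)
        unassigned -= neighbors[center] | {center}
    return centers

# Hypothetical toy data standing in for set_A and set_B
set_a = {"a1": {1, 2, 3, 4}, "a2": {10, 11, 12}}
set_b = {
    "b1": {1, 2, 3, 5},          # close to a1, so it is filtered out
    "b2": {20, 21, 22, 23},
    "b3": {20, 21, 22, 24},
    "b4": {30, 31},
    "b5": {20, 21, 22, 23, 24},
}

novel = novel_ids(set_b, set_a, cutoff=0.25)
centers = butina_centers({i: set_b[i] for i in novel}, threshold=0.35)
print(novel)
print(centers)
```

With real data the notebook's chemfp calls do this work far more efficiently; the sketch is only meant to make the two thresholds concrete: the 0.25 cutoff removes anything close to **set_A**, and the 0.35 threshold decides which of the remaining compounds share a cluster.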
