Skip to content

Instantly share code, notes, and snippets.

@gauravsdeshmukh
Created July 7, 2018 19:11
Show Gist options
  • Save gauravsdeshmukh/bf4759ae5d5599659debee43b4aac579 to your computer and use it in GitHub Desktop.
Save gauravsdeshmukh/bf4759ae5d5599659debee43b4aac579 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# DortmundCrawler\n",
"\n",
"This is a bot that scans the Dortmund Data Bank VLE Index for the supplied compound names and returns tabular VLE data in the form of arrays. To take a look at the website that it scrapes, click here: http://www.ddbst.com/en/EED/VLE/VLEindex.php\n",
"\n",
"It returns 4 things:\n",
"1. The tabular data in the form of arrays\n",
"2. The constant values associated with the tables (like constant temperature or pressure)\n",
"3. The status of the table (whether the table is T-X-Y (status=1) or P-X-Y (status=0))\n",
"4. The switch variable which indicates whether the mole fraction data is with respect to the first or second compound"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import requests\n",
"from bs4 import BeautifulSoup\n",
"import scipy as sci\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we define some functions that make crawling the page easier and more systematic. The tables are located under headers named \"Data Table\" so if we locate these headers on the page, we can extract the tables below them."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#Get the value of the constant (Pressure or Temperature) above the table\n",
"def GetConstantValue(useful_header):\n",
" useful_value=useful_header.find_previous(\"table\")\n",
" col=useful_value.find_all(\"td\")\n",
" constvalue=col[0].get_text()\n",
" constvalue=float(constvalue)\n",
" return constvalue\n",
"\n",
"#Get the all the headers named \"Data Table\" on the page\n",
"def GetUsefulHeaders(soup):\n",
" headers=soup.find_all(\"h4\")\n",
" useful_headers=[]\n",
" for header in headers:\n",
" text=header.get_text()\n",
" if(text==\"Data Table\"):\n",
" useful_headers.append(header)\n",
" return useful_headers\n",
"\n",
"#With reference to the headers extracted above, get the tables that follow them. \n",
"def GetUsefulTables(useful_headers):\n",
" number_of_tables=len(useful_headers)\n",
" list_of_arrays=[]\n",
" constvalues=sci.zeros(number_of_tables)\n",
" stats=sci.zeros(number_of_tables)\n",
" for i in range(1,number_of_tables):\n",
" useful_header=useful_headers[i]\n",
" useful_table=useful_header.find_next(\"table\")\n",
" rows=useful_table.find_all(\"tr\")\n",
" number_of_cols=len(rows[-1].find_all(\"td\"))\n",
" if(number_of_cols<3): #Skip tables with less than 3 columns (tables with only X-Y data)\n",
" continue\n",
" constvalues[i]=GetConstantValue(useful_header)\n",
" stats[i]=GetTableStat(useful_header)\n",
" arr=sci.zeros((len(rows),3))\n",
" \n",
" #Get values in data cells in the table\n",
" row_counter=0\n",
" for row in rows:\n",
" cols=row.find_all(\"td\")\n",
" for cell in range(len(cols)):\n",
" arr[row_counter,cell]=cols[cell].get_text()\n",
" row_counter+=1\n",
" \n",
" #Delete the first and last rows (mole fraction 0 and 1) \n",
" arr=sci.delete(arr,(0),axis=0)\n",
" if(arr[0,1]==0.):\n",
" arr=sci.delete(arr,(0),axis=0)\n",
" if(arr[-1,1]==1.):\n",
" arr=sci.delete(arr,(-1),axis=0)\n",
" list_of_arrays.append(arr)\n",
" \n",
" final_constvalues=constvalues[constvalues!=0]\n",
" final_stats=stats[constvalues!=0]\n",
" return [list_of_arrays,final_constvalues,final_stats]\n",
" \n",
"\n",
"#Get status of table, 1 for T-X-Y data and 2 for P-X-Y data\n",
"def GetTableStat(useful_header):\n",
" useful_value=useful_header.find_previous(\"table\")\n",
" col=useful_value.find_all(\"td\")\n",
" check=col[1].get_text()\n",
" if(check==\"K\"):\n",
" return 0\n",
" else:\n",
" return 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let us define the main Crawl function that will request the url and parse the page using the module BeautifulSoup. All the functions defined above are called in this function. "
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def Crawl(compound1,compound2):\n",
" comp1=f\"{compound1}\"\n",
" comp2=f\"{compound2}\"\n",
" page=requests.get(f\"http://www.ddbst.com/en/EED/VLE/VLE%20{comp1}%3B{comp2}.php\")\n",
" if(page.status_code==200): #status code 200 indicates that the url was requested successfully\n",
" switch=0\n",
" else:\n",
" switch=1\n",
" page=requests.get(f\"http://www.ddbst.com/en/EED/VLE/VLE%20{comp2}%3B{comp1}.php\")\n",
" soup=BeautifulSoup(page.text,\"html.parser\")\n",
"\n",
" useful_headers=GetUsefulHeaders(soup)\n",
" [tables,constantvalues,stats]=GetUsefulTables(useful_headers)\n",
" return [tables,constantvalues,stats,switch]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that all the required functions have been defined, let us choose two compounds (p-Xylene and Acetone) and scrape their VLE data from the Dortmund Data Bank website using our crawler."
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0., 0.])"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"compound1=\"p-Xylene\"\n",
"compound2=\"Acetone\"\n",
"\n",
"[tables,constantvalues,stats,switch]=Crawl(compound1,compound2) #Call the main crawler function\n",
"stats"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since both the stat values are 0, both the tables are P-X-Y. Therefore, the constant values are temperature values in K. "
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 313.15, 353.15])"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"constantvalues"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let us nicely frame the tables in dataframes."
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The first table is the P-X-Y data at a constant temperature of 313.15 K\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Pressure (kPa)</th>\n",
" <th>X</th>\n",
" <th>Y</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>9.9</td>\n",
" <td>0.0473</td>\n",
" <td>0.6217</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>19.0</td>\n",
" <td>0.1427</td>\n",
" <td>0.8500</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>29.2</td>\n",
" <td>0.2943</td>\n",
" <td>0.9148</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>40.7</td>\n",
" <td>0.5735</td>\n",
" <td>0.9543</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>50.3</td>\n",
" <td>0.8680</td>\n",
" <td>0.9844</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>53.5</td>\n",
" <td>0.9486</td>\n",
" <td>0.9851</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Pressure (kPa) X Y\n",
"0 9.9 0.0473 0.6217\n",
"1 19.0 0.1427 0.8500\n",
"2 29.2 0.2943 0.9148\n",
"3 40.7 0.5735 0.9543\n",
"4 50.3 0.8680 0.9844\n",
"5 53.5 0.9486 0.9851"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"table_1=tables[0] #store the first table in table_1\n",
"df1=pd.DataFrame({\"Pressure (kPa)\":pd.Series(table_1[:,0]),\"X\":pd.Series(table_1[:,1]),\"Y\":pd.Series(table_1[:,2])})\n",
"print(f\"The first table is the P-X-Y data at a constant temperature of {constantvalues[0]} K\")\n",
"display(df1)"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The second table is the P-X-Y data at a constant temperature of {constantvalues[1]} K\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Pressure (kPa)</th>\n",
" <th>X</th>\n",
" <th>Y</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>33.600</td>\n",
" <td>0.0327</td>\n",
" <td>0.4535</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>75.400</td>\n",
" <td>0.1431</td>\n",
" <td>0.7640</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>144.700</td>\n",
" <td>0.5549</td>\n",
" <td>0.9342</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>171.101</td>\n",
" <td>0.7242</td>\n",
" <td>0.9567</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>188.101</td>\n",
" <td>0.8716</td>\n",
" <td>0.9794</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>205.400</td>\n",
" <td>0.9712</td>\n",
" <td>0.9943</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Pressure (kPa) X Y\n",
"0 33.600 0.0327 0.4535\n",
"1 75.400 0.1431 0.7640\n",
"2 144.700 0.5549 0.9342\n",
"3 171.101 0.7242 0.9567\n",
"4 188.101 0.8716 0.9794\n",
"5 205.400 0.9712 0.9943"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"table_2=tables[1] #store the first table in table_2\n",
"df2=pd.DataFrame({\"Pressure (kPa)\":pd.Series(table_2[:,0]),\"X\":pd.Series(table_2[:,1]),\"Y\":pd.Series(table_2[:,2])})\n",
"print(\"The second table is the P-X-Y data at a constant temperature of {constantvalues[1]} K\")\n",
"display(df2)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment