@nickynicolson
Created January 28, 2021 10:56
Direct coding against Open Refine reconciliation API
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Direct coding against an Open Refine reconciliation API\n",
"\n",
"[Open Refine](https://openrefine.org/) offers a powerful user interface to clean up datasets and link to authoritative sources using its reconciliation API (conforming to a [W3C specification](https://reconciliation-api.github.io/specs/latest)).\n",
"\n",
"This notebook outlines steps to program (in python) a data-cleaning pipeline using the reconciliation API outside of the open refine web interface (e.g. against data held in a pandas dataframe). (Note a python package [reconciler](https://github.com/jvfe/reconciler) performs a similar job, but as of 2021-01-09 does not appear to support properties).\n",
"\n",
"The service used is maintained by [RBG Kew](http://www.kew.org/science) to match [plant names](http://data1.kew.org/reconciliation/)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"service_url='http://data1.kew.org/reconciliation/reconcile/IpniName'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Accessing service metadata"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"name\": \"IPNI Name Reconciliation Service\",\n",
" \"identifierSpace\": \"http://ipni.org/urn:lsid:ipni.org:names:\",\n",
" \"schemaSpace\": \"http://rdf.freebase.com/ns/type.object.id\",\n",
" \"view\": {\n",
" \"url\": \"http://ipni.org/urn:lsid:ipni.org:names:{{id}}\"\n",
" },\n",
" \"preview\": {\n",
" \"url\": \"http://ipni.org/urn:lsid:ipni.org:names:{{id}}\",\n",
" \"width\": 400,\n",
" \"height\": 400\n",
" },\n",
" \"suggest\": {\n",
" \"type\": {\n",
" \"service_url\": \"http://data1.kew.org\",\n",
" \"service_path\": \"/reconciliation/reconcile/IpniName/suggestType\",\n",
" \"flyout_service_url\": \"http://data1.kew.org\",\n",
" \"flyout_service_path\": \"/reconciliation/reconcile/IpniName/flyoutType/${id}\"\n",
" },\n",
" \"property\": {\n",
" \"service_url\": \"http://data1.kew.org\",\n",
" \"service_path\": \"/reconciliation/reconcile/IpniName/suggestProperty\",\n",
" \"flyout_service_url\": \"http://data1.kew.org\",\n",
" \"flyout_service_path\": \"/reconciliation/reconcile/IpniName/flyoutProperty/${id}\"\n",
" },\n",
" \"entity\": {\n",
" \"service_url\": \"http://data1.kew.org\",\n",
" \"service_path\": \"/reconciliation/reconcile/IpniName\",\n",
" \"flyout_service_url\": \"http://data1.kew.org\",\n",
" \"flyout_service_path\": \"/reconciliation/reconcile/IpniName/flyout/${id}\"\n",
" }\n",
" },\n",
" \"defaultTypes\": [\n",
" {\n",
" \"id\": \"/biology/organism_classification/scientific_name\",\n",
" \"name\": \"Scientific name\"\n",
" }\n",
" ]\n",
"}\n"
]
}
],
"source": [
"from pprint import pprint\n",
"import json\n",
"import requests\n",
"r=requests.get(service_url)\n",
"print(json.dumps(r.json(), indent=2))\n",
"service_metadata=r.json()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'http://ipni.org/urn:lsid:ipni.org:names:{{id}}'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"service_metadata[\"view\"][\"url\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The metadata for the service shows that the default type is \"/biology/organism_classification/scientific_name\".\n",
"\n",
"Calls to reconcile the data will be made against the endpoint, but the suggest API can be used to find out what extra information can be passed to the reconciliation service to improve the match results."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Types\n",
"The service metadata (shown above) includes a default type (\"/biology/organism_classification/scientific_name\"), but this call to the suggest API (using an empty prefix) lists types available through the service:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"result\": [\n",
" {\n",
" \"id\": \"/biology/organism_classification/scientific_name\",\n",
" \"name\": \"Scientific name\"\n",
" }\n",
" ]\n",
"}\n"
]
}
],
"source": [
"service_url_types=service_url+'/suggestType?prefix='\n",
"r=requests.get(service_url_types)\n",
"print(json.dumps(r.json(), indent=2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Properties\n",
"\n",
"Properties can be added to the reconciliation call to atomise the data. All available properties can be seen by supplying an empty prefix:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"result\": [\n",
" {\n",
" \"id\": \"epithet_1\",\n",
" \"name\": \"epithet_1\"\n",
" },\n",
" {\n",
" \"id\": \"epithet_2\",\n",
" \"name\": \"epithet_2\"\n",
" },\n",
" {\n",
" \"id\": \"epithet_3\",\n",
" \"name\": \"epithet_3\"\n",
" },\n",
" {\n",
" \"id\": \"basionym_author\",\n",
" \"name\": \"basionym_author\"\n",
" },\n",
" {\n",
" \"id\": \"publishing_author\",\n",
" \"name\": \"publishing_author\"\n",
" },\n",
" {\n",
" \"id\": \"full_name\",\n",
" \"name\": \"full_name\"\n",
" }\n",
" ]\n",
"}\n"
]
}
],
"source": [
"service_url_props=service_url+'/suggestProperty?prefix='\n",
"r=requests.get(service_url_props)\n",
"print(json.dumps(r.json(), indent=2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Queries\n",
"\n",
"A query is a JSON string holding the text to be passed to the reconciliation service, along with (optional) properties as extra info.\n",
"\n",
"The code below defines a query and passes in to the reconciliation service, displaying the result."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"result\": [\n",
" {\n",
" \"id\": \"77103635-1\",\n",
" \"name\": \"Solanaceae Solanum sanchez-vegae S.Knapp\",\n",
" \"type\": [\n",
" {\n",
" \"id\": \"/biology/organism_classification/scientific_name\",\n",
" \"name\": \"Scientific name\"\n",
" }\n",
" ],\n",
" \"score\": 100.0,\n",
" \"match\": true\n",
" }\n",
" ]\n",
"}\n"
]
}
],
"source": [
"query={\"query\":\"Solanum sanchez-vegae\"}\n",
"url=service_url+'?query='+json.dumps(query)\n",
"r=requests.get(url)\n",
"print(json.dumps(r.json(), indent=2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As above, but also using properties (note that the publishing author is somewhat fuzzy):"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"result\": [\n",
" {\n",
" \"id\": \"77103635-1\",\n",
" \"name\": \"Solanaceae Solanum sanchez-vegae S.Knapp\",\n",
" \"type\": [\n",
" {\n",
" \"id\": \"/biology/organism_classification/scientific_name\",\n",
" \"name\": \"Scientific name\"\n",
" }\n",
" ],\n",
" \"score\": 100.0,\n",
" \"match\": true\n",
" }\n",
" ]\n",
"}\n"
]
}
],
"source": [
"properties=[{\"p\":\"epithet_1\",\"pid\":\"epithet_1\",\"v\":\"Solanum\"}\n",
" ,{\"p\":\"epithet_2\",\"pid\":\"epithet_2\",\"v\":\"sanchez-vegae\"}\n",
" ,{\"p\":\"publishing_author\",\"pid\":\"publishing_author\",\"v\":\"Knapp\"}]\n",
"query={\"query\":\"Solanum sanchez-vegae\",\"properties\":properties}\n",
"url=service_url+'?query='+json.dumps(query)\n",
"r=requests.get(url)\n",
"print(json.dumps(r.json(), indent=2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Querying in bulk\n",
"\n",
"Set up some test data to use when querying the reconciliation service"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>fullname_w_auth</th>\n",
" <th>genus</th>\n",
" <th>species</th>\n",
" <th>infra</th>\n",
" <th>bas_auth</th>\n",
" <th>pub_auth</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Hedera helix</td>\n",
" <td>Hedera</td>\n",
" <td>helix</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Quercus robur L.</td>\n",
" <td>Quercus</td>\n",
" <td>robur</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>L.</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>Ilex aquifolia</td>\n",
" <td>Ilex</td>\n",
" <td>aquifolia</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id fullname_w_auth genus species infra bas_auth pub_auth\n",
"0 0 Hedera helix Hedera helix NaN NaN NaN\n",
"1 1 Quercus robur L. Quercus robur NaN NaN L.\n",
"2 2 Ilex aquifolia Ilex aquifolia NaN NaN NaN"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"from io import StringIO\n",
"data=\"\"\"\n",
"id|fullname_w_auth|genus|species|infra|bas_auth|pub_auth\n",
"0|Hedera helix|Hedera|helix\n",
"1|Quercus robur L.|Quercus|robur|||L.\n",
"2|Ilex aquifolia|Ilex|aquifolia\n",
"\"\"\"\n",
"df = pd.read_csv(StringIO(data), sep=\"|\")\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Map the columns in the test dataset against the properties that can be used with the reconciliation service"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"col_prop_mapper=dict()\n",
"col_prop_mapper['genus']='epithet_1'\n",
"col_prop_mapper['species']='epithet_2'\n",
"col_prop_mapper['infra']='epithet_3'\n",
"col_prop_mapper['bas_auth']='basionym_author'\n",
"col_prop_mapper['pub_auth']='publishing_author'\n",
"col_prop_mapper['fullname_w_auth']='full_name'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define some helper functions to streamline communication with the reconciliation service.\n",
"- *buildQuery* will accept a row from the pandas dataframe and translate its values into a query object (that can be passed as JSON to the reconciliation service)\n",
"- *reconcile* will accept the query object, conert it to JSON, pass to the reconciliation service and read the data from the response"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"def buildQuery(row, col_name, col_prop_mapper):\n",
" query={'query':row[col_name]}\n",
" # Add properties\n",
" properties=[]\n",
" for p_cname, p_pname in col_prop_mapper.items():\n",
" if pd.notnull(row[p_cname]):\n",
" property = {\"p\":p_pname,\"pid\":p_pname,\"v\":row[p_cname]}\n",
" properties.append(property)\n",
" query['properties']=properties\n",
" return query\n",
"\n",
"def reconcile(row,col,props):\n",
" id = None\n",
" query = buildQuery(row, col_name=col, col_prop_mapper=props)\n",
" query_json = json.dumps(query)\n",
" #print(query_json)\n",
" # pass to reconciliation service\n",
" url=service_url+'?query='+query_json\n",
" r = requests.get(url=url)\n",
" reco_results=[]\n",
" try:\n",
" res = r.json()['result'] \n",
" if res is not None:\n",
" for result in res:\n",
" id = result['id']\n",
" name = result['name']\n",
" reco_results.append([id,name]) \n",
" except:\n",
" pass\n",
" return reco_results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, use *apply* to call the reconciliation service, using the helper functions. As a single call may return multiple results, the results are exploded and the id and name then read into separate columns."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>fullname_w_auth</th>\n",
" <th>genus</th>\n",
" <th>species</th>\n",
" <th>infra</th>\n",
" <th>bas_auth</th>\n",
" <th>pub_auth</th>\n",
" <th>reco_id</th>\n",
" <th>reco_name</th>\n",
" <th>reco_link</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Hedera helix</td>\n",
" <td>Hedera</td>\n",
" <td>helix</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>90723-1</td>\n",
" <td>Araliaceae Hedera helix L.</td>\n",
" <td>http://ipni.org/urn:lsid:ipni.org:names:90723-1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Hedera helix</td>\n",
" <td>Hedera</td>\n",
" <td>helix</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>90722-1</td>\n",
" <td>Araliaceae Hedera helix Lowe</td>\n",
" <td>http://ipni.org/urn:lsid:ipni.org:names:90722-1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Quercus robur L.</td>\n",
" <td>Quercus</td>\n",
" <td>robur</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>L.</td>\n",
" <td>304293-2</td>\n",
" <td>Fagaceae Quercus robur L.</td>\n",
" <td>http://ipni.org/urn:lsid:ipni.org:names:304293-2</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>Ilex aquifolia</td>\n",
" <td>Ilex</td>\n",
" <td>aquifolia</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>83052-1</td>\n",
" <td>Aquifoliaceae Ilex aquifolium Lour.</td>\n",
" <td>http://ipni.org/urn:lsid:ipni.org:names:83052-1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>Ilex aquifolia</td>\n",
" <td>Ilex</td>\n",
" <td>aquifolia</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>83051-1</td>\n",
" <td>Aquifoliaceae Ilex aquifolium L.</td>\n",
" <td>http://ipni.org/urn:lsid:ipni.org:names:83051-1</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>Ilex aquifolia</td>\n",
" <td>Ilex</td>\n",
" <td>aquifolia</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>83050-1</td>\n",
" <td>Aquifoliaceae Ilex aquifolium Marshall</td>\n",
" <td>http://ipni.org/urn:lsid:ipni.org:names:83050-1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id fullname_w_auth genus species infra bas_auth pub_auth \\\n",
"0 0 Hedera helix Hedera helix NaN NaN NaN \n",
"0 0 Hedera helix Hedera helix NaN NaN NaN \n",
"1 1 Quercus robur L. Quercus robur NaN NaN L. \n",
"2 2 Ilex aquifolia Ilex aquifolia NaN NaN NaN \n",
"2 2 Ilex aquifolia Ilex aquifolia NaN NaN NaN \n",
"2 2 Ilex aquifolia Ilex aquifolia NaN NaN NaN \n",
"\n",
" reco_id reco_name \\\n",
"0 90723-1 Araliaceae Hedera helix L. \n",
"0 90722-1 Araliaceae Hedera helix Lowe \n",
"1 304293-2 Fagaceae Quercus robur L. \n",
"2 83052-1 Aquifoliaceae Ilex aquifolium Lour. \n",
"2 83051-1 Aquifoliaceae Ilex aquifolium L. \n",
"2 83050-1 Aquifoliaceae Ilex aquifolium Marshall \n",
"\n",
" reco_link \n",
"0 http://ipni.org/urn:lsid:ipni.org:names:90723-1 \n",
"0 http://ipni.org/urn:lsid:ipni.org:names:90722-1 \n",
"1 http://ipni.org/urn:lsid:ipni.org:names:304293-2 \n",
"2 http://ipni.org/urn:lsid:ipni.org:names:83052-1 \n",
"2 http://ipni.org/urn:lsid:ipni.org:names:83051-1 \n",
"2 http://ipni.org/urn:lsid:ipni.org:names:83050-1 "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Reconcile\n",
"df['reco_results']=df.apply(lambda row: reconcile(row, col='fullname_w_auth', props=col_prop_mapper), axis=1)\n",
"# Explode reconciliation results so that each in own row\n",
"df=df.explode('reco_results')\n",
"# Extract ID and name from exploded reconciliation results \n",
"mask=(df.reco_results.notnull())\n",
"df.loc[mask,'reco_id']=df[mask].reco_results.apply(lambda x: x[0])\n",
"df.loc[mask,'reco_name']=df[mask].reco_results.apply(lambda x: x[1])\n",
"# Drop source column as it is no longer needed\n",
"df.drop(columns=['reco_results'],inplace=True)\n",
"# Add a link to the reconciled entity\n",
"mask=(df.reco_id.notnull())\n",
"df.loc[mask,'reco_link']=df[mask].reco_id.apply(lambda id: service_metadata[\"view\"][\"url\"].replace('{{id}}',id))\n",
"df"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
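
The notebook above calls the service once per row. The reconciliation API specification also describes a `queries` parameter that carries several keyed queries in one request, which may reduce the number of round trips. The following is a minimal sketch of building such a payload, not the notebook author's method. The function names `build_query` and `build_batch_payload`, the `q0`, `q1`, ... keys and the sample records are illustrative assumptions, and whether the Kew service accepts batched queries has not been verified here. Plain dicts stand in for dataframe rows, and no network call is made.

```python
import json
from urllib.parse import urlencode


def build_query(record, name_field, field_to_property):
    """Translate one record (a plain dict) into a reconciliation query object."""
    query = {"query": record[name_field]}
    properties = []
    for field, prop in field_to_property.items():
        value = record.get(field)
        # Skip missing values: None, empty strings and float NaN (NaN != NaN)
        if value is None or value == "" or value != value:
            continue
        # Same property shape as used in the notebook ("p", "pid" and "v")
        properties.append({"p": prop, "pid": prop, "v": value})
    if properties:
        query["properties"] = properties
    return query


def build_batch_payload(records, name_field, field_to_property):
    """Key each query as q0, q1, ... and JSON-encode the batch as one parameter."""
    queries = {
        f"q{i}": build_query(rec, name_field, field_to_property)
        for i, rec in enumerate(records)
    }
    return {"queries": json.dumps(queries)}


# Hypothetical sample data, mirroring the notebook's test names
field_to_property = {
    "genus": "epithet_1",
    "species": "epithet_2",
    "pub_auth": "publishing_author",
}
records = [
    {"fullname_w_auth": "Hedera helix", "genus": "Hedera",
     "species": "helix", "pub_auth": float("nan")},
    {"fullname_w_auth": "Quercus robur L.", "genus": "Quercus",
     "species": "robur", "pub_auth": "L."},
    {"fullname_w_auth": "Ilex aquifolia", "genus": "Ilex",
     "species": "aquifolia", "pub_auth": None},
]

payload = build_batch_payload(records, "fullname_w_auth", field_to_property)
# URL-encoded form body, suitable for sending as form data
encoded = urlencode(payload)
print(json.dumps(json.loads(payload["queries"]), indent=2))
```

If the service does support batches, the payload could be sent with `requests.post(service_url, data=payload)`; the response would then be expected to be keyed by the same `q0`, `q1`, ... identifiers, which makes it straightforward to join results back to the source rows.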