@felixlohmeier
Last active August 19, 2019 13:13
Automate GND reconciliation for OpenRefine
name;beruf;ort
J. Weizenbaum;Informatiker;Berlin
Twain, Mark;Schriftsteller;
Kumar, Lalit;;
Jemand;;
[
{
"op": "core/recon",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"columnName": "name",
"config": {
"mode": "standard-service",
"service": "https://lobid.org/gnd/reconcile",
"identifierSpace": "https://lobid.org/gnd",
"schemaSpace": "https://lobid.org/gnd",
"type": {
"id": "Person",
"name": "Person"
},
"autoMatch": true,
"columnDetails": [
{
"column": "beruf",
"propertyName": "Beruf oder Beschäftigung (Literal)",
"propertyID": "professionOrOccupationAsLiteral"
}
],
"limit": 0
},
"description": "Reconcile cells in column name to type Person"
},
{
"op": "core/extend-reconciled-data",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"baseColumnName": "name",
"endpoint": "https://lobid.org/gnd/reconcile",
"identifierSpace": "https://lobid.org/gnd",
"schemaSpace": "https://lobid.org/gnd",
"extension": {
"properties": [
{
"id": "professionOrOccupation",
"name": "Beruf oder Beschäftigung"
},
{
"id": "placeOfBirth",
"name": "Geburtsort"
},
{
"id": "placeOfDeath",
"name": "Sterbeort"
},
{
"id": "geographicAreaCode",
"name": "Ländercode"
}
]
},
"columnInsertIndex": 1,
"description": "Extend data at index 1 based on column name"
},
{
"op": "core/row-removal",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "name: judgment",
"expression": "forNonBlank(cell.recon.judgment, v, v, if(isNonBlank(value), \"(unreconciled)\", \"(blank)\"))",
"columnName": "name",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": "none",
"l": "none"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "row-based"
},
"description": "Remove rows"
},
{
"op": "core/column-removal",
"columnName": "beruf",
"description": "Remove column beruf"
},
{
"op": "core/column-removal",
"columnName": "ort",
"description": "Remove column ort"
}
]
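The history file above holds five operations that are applied in order: reconcile the `name` column, extend it with GND properties, remove rows whose reconciliation judgment is `none`, and drop the two helper columns. As a minimal sketch (not part of the original gist), the snippet below lists the `op` values of such a file without needing an OpenRefine server. The function name `list_ops` and the stand-in file are hypothetical; the pattern assumes the `"op": "..."` spelling seen above and is not a general JSON parser.

```shell
# Hypothetical helper: print the "op" value of every operation found in an
# OpenRefine history file, in file order. It matches the literal
# '"op": "..."' spelling and does not parse JSON in general.
list_ops() {
  grep -o '"op": *"[^"]*"' "$1" | sed 's/.*"\([^"]*\)"$/\1/'
}

# Stand-in file (assumption) so the sketch runs without the real download.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
[
  { "op": "core/recon", "columnName": "name" },
  { "op": "core/extend-reconciled-data", "baseColumnName": "name" },
  { "op": "core/row-removal" },
  { "op": "core/column-removal", "columnName": "beruf" },
  { "op": "core/column-removal", "columnName": "ort" }
]
EOF

list_ops "$tmp"
rm -f "$tmp"
```

Running `list_ops` on a history file before `--apply` is a quick way to confirm which steps it contains and in which order.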
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Automate GND reconciliation for OpenRefine in a Linux Bash environment"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preparations\n",
"\n",
"Ensure you have an OpenRefine server running. Then install the OpenRefine client as follows."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2019-08-19 13:11:22 URL:https://github-production-release-asset-2e65be.s3.amazonaws.com/80617276/11234c80-c030-11e9-8d8d-6b20776f164f?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20190819%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20190819T131122Z&X-Amz-Expires=300&X-Amz-Signature=9d24ce810d3d6acb6aff3430e75c5d98eea29e3ad689ae95e28c79a30bca4215&X-Amz-SignedHeaders=host&actor_id=0&response-content-disposition=attachment%3B%20filename%3Dopenrefine-client_0-3-7_linux&response-content-type=application%2Foctet-stream [4322528/4322528] -> \"/home/jovyan/.local/bin/openrefine-client\" [1]\n"
]
}
],
"source": [
"wget -nv https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.7/openrefine-client_0-3-7_linux -O ~/.local/bin/openrefine-client\n",
"chmod +x ~/.local/bin/openrefine-client"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create project\n",
"\n",
"Download sample data"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Download to file lobid-gnd-reconciliation-data.csv complete\n"
]
}
],
"source": [
"openrefine-client --download \"https://gist.githubusercontent.com/felixlohmeier/065727cffeafb216c24f730c40f3b1f6/raw/4923c19cf8bd78d53d211f046bda1afd11bf7b72/lobid-gnd-reconciliation-data.csv\" --output lobid-gnd-reconciliation-data.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import file into OpenRefine"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"id: 1615020900072\n",
"rows: 4\n"
]
}
],
"source": [
"openrefine-client --create lobid-gnd-reconciliation-data.csv --separator=\";\" --projectName=\"lobid-gnd-reconciliation\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Export project to terminal"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"name\tberuf\tort\n",
"J. Weizenbaum\tInformatiker\tBerlin\n",
"Twain, Mark\tSchriftsteller\t\n",
"Kumar, Lalit\t\t\n",
"Jemand\t\t\n"
]
}
],
"source": [
"openrefine-client --export \"lobid-gnd-reconciliation\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Apply rules from json file\n",
"\n",
"Download the sample JSON file (its content was previously extracted from the Undo/Redo history in the OpenRefine graphical user interface)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Download to file lobid-gnd-reconciliation-history.json complete\n"
]
}
],
"source": [
"openrefine-client --download \"https://gist.githubusercontent.com/felixlohmeier/065727cffeafb216c24f730c40f3b1f6/raw/5e245786cf273a967c9cd0c285f5a2e9f81f8439/lobid-gnd-reconciliation-history.json\" --output lobid-gnd-reconciliation-history.json"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Apply transformation rules"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"File lobid-gnd-reconciliation-history.json has been successfully applied to project 1615020900072\n"
]
}
],
"source": [
"openrefine-client --apply lobid-gnd-reconciliation-history.json \"lobid-gnd-reconciliation\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Export project to terminal again"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"name\tBeruf oder Beschäftigung\tGeburtsort\tSterbeort\tLändercode\n",
"Weizenbaum, Joseph\tInformatiker\tBerlin\tBerlin\tUSA\n",
"\tMathematiker\t\t\tDeutschland\n",
"Twain, Mark\tLotse\tFlorida, Mo.\tRedding, Conn.\tUSA\n",
"\tSchriftsteller\t\t\t\n",
"\tDrucker\t\t\t\n",
"\tJournalist\t\t\t\n",
"\tSoldat\t\t\t\n",
"Kumar, Lalit\tElektroingenieur\tDelhi\t\tIndien\n"
]
}
],
"source": [
"openrefine-client --export \"lobid-gnd-reconciliation\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Export project to file\n",
"\n",
"Export data in CSV format (the filename ending given to `--output` determines the format)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Export to file lobid-gnd-reconciliation.csv complete\n"
]
}
],
"source": [
"openrefine-client --export \"lobid-gnd-reconciliation\" --output lobid-gnd-reconciliation.csv"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"name,Beruf oder Beschäftigung,Geburtsort,Sterbeort,Ländercode\n",
"\"Weizenbaum, Joseph\",Informatiker,Berlin,Berlin,USA\n",
",Mathematiker,,,Deutschland\n",
"\"Twain, Mark\",Lotse,\"Florida, Mo.\",\"Redding, Conn.\",USA\n",
",Schriftsteller,,,\n",
",Drucker,,,\n",
",Journalist,,,\n",
",Soldat,,,\n",
"\"Kumar, Lalit\",Elektroingenieur,Delhi,,Indien\n"
]
}
],
"source": [
"cat lobid-gnd-reconciliation.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cleanup"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Project 1615020900072 has been successfully deleted\n"
]
}
],
"source": [
"openrefine-client --delete \"lobid-gnd-reconciliation\""
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"rm lobid-gnd-reconciliation-data.csv lobid-gnd-reconciliation-history.json lobid-gnd-reconciliation.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting help"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Usage: openrefine-client [--help | OPTIONS]\n",
"\n",
"Script to provide a command line interface to an OpenRefine server.\n",
"\n",
"Options:\n",
" -h, --help show this help message and exit\n",
"\n",
" Connection options:\n",
" -H 127.0.0.1, --host=127.0.0.1\n",
" OpenRefine hostname (default: 127.0.0.1)\n",
" -P 3333, --port=3333\n",
" OpenRefine port (default: 3333)\n",
"\n",
" Commands:\n",
" -c [FILE], --create=[FILE]\n",
" Create project from file. The filename ending (e.g.\n",
" .csv) defines the input format\n",
" (csv,tsv,xml,json,txt,xls,xlsx,ods)\n",
" -l, --list List projects\n",
" --download=[URL] Download file from URL (e.g. example data). Combine\n",
" with --output to specify a filename.\n",
"\n",
" Commands with argument [PROJECTID/PROJECTNAME]:\n",
" -d, --delete Delete project\n",
" -f [FILE], --apply=[FILE]\n",
" Apply JSON rules to OpenRefine project\n",
" -E, --export Export project in tsv format to stdout.\n",
" -o [FILE], --output=[FILE]\n",
" Export project to file. The filename ending (e.g.\n",
" .tsv) defines the output format\n",
" (csv,tsv,xls,xlsx,html)\n",
" --template=[STRING]\n",
" Export project with templating. Provide (big) text\n",
" string that you enter in the *row template* textfield\n",
" in the export/templating menu in the browser app)\n",
" --info show project metadata\n",
"\n",
" General options:\n",
" --format=FILE_FORMAT\n",
" Override file detection (import: csv,tsv,xml,json\n",
" ,line-based,fixed-width,xls,xlsx,ods; export:\n",
" csv,tsv,html,xls,xlsx,ods)\n",
"\n",
" Create options:\n",
" --columnWidths=COLUMNWIDTHS\n",
" (txt/fixed-width), please provide widths in multiple\n",
" arguments, e.g. --columnWidths=7 --columnWidths=5\n",
" --encoding=ENCODING\n",
" (csv,tsv,txt), please provide short encoding name\n",
" (e.g. UTF-8)\n",
" --guessCellValueTypes=true/false\n",
" (xml,csv,tsv,txt,json, default: false)\n",
" --headerLines=HEADERLINES\n",
" (csv,tsv,txt/fixed-width,xls,xlsx,ods), default: 1,\n",
" default txt/fixed-width: 0\n",
" --ignoreLines=IGNORELINES\n",
" (csv,tsv,txt,xls,xlsx,ods), default: -1\n",
" --includeFileSources=true/false\n",
" (all formats), default: false\n",
" --limit=LIMIT (all formats), default: -1\n",
" --linesPerRow=LINESPERROW\n",
" (txt/line-based), default: 1\n",
" --processQuotes=true/false\n",
" (csv,tsv), default: true\n",
" --projectName=PROJECT_NAME\n",
" (all formats), default: filename\n",
" --projectTags=PROJECTTAGS\n",
" (all formats), please provide tags in multiple\n",
" arguments, e.g. --projectTags=beta\n",
" --projectTags=client1\n",
" --recordPath=RECORDPATH\n",
" (xml,json), please provide path in multiple arguments\n",
" without slashes, e.g. /collection/record/ should be\n",
" entered like this: --recordPath=collection\n",
" --recordPath=record, default xml: record, default\n",
" json: _ _\n",
" --separator=SEPARATOR\n",
" (csv,tsv), default csv: , default tsv: \\t\n",
" --sheets=SHEETS (xls,xlsx,ods), please provide sheets in multiple\n",
" arguments, e.g. --sheets=0 --sheets=1, default: 0\n",
" (first sheet)\n",
" --skipDataLines=SKIPDATALINES\n",
" (csv,tsv,txt,xls,xlsx,ods), default: 0, default line-\n",
" based: -1\n",
" --storeBlankCellsAsNulls=true/false\n",
" (csv,tsv,txt,xls,xlsx,ods), default: true\n",
" --storeBlankRows=true/false\n",
" (csv,tsv,txt,xls,xlsx,ods), default: true\n",
" --storeEmptyStrings=true/false\n",
" (xml,json), default: true\n",
" --trimStrings=true/false\n",
" (xml,json), default: false\n",
"\n",
" Templating options:\n",
" --mode=row-based/record-based\n",
" engine mode (default: row-based)\n",
" --prefix=PREFIX text string that you enter in the *prefix* textfield\n",
" in the browser app\n",
" --rowSeparator=ROWSEPARATOR\n",
" text string that you enter in the *row separator*\n",
" textfield in the browser app\n",
" --suffix=SUFFIX text string that you enter in the *suffix* textfield\n",
" in the browser app\n",
" --filterQuery=REGEX\n",
" Simple RegEx text filter on filterColumn, e.g. ^12015$\n",
" --filterColumn=COLUMNNAME\n",
" column name for filterQuery (default: name of first\n",
" column)\n",
" --facets=FACETS facets config in json format (may be extracted with\n",
" browser dev tools in browser app)\n",
" --splitToFiles=true/false\n",
" will split each row/record into a single file; it\n",
" specifies a presumably unique character series for\n",
" splitting; --prefix and --suffix will be applied to\n",
" all files; filename-prefix can be specified with\n",
" --output (default: %Y%m%d)\n",
" --suffixById=true/false\n",
" enhancement option for --splitToFiles; will generate\n",
" filename-suffix from values in key column\n",
"\n",
"Example data:\n",
" --download \"https://git.io/fj5hF\" --output=duplicates.csv\n",
" --download \"https://git.io/fj5ju\" --output=duplicates-deletion.json\n",
"\n",
"Basic commands:\n",
" --list # list all projects\n",
" --list -H 127.0.0.1 -P 80 # specify hostname and port\n",
" --create duplicates.csv # create new project from file\n",
" --info \"duplicates\" # show project metadata\n",
" --apply duplicates-deletion.json \"duplicates\" # apply rules in file to project\n",
" --export \"duplicates\" # export project to terminal in tsv format\n",
" --export --output=deduped.xls \"duplicates\" # export project to file in xls format\n",
" --delete \"duplicates\" # delete project\n",
"\n",
"Some more examples:\n",
" --info 1234567890123 # specify project by id\n",
" --create example.tsv --encoding=UTF-8\n",
" --create example.xml --recordPath=collection --recordPath=record\n",
" --create example.json --recordPath=_ --recordPath=_\n",
" --create example.xlsx --sheets=0\n",
" --create example.ods --sheets=0\n",
"\n",
"Example for Templating Export:\n",
" Cf. https://github.com/opencultureconsulting/openrefine-client#advanced-templating\n"
]
}
],
"source": [
"openrefine-client --help"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The [openrefine-client](https://github.com/opencultureconsulting/openrefine-client) is available as a single-file executable for Windows, macOS and Linux. Client and server can run on different machines; specify the host and port of the OpenRefine server with `-H` and `-P`, e.g. `-H 127.0.0.1 -P 80`.\n",
"\n",
"Please file an [issue](https://github.com/opencultureconsulting/openrefine-client/issues) if you are missing a feature in the command line interface or have found a bug. Questions are welcome, too!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Bash",
"language": "bash",
"name": "bash"
},
"language_info": {
"codemirror_mode": "shell",
"file_extension": ".sh",
"mimetype": "text/x-sh",
"name": "bash"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
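The notebook cells above can also be chained into a single Bash function. This is a minimal sketch under stated assumptions, not the author's script: the function name `run_gnd_reconciliation` and the `OPENREFINE_CLIENT` variable are hypothetical conveniences, while every flag and URL is copied from the notebook cells. It assumes an OpenRefine server is reachable at the client's default host and port.

```shell
#!/usr/bin/env bash
# Sketch: the notebook's steps as one function. Steps are chained with &&,
# so the run stops at the first command that fails.

# Hypothetical override so a different binary (or a stub) can be substituted.
OPENREFINE_CLIENT="${OPENREFINE_CLIENT:-openrefine-client}"

run_gnd_reconciliation() {
  local project="lobid-gnd-reconciliation"
  local base="https://gist.githubusercontent.com/felixlohmeier/065727cffeafb216c24f730c40f3b1f6/raw"
  local data="${project}-data.csv"
  local rules="${project}-history.json"
  local result="${project}.csv"

  # Download the sample data and the JSON rules, create the project,
  # apply the rules, export to CSV, then delete the project and inputs.
  "$OPENREFINE_CLIENT" --download "${base}/4923c19cf8bd78d53d211f046bda1afd11bf7b72/${data}" --output "$data" &&
  "$OPENREFINE_CLIENT" --download "${base}/5e245786cf273a967c9cd0c285f5a2e9f81f8439/${rules}" --output "$rules" &&
  "$OPENREFINE_CLIENT" --create "$data" --separator=";" --projectName="$project" &&
  "$OPENREFINE_CLIENT" --apply "$rules" "$project" &&
  "$OPENREFINE_CLIENT" --export "$project" --output "$result" &&
  "$OPENREFINE_CLIENT" --delete "$project" &&
  rm -f "$data" "$rules"
}

# Only run when the client is actually installed.
if command -v "$OPENREFINE_CLIENT" >/dev/null 2>&1; then
  run_gnd_reconciliation
else
  echo "$OPENREFINE_CLIENT not found on PATH; nothing was run" >&2
fi
```

Unlike the notebook's cleanup cell, this sketch keeps the exported `lobid-gnd-reconciliation.csv`, since that file is the result of the run. If a step fails after the project has been created, the project stays on the server and can be removed with `--delete`.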