Last active
August 19, 2019 13:13
Automate GND reconciliation for OpenRefine
name;beruf;ort
J. Weizenbaum;Informatiker;Berlin
Twain, Mark;Schriftsteller;
Kumar, Lalit;;
Jemand;;
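The sample data is semicolon-separated, which is why the import step in the notebook passes `--separator=";"`. A minimal sketch for checking the field layout before import, assuming a POSIX shell with `awk`; the file name `sample.csv` is a hypothetical stand-in and the data is recreated inline:

```shell
# Recreate the sample data (sample.csv is a stand-in name for this sketch).
cat > sample.csv <<'EOF'
name;beruf;ort
J. Weizenbaum;Informatiker;Berlin
Twain, Mark;Schriftsteller;
Kumar, Lalit;;
Jemand;;
EOF

# Count data rows (header excluded) and lines that do not have exactly 3 fields.
rows=$(awk -F';' 'NR > 1 { n++ } END { print n + 0 }' sample.csv)
bad=$(awk -F';' 'NF != 3 { n++ } END { print n + 0 }' sample.csv)
echo "rows: $rows, lines with unexpected field count: $bad"

# Remove the temporary file again.
rm -f sample.csv
```

The row count should agree with the `rows:` value that the client reports after creating the project.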
[
  {
    "op": "core/recon",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "name",
    "config": {
      "mode": "standard-service",
      "service": "https://lobid.org/gnd/reconcile",
      "identifierSpace": "https://lobid.org/gnd",
      "schemaSpace": "https://lobid.org/gnd",
      "type": {
        "id": "Person",
        "name": "Person"
      },
      "autoMatch": true,
      "columnDetails": [
        {
          "column": "beruf",
          "propertyName": "Beruf oder Beschäftigung (Literal)",
          "propertyID": "professionOrOccupationAsLiteral"
        }
      ],
      "limit": 0
    },
    "description": "Reconcile cells in column name to type Person"
  },
  {
    "op": "core/extend-reconciled-data",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "baseColumnName": "name",
    "endpoint": "https://lobid.org/gnd/reconcile",
    "identifierSpace": "https://lobid.org/gnd",
    "schemaSpace": "https://lobid.org/gnd",
    "extension": {
      "properties": [
        {
          "id": "professionOrOccupation",
          "name": "Beruf oder Beschäftigung"
        },
        {
          "id": "placeOfBirth",
          "name": "Geburtsort"
        },
        {
          "id": "placeOfDeath",
          "name": "Sterbeort"
        },
        {
          "id": "geographicAreaCode",
          "name": "Ländercode"
        }
      ]
    },
    "columnInsertIndex": 1,
    "description": "Extend data at index 1 based on column name"
  },
  {
    "op": "core/row-removal",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "name: judgment",
          "expression": "forNonBlank(cell.recon.judgment, v, v, if(isNonBlank(value), \"(unreconciled)\", \"(blank)\"))",
          "columnName": "name",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": "none",
                "l": "none"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "row-based"
    },
    "description": "Remove rows"
  },
  {
    "op": "core/column-removal",
    "columnName": "beruf",
    "description": "Remove column beruf"
  },
  {
    "op": "core/column-removal",
    "columnName": "ort",
    "description": "Remove column ort"
  }
]
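The history above consists of five operations that are applied in order: reconcile the `name` column, extend the data with GND properties, remove rows whose reconciliation judgment is `none`, then drop the `beruf` and `ort` columns. A rough sketch for listing the operation names of such a file before applying it, assuming a POSIX shell; the file `history-abbrev.json` is a hypothetical, abbreviated copy that keeps only the `op` and `description` fields:

```shell
# Abbreviated copy of the history: only "op" and "description" are kept.
cat > history-abbrev.json <<'EOF'
[
  { "op": "core/recon", "description": "Reconcile cells in column name to type Person" },
  { "op": "core/extend-reconciled-data", "description": "Extend data at index 1 based on column name" },
  { "op": "core/row-removal", "description": "Remove rows" },
  { "op": "core/column-removal", "description": "Remove column beruf" },
  { "op": "core/column-removal", "description": "Remove column ort" }
]
EOF

# Extract the operation names in file order (plain text matching, not JSON parsing).
ops=$(grep -o '"op": *"[^"]*"' history-abbrev.json | cut -d'"' -f4)
echo "$ops"

# Remove the temporary file again.
rm -f history-abbrev.json
```

Because this relies on text matching, it only works for files formatted like the export above; a real JSON parser would be the more robust choice.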
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Automate GND reconciliation for OpenRefine in a Linux Bash environment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preparations\n",
    "\n",
    "Ensure you have an OpenRefine server running. Then install the OpenRefine client as follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2019-08-19 13:11:22 URL:https://github-production-release-asset-2e65be.s3.amazonaws.com/80617276/11234c80-c030-11e9-8d8d-6b20776f164f?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20190819%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20190819T131122Z&X-Amz-Expires=300&X-Amz-Signature=9d24ce810d3d6acb6aff3430e75c5d98eea29e3ad689ae95e28c79a30bca4215&X-Amz-SignedHeaders=host&actor_id=0&response-content-disposition=attachment%3B%20filename%3Dopenrefine-client_0-3-7_linux&response-content-type=application%2Foctet-stream [4322528/4322528] -> \"/home/jovyan/.local/bin/openrefine-client\" [1]\n"
     ]
    }
   ],
   "source": [
    "wget -nv https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.7/openrefine-client_0-3-7_linux -O ~/.local/bin/openrefine-client\n",
    "chmod +x ~/.local/bin/openrefine-client"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create project\n",
    "\n",
    "Download sample data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Download to file lobid-gnd-reconciliation-data.csv complete\n"
     ]
    }
   ],
   "source": [
    "openrefine-client --download \"https://gist.githubusercontent.com/felixlohmeier/065727cffeafb216c24f730c40f3b1f6/raw/4923c19cf8bd78d53d211f046bda1afd11bf7b72/lobid-gnd-reconciliation-data.csv\" --output lobid-gnd-reconciliation-data.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import file into OpenRefine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id: 1615020900072\n",
      "rows: 4\n"
     ]
    }
   ],
   "source": [
    "openrefine-client --create lobid-gnd-reconciliation-data.csv --separator=\";\" --projectName=\"lobid-gnd-reconciliation\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Export project to terminal"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "name\tberuf\tort\n",
      "J. Weizenbaum\tInformatiker\tBerlin\n",
      "Twain, Mark\tSchriftsteller\t\n",
      "Kumar, Lalit\t\t\n",
      "Jemand\t\t\n"
     ]
    }
   ],
   "source": [
    "openrefine-client --export \"lobid-gnd-reconciliation\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Apply rules from JSON file\n",
    "\n",
    "Download the sample JSON file (its content was previously extracted via the Undo/Redo history in the OpenRefine graphical user interface)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Download to file lobid-gnd-reconciliation-history.json complete\n"
     ]
    }
   ],
   "source": [
    "openrefine-client --download \"https://gist.githubusercontent.com/felixlohmeier/065727cffeafb216c24f730c40f3b1f6/raw/5e245786cf273a967c9cd0c285f5a2e9f81f8439/lobid-gnd-reconciliation-history.json\" --output lobid-gnd-reconciliation-history.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Apply transformation rules"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "File lobid-gnd-reconciliation-history.json has been successfully applied to project 1615020900072\n"
     ]
    }
   ],
   "source": [
    "openrefine-client --apply lobid-gnd-reconciliation-history.json \"lobid-gnd-reconciliation\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Export project to terminal again"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "name\tBeruf oder Beschäftigung\tGeburtsort\tSterbeort\tLändercode\n",
      "Weizenbaum, Joseph\tInformatiker\tBerlin\tBerlin\tUSA\n",
      "\tMathematiker\t\t\tDeutschland\n",
      "Twain, Mark\tLotse\tFlorida, Mo.\tRedding, Conn.\tUSA\n",
      "\tSchriftsteller\t\t\t\n",
      "\tDrucker\t\t\t\n",
      "\tJournalist\t\t\t\n",
      "\tSoldat\t\t\t\n",
      "Kumar, Lalit\tElektroingenieur\tDelhi\t\tIndien\n"
     ]
    }
   ],
   "source": [
    "openrefine-client --export \"lobid-gnd-reconciliation\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Export project to file\n",
    "\n",
    "Export data in CSV (.csv) format; the filename ending passed to `--output` defines the output format"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Export to file lobid-gnd-reconciliation.csv complete\n"
     ]
    }
   ],
   "source": [
    "openrefine-client --export \"lobid-gnd-reconciliation\" --output lobid-gnd-reconciliation.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "name,Beruf oder Beschäftigung,Geburtsort,Sterbeort,Ländercode\n",
      "\"Weizenbaum, Joseph\",Informatiker,Berlin,Berlin,USA\n",
      ",Mathematiker,,,Deutschland\n",
      "\"Twain, Mark\",Lotse,\"Florida, Mo.\",\"Redding, Conn.\",USA\n",
      ",Schriftsteller,,,\n",
      ",Drucker,,,\n",
      ",Journalist,,,\n",
      ",Soldat,,,\n",
      "\"Kumar, Lalit\",Elektroingenieur,Delhi,,Indien\n"
     ]
    }
   ],
   "source": [
    "cat lobid-gnd-reconciliation.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cleanup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Project 1615020900072 has been successfully deleted\n"
     ]
    }
   ],
   "source": [
    "openrefine-client --delete \"lobid-gnd-reconciliation\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "rm lobid-gnd-reconciliation-data.csv lobid-gnd-reconciliation-history.json lobid-gnd-reconciliation.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting help"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Usage: openrefine-client [--help | OPTIONS]\n",
      "\n",
      "Script to provide a command line interface to an OpenRefine server.\n",
      "\n",
      "Options:\n",
      " -h, --help show this help message and exit\n",
      "\n",
      " Connection options:\n",
      " -H 127.0.0.1, --host=127.0.0.1\n",
      " OpenRefine hostname (default: 127.0.0.1)\n",
      " -P 3333, --port=3333\n",
      " OpenRefine port (default: 3333)\n",
      "\n",
      " Commands:\n",
      " -c [FILE], --create=[FILE]\n",
      " Create project from file. The filename ending (e.g.\n",
      " .csv) defines the input format\n",
      " (csv,tsv,xml,json,txt,xls,xlsx,ods)\n",
      " -l, --list List projects\n",
      " --download=[URL] Download file from URL (e.g. example data). Combine\n",
      " with --output to specify a filename.\n",
      "\n",
      " Commands with argument [PROJECTID/PROJECTNAME]:\n",
      " -d, --delete Delete project\n",
      " -f [FILE], --apply=[FILE]\n",
      " Apply JSON rules to OpenRefine project\n",
      " -E, --export Export project in tsv format to stdout.\n",
      " -o [FILE], --output=[FILE]\n",
      " Export project to file. The filename ending (e.g.\n",
      " .tsv) defines the output format\n",
      " (csv,tsv,xls,xlsx,html)\n",
      " --template=[STRING]\n",
      " Export project with templating. Provide (big) text\n",
      " string that you enter in the *row template* textfield\n",
      " in the export/templating menu in the browser app)\n",
      " --info show project metadata\n",
      "\n",
      " General options:\n",
      " --format=FILE_FORMAT\n",
      " Override file detection (import: csv,tsv,xml,json\n",
      " ,line-based,fixed-width,xls,xlsx,ods; export:\n",
      " csv,tsv,html,xls,xlsx,ods)\n",
      "\n",
      " Create options:\n",
      " --columnWidths=COLUMNWIDTHS\n",
      " (txt/fixed-width), please provide widths in multiple\n",
      " arguments, e.g. --columnWidths=7 --columnWidths=5\n",
      " --encoding=ENCODING\n",
      " (csv,tsv,txt), please provide short encoding name\n",
      " (e.g. UTF-8)\n",
      " --guessCellValueTypes=true/false\n",
      " (xml,csv,tsv,txt,json, default: false)\n",
      " --headerLines=HEADERLINES\n",
      " (csv,tsv,txt/fixed-width,xls,xlsx,ods), default: 1,\n",
      " default txt/fixed-width: 0\n",
      " --ignoreLines=IGNORELINES\n",
      " (csv,tsv,txt,xls,xlsx,ods), default: -1\n",
      " --includeFileSources=true/false\n",
      " (all formats), default: false\n",
      " --limit=LIMIT (all formats), default: -1\n",
      " --linesPerRow=LINESPERROW\n",
      " (txt/line-based), default: 1\n",
      " --processQuotes=true/false\n",
      " (csv,tsv), default: true\n",
      " --projectName=PROJECT_NAME\n",
      " (all formats), default: filename\n",
      " --projectTags=PROJECTTAGS\n",
      " (all formats), please provide tags in multiple\n",
      " arguments, e.g. --projectTags=beta\n",
      " --projectTags=client1\n",
      " --recordPath=RECORDPATH\n",
      " (xml,json), please provide path in multiple arguments\n",
      " without slashes, e.g. /collection/record/ should be\n",
      " entered like this: --recordPath=collection\n",
      " --recordPath=record, default xml: record, default\n",
      " json: _ _\n",
      " --separator=SEPARATOR\n",
      " (csv,tsv), default csv: , default tsv: \\t\n",
      " --sheets=SHEETS (xls,xlsx,ods), please provide sheets in multiple\n",
      " arguments, e.g. --sheets=0 --sheets=1, default: 0\n",
      " (first sheet)\n",
      " --skipDataLines=SKIPDATALINES\n",
      " (csv,tsv,txt,xls,xlsx,ods), default: 0, default line-\n",
      " based: -1\n",
      " --storeBlankCellsAsNulls=true/false\n",
      " (csv,tsv,txt,xls,xlsx,ods), default: true\n",
      " --storeBlankRows=true/false\n",
      " (csv,tsv,txt,xls,xlsx,ods), default: true\n",
      " --storeEmptyStrings=true/false\n",
      " (xml,json), default: true\n",
      " --trimStrings=true/false\n",
      " (xml,json), default: false\n",
      "\n",
      " Templating options:\n",
      " --mode=row-based/record-based\n",
      " engine mode (default: row-based)\n",
      " --prefix=PREFIX text string that you enter in the *prefix* textfield\n",
      " in the browser app\n",
      " --rowSeparator=ROWSEPARATOR\n",
      " text string that you enter in the *row separator*\n",
      " textfield in the browser app\n",
      " --suffix=SUFFIX text string that you enter in the *suffix* textfield\n",
      " in the browser app\n",
      " --filterQuery=REGEX\n",
      " Simple RegEx text filter on filterColumn, e.g. ^12015$\n",
      " --filterColumn=COLUMNNAME\n",
      " column name for filterQuery (default: name of first\n",
      " column)\n",
      " --facets=FACETS facets config in json format (may be extracted with\n",
      " browser dev tools in browser app)\n",
      " --splitToFiles=true/false\n",
      " will split each row/record into a single file; it\n",
      " specifies a presumably unique character series for\n",
      " splitting; --prefix and --suffix will be applied to\n",
      " all files; filename-prefix can be specified with\n",
      " --output (default: %Y%m%d)\n",
      " --suffixById=true/false\n",
      " enhancement option for --splitToFiles; will generate\n",
      " filename-suffix from values in key column\n",
      "\n",
      "Example data:\n",
      " --download \"https://git.io/fj5hF\" --output=duplicates.csv\n",
      " --download \"https://git.io/fj5ju\" --output=duplicates-deletion.json\n",
      "\n",
      "Basic commands:\n",
      " --list # list all projects\n",
      " --list -H 127.0.0.1 -P 80 # specify hostname and port\n",
      " --create duplicates.csv # create new project from file\n",
      " --info \"duplicates\" # show project metadata\n",
      " --apply duplicates-deletion.json \"duplicates\" # apply rules in file to project\n",
      " --export \"duplicates\" # export project to terminal in tsv format\n",
      " --export --output=deduped.xls \"duplicates\" # export project to file in xls format\n",
      " --delete \"duplicates\" # delete project\n",
      "\n",
      "Some more examples:\n",
      " --info 1234567890123 # specify project by id\n",
      " --create example.tsv --encoding=UTF-8\n",
      " --create example.xml --recordPath=collection --recordPath=record\n",
      " --create example.json --recordPath=_ --recordPath=_\n",
      " --create example.xlsx --sheets=0\n",
      " --create example.ods --sheets=0\n",
      "\n",
      "Example for Templating Export:\n",
      " Cf. https://github.com/opencultureconsulting/openrefine-client#advanced-templating\n"
     ]
    }
   ],
   "source": [
    "openrefine-client --help"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The [openrefine-client](https://github.com/opencultureconsulting/openrefine-client) is available as a single-file executable for Windows, macOS and Linux. Client and server can run on different machines; specify the host and port of the OpenRefine server with `-H` and `-P`, e.g. `-H 127.0.0.1 -P 80`.\n",
    "\n",
    "Please file an [issue](https://github.com/opencultureconsulting/openrefine-client/issues) if you are missing a feature in the command-line interface or have found a bug. Questions are welcome, too!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Bash",
   "language": "bash",
   "name": "bash"
  },
  "language_info": {
   "codemirror_mode": "shell",
   "file_extension": ".sh",
   "mimetype": "text/x-sh",
   "name": "bash"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
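The notebook steps (create, apply, export, delete) can also be chained in a single script. The following is only a sketch under stated assumptions: it uses just the `openrefine-client` options that appear in the notebook, expects the data and history files to have been downloaded already, and introduces a hypothetical `DRY_RUN` switch (not a feature of the client) that prints the commands instead of running them.

```shell
set -eu

# DRY_RUN is a hypothetical switch for this sketch, not an openrefine-client option.
# With the default of 1 the commands are only printed; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Names taken from the notebook.
PROJECT="lobid-gnd-reconciliation"
DATA="lobid-gnd-reconciliation-data.csv"
HISTORY="lobid-gnd-reconciliation-history.json"
RESULT="lobid-gnd-reconciliation.csv"

# Run a command, or print it (without shell quoting) when DRY_RUN=1.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Same order as the notebook: import, apply history, export to file, delete project.
workflow() {
  run openrefine-client --create "$DATA" --separator=";" --projectName="$PROJECT"
  run openrefine-client --apply "$HISTORY" "$PROJECT"
  run openrefine-client --export "$PROJECT" --output "$RESULT"
  run openrefine-client --delete "$PROJECT"
}

workflow
```

Because of `set -eu`, a real run (`DRY_RUN=0`) stops at the first failing step, so the project is not deleted if the export fails.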