wtf_wikipedia is a wonderful tool for extracting structured data from Wikipedia pages. One of the main ways I use it is to extract information from politicians' infoboxes about the positions they've held, so that I can compare this with what Wikidata knows.
To make processing these a lot simpler, I've often wished that the JSON returned from wtf_wikipedia
could be augmented with the Wikidata IDs for any linked item. So, for example, when getting officeholder data for Kaja Kallas, instead of
"office": {
"text": "19th Prime Minister of Estonia",
"links": [
{
"type": "internal",
"page": "Prime Minister of Estonia"
}
]
}
I'd like to have
"office": {
"text": "19th Prime Minister of Estonia",
"links": [
{
"type": "internal",
"page": "Prime Minister of Estonia",
"wikidata": "Q737115"
}
]
}
Today I worked out a series of transformations to make this happen.
The first thing needed is a list of all the Wikipedia links that need to be turned into Wikidata IDs. Given a directory full of saved output from running wtf_wikipedia against multiple pages, these can be found with:
jq -r '.. | .links? | .[]? | select(.type=="internal") | .page' wtf/*.json | sort | uniq
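Run against my saved files, that prints one page title per line, including entries like:

Prime Minister of Estonia
Raivo Aeg
Tiit Terik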
There are a few different ways of turning Wikipedia page titles into Wikidata items, for example using the Wikipedia API. To avoid fiddling with pagination issues here, I decided to use SPARQL to look them up, with a query derived from the "Wikidata items of Wikipedia articles" example query.
Given the wikibase-cli SPARQL file enwiki-to-wikidata.js:
// Build a SPARQL query mapping English Wikipedia article titles to Wikidata items
module.exports = (...titles) => {
  // Quote each title as an English-language string literal for the VALUES clause
  titles = titles.map(value => `"${value}"@en`).join(' ')
  return `
SELECT ?title ?item WHERE {
  VALUES ?title { ${titles} }
  ?sitelink schema:about ?item;
    schema:isPartOf <https://en.wikipedia.org/>;
    schema:name ?title.
}`
}
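Called with, say, "Raivo Aeg" and "Tiit Terik", that template expands to a query like:

SELECT ?title ?item WHERE {
  VALUES ?title { "Raivo Aeg"@en "Tiit Terik"@en }
  ?sitelink schema:about ?item;
    schema:isPartOf <https://en.wikipedia.org/>;
    schema:name ?title.
}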
I can then turn the output of the previous command (taking care to quote all the strings, as many of them contain spaces) into a JSON mapping file:
jq -r '.. | .links? | .[]? | select(.type=="internal") | .page' wtf/*.json | sort | uniq |
sed 's/^/"/; s/$/"/' |
xargs wd sparql helpers/enwiki-to-wikidata.js > /tmp/WDIDs.json
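As xargs simply appends each quoted title as a separate argument to the helper, this is equivalent to calling it directly with a handful of titles:

wd sparql helpers/enwiki-to-wikidata.js "Raivo Aeg" "Tiit Terik" > /tmp/WDIDs.json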
The output from the SPARQL query is in the format:
[
  {
    "title": "Raivo Aeg",
    "item": "Q12373459"
  },
  {
    "title": "Tiit Terik",
    "item": "Q12376941"
  }
]
But to make lookups easier, I want to collapse that into a simpler hash in the format:
{
  "Raivo Aeg": "Q12373459",
  "Tiit Terik": "Q12376941"
}
This seems like it should be really simple with jq, but I struggled a bit, ending up with an approach that seems much clumsier than I would have expected (suggestions for improving this, or indeed any of the workflow here, are very welcome):
jq 'reduce .[] as $item ({}; .[$item.title] = $item.item)' /tmp/WDIDs.json > /tmp/LOOKUP.json
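One shorter spelling that should do the same thing (a sketch using jq's built-in from_entries, which expects key/value pairs) would be:

jq 'map({key: .title, value: .item}) | from_entries' /tmp/WDIDs.json > /tmp/LOOKUP.json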
This lookup file can then be passed as an argument to jq and used to inject the relevant "wikidata" value into every "links" hash (ifne and sponge, both from moreutils, write the result back in place, and only when jq actually produced output):
for j in wtf/*.json
do
  jq --argfile lookup /tmp/LOOKUP.json '
    # add a wikidata ID to every internal link, defaulting to ""
    walk(
      if type == "object" and .links?
      then .links |= map(. + { wikidata: ($lookup[.page // ""] // "") })
      else .
      end
    )
  ' "$j" | ifne sponge "$j"
done
And, hey presto, all the JSON files now include the Wikidata IDs for all referenced Wikipedia pages.
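Not every title will necessarily have matched, so a variant of the first jq filter makes a handy sanity check, listing any internal links that were left with an empty ID:

jq -r '.. | .links? | .[]? | select(.type=="internal" and .wikidata=="") | .page' wtf/*.json | sort | uniq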