augmenting wikipedia infobox links with Wikidata IDs

wtf_wikipedia is a wonderful tool for extracting structured data from Wikipedia pages. One of the main ways I use it is to extract information from politicians' infoboxes about the positions they've held, to compare this with what Wikidata knows.

To make processing these a lot simpler, I've often wished that the JSON returned from wtf_wikipedia could be augmented with the Wikidata IDs for any linked item. So, for example, when getting officeholder data for Kaja Kallas, instead of

          "office": {
            "text": "19th Prime Minister of Estonia",
            "links": [
              {
                "type": "internal",
                "page": "Prime Minister of Estonia"
              }
            ]
          }

I'd like to have

          "office": {
            "text": "19th Prime Minister of Estonia",
            "links": [
              {
                "type": "internal",
                "page": "Prime Minister of Estonia",
                "wikidata": "Q737115"
              }
            ]
          }

Today I worked out a series of transformations to make this happen.

1. Find all the page links

The first thing needed is a list of all the Wikipedia links that need to be turned into Wikidata IDs. Given a directory full of saved output from running wtf_wikipedia against multiple pages, these can be found with:

  jq -r '.. | .links? | .[]? | select(.type=="internal") | .page' wtf/*.json | sort | uniq
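
For my set of pages this yields a deduplicated list of page titles, one per line; a small illustrative slice:

  Prime Minister of Estonia
  Raivo Aeg
  Tiit Terik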

2. Look up their Wikidata IDs

There are a few different ways of turning Wikipedia page titles into Wikidata items, for example using the Wikipedia API. To avoid fiddling with pagination issues here, I decided to use SPARQL to look them up, with a query derived from the "Wikidata items of Wikipedia articles" example query.

Given the wikibase-cli sparql file enwiki-to-wikidata.js:

module.exports = (...titles) => {
  // quote each title as an @en language-tagged string for the VALUES clause
  titles = titles.map(value => `"${value}"@en`).join(' ')

  return `
    SELECT ?title ?item WHERE {
      VALUES ?title { ${titles} }

      ?sitelink schema:about ?item;
        schema:isPartOf <https://en.wikipedia.org/>;
        schema:name ?title.
    }`
}
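
Before wiring this into the pipeline, the helper can be tried directly, since wd sparql passes any extra command-line arguments through to the exported function:

  wd sparql helpers/enwiki-to-wikidata.js "Prime Minister of Estonia" "Raivo Aeg"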

I can then turn the output of the previous command (taking care to quote all the strings, as many of them contain spaces) into a JSON mapping file:

  jq -r '.. | .links? | .[]? | select(.type=="internal") | .page' wtf/*.json | sort | uniq |
    sed 's/^/"/; s/$/"/' |
    xargs wd sparql helpers/enwiki-to-wikidata.js > /tmp/WDIDs.json
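
The sed step is purely for xargs' benefit: wrapping each title in double quotes means a multi-word title arrives as a single argument rather than being split on spaces. For instance:

  echo 'Prime Minister of Estonia' | sed 's/^/"/; s/$/"/'
  # => "Prime Minister of Estonia"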

3. Turn that into a simpler lookup format

The output from the SPARQL query is in the format:

[
  {
    "title": "Raivo Aeg",
    "item": "Q12373459"
  },
  {
    "title": "Tiit Terik",
    "item": "Q12376941"
  }
]

But to make lookups easier, I want to collapse that into a simpler hash in the format:

{
  "Raivo Aeg": "Q12373459",
  "Tiit Terik": "Q12376941"
}

This seems like it should be really simple with jq, but I struggled a bit, ending up with an approach that seems much clumsier than I would have expected (suggestions for improving this, or indeed any of the workflow here, are very welcome):

  jq 'reduce .[] as $item ({}; .[$item.title] = $item.item)' /tmp/WDIDs.json > /tmp/LOOKUP.json
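
A quick sanity check on the result, assuming the Kaja Kallas page from the earlier example was among the inputs:

  jq '."Prime Minister of Estonia"' /tmp/LOOKUP.json
  # "Q737115"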

4. Adding the Wikidata IDs back into the JSON

This lookup file can then be passed as an argument to jq and used to inject the relevant "wikidata" ID into every "links" entry:

for j in wtf/*.json
do
  jq --argfile lookup /tmp/LOOKUP.json '
    walk(
      if type == "object" and .links?
        then .links |= map(. + { wikidata: ($lookup[.page // ""] // "") })
        else .
      end
    )
  ' "$j" | ifne sponge "$j"
done

And, hey presto, all the JSON files now include the Wikidata IDs for all referenced Wikipedia pages.
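
As a final check, any internal link whose page didn't resolve to an ID (a redirect, say) is easy to spot, since the loop above fills in an empty string for those:

  jq -r '.. | .links? | .[]? | select(.type=="internal" and .wikidata=="") | .page' wtf/*.json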
