@tmtmtmtm
Last active September 23, 2018 12:13
Adding Northern Ireland Assembly IDs to Wikidata

Last week we got a new Wikidata property for "Northern Ireland Assembly ID" (P5870).

My goal this morning was to add that for all members, current and historic.

Step 1: Get the IDs

The Northern Ireland Assembly has a useful API that can give us all this data, so I knocked up a quick script to get them all: https://github.com/everypolitician-scrapers/ni-assembly-members/blob/master/scraper.rb

I've written hundreds of scrapers like this, and can cobble one together from previous versions really quickly. The only part that took any time was splitting out the prefixes and suffixes from people's names, but even that was based on a previous scraper.
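As a rough illustration of the prefix/suffix splitting (this is not the code from the linked scraper; the honorific lists and the `split_name` helper are assumptions made up for this sketch), something along these lines would do the job:

```ruby
# Hypothetical honorific lists; the real scraper's lists may well differ.
PREFIXES = %w[Mr Mrs Ms Miss Dr Lord Sir Rev].freeze
SUFFIXES = %w[MLA MP MBE OBE CBE].freeze

# Split a display name into its prefix, bare name, and suffix parts.
# Always leaves at least one word in the bare name.
def split_name(full)
  words = full.split
  prefixes = []
  suffixes = []
  # Peel known honorifics off the front, ignoring a trailing full stop ("Dr.")
  prefixes << words.shift while words.size > 1 && PREFIXES.include?(words.first.delete('.'))
  # Peel known post-nominals off the end, keeping their original order
  suffixes.unshift(words.pop) while words.size > 1 && SUFFIXES.include?(words.last.delete('.'))
  { prefix: prefixes.join(' '), name: words.join(' '), suffix: suffixes.join(' ') }
end

p split_name('Mr Mark Durkan MLA')
```

The input string here is only an example of the kind of name the API might return. Keeping the bare name separate matters later, because the reconciliation step matches on names.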

Step 2: Run the scraper on Morph.io

For a one-off this wouldn't be entirely necessary, but part of this process involves also integrating the data into EveryPolitician, rather than just adding it directly to Wikidata. Adding it to Morph takes care of running the scraper every day, triggering a webhook when it's finished, giving us an API etc: https://morph.io/everypolitician-scrapers/ni-assembly-members
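To give an idea of what "giving us an API" means in practice, here is a minimal sketch that builds a Morph.io data API URL for this scraper. The `morph_url` helper is hypothetical, the API key is a placeholder, and I'm assuming the scraper writes to a table called `data` (Morph's default):

```ruby
require 'uri'

# Build a Morph.io data API URL for a scraper.
# 'data' is assumed to be the table name; the key must be your own.
def morph_url(scraper, api_key, query = 'SELECT * FROM data')
  URI::HTTPS.build(
    host: 'api.morph.io',
    path: "/#{scraper}/data.json",
    query: URI.encode_www_form(key: api_key, query: query)
  )
end

url = morph_url('everypolitician-scrapers/ni-assembly-members', 'YOUR_API_KEY')
puts url
```

With a real key, fetching the rows would then be roughly a matter of `Net::HTTP.get(url)` followed by `JSON.parse`.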

Step 3: Integrate the data into EveryPolitician

EveryPolitician takes data from a variety of sources, and merges them all together. To make this possible it has a built-in reconciliation tool, which makes it really easy to map the IDs from this source to those we already have. (Other possible approaches would include Mix'n'Match and OpenRefine.)

The vast majority of people here had an exact, or very close, name match against the existing EveryPolitician Northern Ireland data, but there were a few edge cases that needed manual intervention:

  • There have been two MLAs called Mark Durkan, so I needed to take care to make sure each gets reconciled correctly
  • There were a few MLAs who appear under the names they took when becoming Lords (Lord Bannside, Lord Morrow of Clogher Valley, Lord Empey of Shandon, Lord Kilclooney of Armagh)
  • Pam Cameron has previously been called Pam Lewis and Pam Brown
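The EveryPolitician reconciliation tool is interactive, so the following is not how it works internally. It is only a sketch of the underlying idea, under the assumption that each existing person carries a list of known names: match on a normalised name, and send anything with zero or several candidates to manual review. The `normalise` and `reconcile` helpers and all the IDs are invented for illustration.

```ruby
# Normalise a name for comparison: lower-case, drop punctuation, squeeze spaces.
def normalise(name)
  name.downcase.gsub(/[^\p{L}\s]/, '').split.join(' ')
end

# existing: array of { id:, names: [...] }, where names includes alternative names.
# incoming: array of { aims_id:, name: } rows from the scraper.
# Returns [matched, needs_review].
def reconcile(incoming, existing)
  index = Hash.new { |h, k| h[k] = [] }
  existing.each do |person|
    person[:names].each { |n| index[normalise(n)] << person[:id] }
  end

  matched = {}
  needs_review = []
  incoming.each do |row|
    candidates = index.fetch(normalise(row[:name]), []).uniq
    if candidates.size == 1
      matched[row[:aims_id]] = candidates.first
    else
      # No candidate at all, or several people sharing the same name
      needs_review << row
    end
  end
  [matched, needs_review]
end

existing = [
  { id: 'ep-1', names: ['Pam Cameron', 'Pam Lewis', 'Pam Brown'] },
  { id: 'ep-2', names: ['Mark Durkan'] },
  { id: 'ep-3', names: ['Mark Durkan'] }
]
incoming = [
  { aims_id: 'a-1', name: 'Pam Brown' },
  { aims_id: 'a-2', name: 'Mark Durkan' }
]

matched, needs_review = reconcile(incoming, existing)
p matched
p needs_review
```

Here the Pam Brown row matches automatically through the alternative names, while the Mark Durkan row is held back because two people share that name.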

Step 4: Detour to bring EveryPolitician and Wikidata up to date

After reconciliation there were still two members unmatched: John Blair and Emma Rogan. Both of them joined the Assembly mid-term, and neither EveryPolitician nor Wikidata was aware that they were now members. So I took a detour to make sure that both of these sources were up to date in their lists of members, then re-ran the reconciliation step to match up the final two.

Step 5: Generate Wikidata → AIMS ID map

Now that the data is all included in EveryPolitician this is fairly trivial:

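require "everypolitician/popolo" # assumes the everypolitician-popolo gem

# One QuickStatements line per person with both a Wikidata ID and an AIMS ID
puts EveryPolitician::Popolo.read("ep-popolo-v1.0.json").persons.
      select(&:wikidata).
      select { |p| p.identifier("aims") }.
      map { |p| %Q(%s\tP5870\t"%s") % [p.wikidata, p.identifier("aims")] }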

Step 6: Feed the ID map to QuickStatements

The mapping generated at Step 5 is in the format QuickStatements requires. It takes care of adding all these IDs for us (skipping any that already exist).
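For reference, each line in that format is an item ID, a property ID, and a double-quoted string value, separated by tabs. A small sketch of building and sanity-checking such a line follows; the `quickstatement` helper is hypothetical, and the example uses the Wikidata Sandbox item with a placeholder value rather than a real member's ID.

```ruby
# Build one QuickStatements line: item, property, quoted string value,
# separated by tabs. Raises if the item or property ID looks malformed.
def quickstatement(qid, property, value)
  raise ArgumentError, "bad item id: #{qid}" unless qid.match?(/\AQ\d+\z/)
  raise ArgumentError, "bad property id: #{property}" unless property.match?(/\AP\d+\z/)
  %(#{qid}\t#{property}\t"#{value}")
end

# Q4115189 is the Wikidata Sandbox item; "0000" is a placeholder value.
puts quickstatement('Q4115189', 'P5870', '0000')
```

The quotes around the value are what mark it as a string rather than an item or a number, which is what an external ID property needs.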

Step 7: Update the Members' Report to show this ID

Wikidata already has a list of all the members, so I added an extra column to that to show the official Assembly ID. (I had thought that Listeria would automatically turn this into a link, but either I imagined that, or I'm missing some option somewhere).

Possible Next Steps

  • Write a Prompt to automatically compare the Assembly's list of members with Wikidata's, to help prevent Wikidata drifting out of date in future. This is much simpler to do now that Wikidata has a property that can tie the Assembly IDs to Wikidata items.
  • Investigate why Listeria isn't turning the ID into a link
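The comparison in the first of those could be as simple as a pair of set differences over the Assembly IDs. This is only a sketch under the assumption that both lists have already been fetched (one from the scraper, one from Wikidata via the new property); `compare_ids` and the sample IDs are made up for illustration.

```ruby
require 'set'

# Compare the Assembly's member IDs with the Assembly IDs recorded on Wikidata.
# IDs are compared as strings, so integers and strings can be mixed.
def compare_ids(assembly_ids, wikidata_ids)
  assembly = assembly_ids.map(&:to_s).to_set
  wikidata = wikidata_ids.map(&:to_s).to_set
  {
    missing_from_wikidata: (assembly - wikidata).sort,
    unknown_to_assembly: (wikidata - assembly).sort
  }
end

p compare_ids([1, 2, 3], %w[2 3 4])
```

Anything in `missing_from_wikidata` would be a candidate for the kind of detour described in Step 4.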