@tmtmtmtm
Last active July 13, 2020 10:53
Add Members of the 11th Lithuanian Seimas to Wikidata

Step 1: Add a tracking page

The first step for something like this is always to capture what Wikidata already knows with a Listeria page, so we can track changes over time. Here that's WikiProject every politician/Lithuania/data/Seimas/11th. That initially has no members, which is sometimes a sign that the data has been entered in a different way: e.g. with start/end dates rather than legislative terms. But a check for that approach (https://w.wiki/Wmi) shows no entries either, so we're working from a clean slate, and can continue with the term-based approach already taken for the 12th Seimas.

Step 2: Look for a Wikipedia category

If any of the Wikipedias have a category of "Member of the 11th Seimas" or equivalent, we can bulk create some placeholder statements from that, but I wasn't able to find any.

Step 3: Scrape Wikipedia list page

Usually the best source of members of a legislature is the local-language wikipedia, but I couldn't find a list for the 11th Seimas on Lithuanian Wikipedia, only on English Wikipedia: https://en.wikipedia.org/wiki/Eleventh_Seimas_of_Lithuania#Members

So I'll start by scraping that: scraper.rb

Only about a third of the members in the resulting CSV have enwiki pages, and can therefore be automatically resolved to Wikidata IDs, but that's still a good start.

Step 4: Sanity-check the members

Usually this means checking the members who don't already have a suitable P39, but we already know (from Step 1) that none of these do, so we check all of them by generating a list of the IDs (xsv select id wikipedia.csv | egrep \^Q | sed -e 's/Q/wd:Q/' | xargs | pbcopy) and pasting them into a pre-canned query to create https://tinyurl.com/lt11members

We're looking here for anything that suggests a mis-matched Wikipedia link, usually pointing at someone different with the same name, but nothing here stands out as problematic.
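The pre-canned query itself is behind the tinyurl, so this is only a sketch of the shape such a sanity-check query takes — the fields selected here are my assumption, not the real query:

```sparql
# Paste the copied wd:Q… list into VALUES, then eyeball names,
# descriptions and dates for anything implausible.
SELECT ?item ?itemLabel ?itemDescription ?born ?died WHERE {
  VALUES ?item { wd:Q1 }              # ← placeholder; replace with the pasted list
  OPTIONAL { ?item wdt:P569 ?born }   # date of birth
  OPTIONAL { ?item wdt:P570 ?died }   # date of death
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,lt" . }
}
```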

Step 5: Create new P39 statements

We can use wikibase-cli to create new P39 ("position held") statements for each of these people.

add_P39.js:

module.exports = id => ({
  id,
  claims: {
    P39: {
      value: 'Q18507240',  // Member of the Seimas
      qualifiers: {
        P2937: 'Q64510',   // parliamentary term: 11th Seimas
      },
      references: {
        P143: 'Q328',      // imported from: English Wikipedia
        P4656: 'https://en.wikipedia.org/wiki/Eleventh_Seimas_of_Lithuania',
      },
    }
  }
})

That will add a P39 of "Member of the Seimas", with a "parliamentary term" qualifier to the 11th Seimas, and a reference to the English Wikipedia page we got these from.

And as we want to create these for everyone in the file (as none already have one), that's as simple as:

xsv select id wikipedia.csv | egrep \^Q | xargs -n 1 | wd ee add_P39.js --batch --summary "Add P39s from https://en.wikipedia.org/wiki/Eleventh_Seimas_of_Lithuania"

That went through as https://tools.wmflabs.org/editgroups/b/wikibase-cli/a8586007c1b7d/ (with 56 additions)

Step 6: Sanity Check our Party data

While waiting for the QueryService to catch up with that (so that we can then add party information for these people), we can check that the party IDs we got from the scraper are sane.

We take the list of those:

xsv select party wikipedia.csv | egrep \^Q | sort | uniq  | sed -e 's/Q/wd:Q/' | xargs  | tee >(pbcopy)

And paste them into another pre-canned query for this purpose: https://w.wiki/Wmm

Again, nothing looks out of place, so it should be safe to set those as P4100 qualifiers for the members.

Step 7: Compare Wikidata with Wikipedia

Once the QueryService has caught up with our Step 5 additions, we can generate a wikidata.json file from a SPARQL query (this would usually happen earlier, before adding anything at all, but it would have been blank then).

Then running the usual check-data script against that gets additional information to be added (in this case only the party info, as none of the constituencies in the Wikipedia page are links).

bundle exec ruby check-data.rb wikipedia.csv wikidata.json | fgrep P4100 | wd aq --batch --summary "Add missing Parliamentary Groups"
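check-data.rb itself isn't included in this gist, so this is only a sketch of the comparison step it performs, with assumed field names rather than its real interface:

```ruby
require 'csv'
require 'json'

# Hypothetical sketch of the check-data comparison: for each scraped member
# whose Wikidata P39 statement has no parliamentary-group qualifier yet,
# emit a "<statement-guid>  P4100  <party>" line suitable for piping to
# `wd aq`. Field names ('id', 'party', 'item', 'statement') are assumptions.
def missing_party_qualifiers(scraped, wikidata)
  claims = wikidata.group_by { |c| c['item'] }
  scraped.flat_map do |row|
    next [] unless row['party']
    (claims[row['id']] || [])
      .reject { |c| c['party'] }                               # already qualified
      .map { |c| "#{c['statement']}\tP4100\t#{row['party']}" } # needs the qualifier
  end
end
```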

Step 8: Update our Tracking Page

After another brief pause for the QueryService to catch up again, we can see how our tracking page now looks.

As expected, we have just over 50 members, most of whom have Party information, but I think we can get better than that.

Step 9: Build a lookup table for name matches

So far we've only been able to add people from the English Wikipedia table who themselves have pages on English Wikipedia, as it's via those that we get the relevant Wikidata IDs.

But many of the other members do already have Wikidata items, so it should be possible to reconcile lots of these automatically via a simple name comparison.

For that we want a list of all the labels used on anyone who is listed as having been a member of the Seimas:

all-members.sparql:

SELECT DISTINCT ?item ?itemLabel WHERE {
  ?item wdt:P39 wd:Q18507240 ; rdfs:label ?itemLabel .
}

There are a lot of duplicates there, even with the SELECT DISTINCT (I think because each of these labels has a distinct language code attached), so we want to do a bit of shaping on the output of that:

wd sparql all-members.sparql > /tmp/all-members.json 
jq -r '.[] | [.item.label, .item.value] | @csv' /tmp/all-members.json | sort | uniq > all-members.csv
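The same shaping can also be expressed as a few lines of Ruby — a sketch that assumes the same simplified .item.label / .item.value structure the jq filter reads:

```ruby
# Collapse the per-language label rows down to unique (label, item) pairs,
# mirroring the jq | sort | uniq pipeline above.
def unique_labels(results)
  results.map { |r| [r['item']['label'], r['item']['value']] }.uniq.sort
end
```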

That gets us a lookup table of all the labels currently in use, so we can adapt our scraper to look up missing people in that.
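How the adapted scraper consumes that table isn't shown above, so this is a minimal sketch under my own assumptions: exact label matching, with ambiguous names (the same label on more than one item) deliberately left unresolved for manual review:

```ruby
# Hypothetical sketch of the name-matching step: build a label → QID map
# from the (label, item) pairs in all-members.csv, dropping any label that
# maps to more than one distinct item, so only unambiguous names resolve.
def build_lookup(pairs)
  pairs.group_by { |label, _| label.strip }
       .select { |_, rows| rows.map(&:last).uniq.size == 1 }
       .transform_values { |rows| rows.first.last }
end
```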

Impressively that actually gets us to 100% coverage of members, so we can now repeat the above adding-to-Wikidata steps again.

Step 10: Sanity Check the new Member IDs

Our new list (at https://tinyurl.com/lt11members2) again looks fine. There's one death date from 2012, but it does seem like he was a member then, so we're good to add these.

Step 11: Create the new P39s

We can't do this based on the output of xsv select this time (as some of those now already have P39s), so we need to run check-data.rb and cut'n'paste the IDs it warns have no P39s:

pbpaste | xargs -n 1 | wd ee add_P39.js --batch --summary "Add P39s from https://en.wikipedia.org/wiki/Eleventh_Seimas_of_Lithuania"

That ran as https://tools.wmflabs.org/editgroups/b/wikibase-cli/537f9c5eaf114, and added 99 new P39s.

Step 12: Add the missing party info

Again pausing to let the query service catch up (a good time to write up these notes!), and then rebuild our JSON:

wd sparql term-members.sparql > wikidata.json

And then

bundle exec ruby check-data.rb wikipedia.csv wikidata.json | fgrep P4100 | wd aq --batch --summary "Add missing Parliamentary Groups"

This ran as https://tools.wmflabs.org/editgroups/b/wikibase-cli/75c15c646de5f/ and added another 70 claims.

Step 13: Add missing start+end dates

The Wikipedia table has a 'notes' field that includes start and end dates for some members. This isn't well structured, but it's consistently formatted, and there are only simple cases (e.g. no-one with both a start and end date), so we can adjust the scraper to also grab those fairly easily.
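A sketch of that notes parsing, assuming wording like "From 15 March 2012" / "Until 4 May 2011" — the real table's phrasing may differ:

```ruby
require 'date'

# Turn a notes field into a start (P580) or end (P582) date qualifier.
# The "From …" / "Until …" wording is an assumption about the table's
# format, not taken from the real page. Returns nil for anything else.
def parse_note(note)
  case note.to_s.strip
  when /\AFrom (.+)\z/i  then ['P580', Date.parse($1).iso8601]
  when /\AUntil (.+)\z/i then ['P582', Date.parse($1).iso8601]
  end
end
```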

Then after rebuilding the CSV file (and refreshing the JSON):

bundle exec ruby check-data.rb wikipedia.csv wikidata.json | fgrep P58 | wd aq --batch --summary "Add missing start/end dates"

Running that as https://tools.wmflabs.org/editgroups/b/wikibase-cli/80336e527eb3e/ gives 31 new qualifiers.

(I considered having a default start and end date based on the term period and filling those in for anyone who doesn't have something different in their note fields, but I'm not sure this table is reliable enough for that. It's one thing to trust that where data does appear it's probably accurate, but that's not quite the same as assuming that where data doesn't appear, that has an implied accuracy too.)

Step 14: Track down missing Party info

After all this our tracking page now shows 155 members, although it looks like around 30 of those don't have party information, so I'll need to investigate why those aren't getting picked up.

It looks like that was down to the scraper using old versions of some libraries. After fixing that, re-running the scraper finds IDs for several more parties, and re-running the add-missing-party-info step gives https://tools.wmflabs.org/editgroups/b/wikibase-cli/4a554acefda4f/ with 31 more qualifiers.

Step 15: Clean up bare memberships

We've created some term-based P39 statements for people who already had a pre-existing "empty" P39 statement for simply having been a member of the Seimas at some unspecified time. As a final step we want to remove those bare statements that have now been superseded by a more specific one.

Given our standard term-and-no-term-P39.js template:

module.exports = membership => `
  SELECT DISTINCT ?item ?bare_ps ?term_ps WHERE {
    {
      SELECT DISTINCT ?item ?bare_ps WHERE {
        ?item p:P39 ?bare_ps .
        ?bare_ps ps:P39 wd:${membership} .
        FILTER NOT EXISTS {
          ?bare_ps ?pq_qual [] .
          ?pq_qual ^wikibase:qualifier [] .
        }
      }
    }

    {
      SELECT DISTINCT ?item ?term_ps WHERE {
        ?item p:P39 ?term_ps .
        ?term_ps ps:P39 wd:${membership} ; pq:P2937 [] .
      }
    }
  }
`

We can run:

wd sparql term-and-no-term-P39.js Q18507240 | jq -r '.[] | "\(.bare_ps)"' | sort | uniq | wd rc --batch --summary "Remove bare P39s where qualified one exists"

That ran as https://tools.wmflabs.org/editgroups/b/wikibase-cli/8d659f84964eb and removed 122 statements.

Step 16: Refresh tracking page

Our final version for today is at https://www.wikidata.org/w/index.php?title=Wikidata:WikiProject_every_politician/Lithuania/data/Seimas/11th&oldid=1230748440

A couple of people are still missing party information (at a quick glance those seem to be independents), and we don't have electoral district information yet, but that's still a lot nicer than where we were a couple of hours ago.
