Skip to content

Instantly share code, notes, and snippets.

@VladimirAlexiev
Created March 30, 2017 06:54
Show Gist options
  • Save VladimirAlexiev/8201d614a819cb7d4023ce9aa315af65 to your computer and use it in GitHub Desktop.
Save VladimirAlexiev/8201d614a819cb7d4023ce9aa315af65 to your computer and use it in GitHub Desktop.
How to use Google Sheets to Manage Wikidata Coreferencing

How to use Google Sheets to Manage Wikidata Coreferencing

A previous post How to Add Museum IDs to Wikidata explained how to use SPARQL to find missing data on Wikidata (Getty Museum IDs), how to create such values (from museum webpage URLs) and how to format them properly for QuickStatements.

Here I explain how to use Google sheets to manage a more advanced task. The sheet AAT-Wikidata matches about 3k AAT concepts to Wikipedia, WordNet30 and BabelNet (it restored an old mapping to Wordnet, retrieved it from BabelNet, mapped to Wikipedia).

  • For each row, it uses the following Google sheet formula (column C) to query the Wikipedia API and get the corresponding Wikidata ID (wikibase_item); split on two lines for readability:
=ImportXml(concat("https://en.wikipedia.org/w/api.php?action=query&prop=pageprops&
ppprop=wikibase_item&redirects=1&format=xml&titles=",G1),"//@wikibase_item")
  • This formula asks for results in XML format, and the part //@wikibase_item is a XPath that fetches the WD ID from the resulting XML.
  • Making 3k API calls is slow, so Google sheet initially shows "Loading…" for all rows, and gradually "materializes" the WD IDs (Qnnnn) as they come in. I have periodically sorted the column and used "Edit> Paste special> Values only" for the "materialized" IDs in order to fix them and not cause re-fetching next time when I open the google sheet.
  • Columns C,D,E are specially formatted to produce the required QuickStatements tab-delimited format. E.g. for row 62 AAT "patrol_wagons" corresponding to WP "Police_van": Q1023646 P1014 "300212831"

The benefit of Google sheets is that they allow easy addition of columns and convenient facilities for manual tasks:

  • Collaborative editing by several people at once.
  • Column A for tracking which rows are checked, which are already inserted to WD, etc
  • Using filters to find rows of interest. E.g. check=1 means rows that are manually checked and ready for insertion to QuickStatements. After insertion, I change it to check=2 to mark it as already inserted.
  • Column B for tracking already existing WD IDs
  • Conditional formatting to colour existing WD IDs (column B) that differ from my idea what is the matching WD ID (column C) and therefore must be checked.
@VladimirAlexiev
Copy link
Author

wd-gsheet-conditional-formatting

@VladimirAlexiev
Copy link
Author

To count in Wikidata instead of Wikipedia (answer to how-does-xpath-deal-with-xml-namespaces):

I use /*[name()='...'] in a google sheet to fetch some counts from Wikidata. I have a table like this

 thes    WD prop links   items
 NOM     P7749   3925    3789
 AAT     P1014   21157   20224

and the formulas in cols links and items are

=IMPORTXML("https://query.wikidata.org/sparql?query=SELECT(COUNT(*)as?c){?item wdt:"&$B14&"[]}","//*[name()='literal']")
=IMPORTXML("https://query.wikidata.org/sparql?query=SELECT(COUNT(distinct?item)as?c){?item wdt:"&$B14&"[]}","//*[name()='literal']")

respectively. The SPARQL query happens not to have any spaces...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment