anton5798/wikidata_intro.md Secret

## wikidata_intro.md

      
    Raw
  

              wikidata_intro.md
            
          
    Intro to Wikidata

Table of contents:

Intro to Wikidata

Notes and useful links
Structure

Data model
A note on qualifiers
Details on datatypes and datavalues


Example queries

Nepotism


References & acknowledgements
Some known oddities


Notes and useful links


Wikidata main page, and help page. I recommend getting familiar with
at least a couple of sections from the help page before reading this document.
Example entity: https://www.wikidata.org/wiki/Q23434. From here you can easily navigate to other entities, properties, and external pages to get a sense of what is going on.
Wikidata has a great interface for querying the KG in SPARQL + a bunch of example queries which you can easily try: https://query.wikidata.org/. Wikidata is more than just a set of triplets, which is how we usually think of a KG. Thus, SPARQL seems like a convenient tool to use with the Wikidata KG. However, I think having a SQL interface does help too.

Structure

Data model

Note: there's a wikidata data model page on its data model. The page says it is outdated, but it is still quite relevant for most parts. However, our tables are a subset of what wikidata has to offer.
The database has the following tables: entities, labels, claims, and qualifiers. We are not using descriptions, aliases, and references for now.


Table entities consists of all items and properties. Two possible values in the type column are thus either item or property. Value
column contains distinct values for properties, and is empty for items. It is somewhat unnatural that items and properties are in the same relation here, but this agrees with wikidata data model. Take a look at all wikidata properties online.
In the dump I used, there are 10060 properties, and about 93 million items. Below you'll see I created a separate MV for properties.


Table labels consists of labels for each property and item. There might be multiple labels for a single item/property in multiple languages. There's at most one english label per entity (others are in aliases) Column labels.id rerefences entities.id.


Table claims is the largest relation and stores all triplets (relations) for our KG. It has ~1.379 billion rows.


Table qualifiers consists of all qualifiers to claims, i.e properties of triplets. There are ~313 million rows in this table.


The schema is shown below. Some columns in the claims and qualifiers columns will most likely be rarely used, but they're there for the sake of completeness.
\d entities
 Column |          Type          | Collation | Nullable | Default
--------+------------------------+-----------+----------+---------
 id     | character varying(255) |           | not null |
 type   | text                   |           |          |
 value  | text                   |           |          |
Indexes:
    "entities_pk" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "labels" CONSTRAINT "entities_fk" FOREIGN KEY (id) REFERENCES entities(id)
    TABLE "claims" CONSTRAINT "entity_id_fk" FOREIGN KEY (entity_id) REFERENCES entities(id)
    TABLE "claims" CONSTRAINT "property_fk" FOREIGN KEY (property) REFERENCES entities(id)
    TABLE "qualifiers" CONSTRAINT "property_fk" FOREIGN KEY (property) REFERENCES entities(id)
\d labels
  Column  |         Type          | Collation | Nullable | Default
----------+-----------------------+-----------+----------+---------
 id       | character varying(64) |           |          |
 language | character varying(16) |           |          |
 value    | text                  |           |          |
Indexes:
    "labels_id" btree (id)
    "labels_language" btree (language)
Foreign-key constraints:
    "entities_fk" FOREIGN KEY (id) REFERENCES entities(id)
> \d claims
      Column      |          Type          | Collation | Nullable | Default
------------------+------------------------+-----------+----------+---------
 entity_id        | character varying(64)  |           |          |
 id               | character varying(255) |           | not null |
 type             | text                   |           |          |
 rank             | text                   |           |          |
 snaktype         | text                   |           |          |
 property         | text                   |           |          |
 datavalue_string | text                   |           |          |
 datavalue_entity | text                   |           |          |
 datavalue_date   | text                   |           |          |
 datavalue_type   | text                   |           |          |
 datatype         | text                   |           |          |
Indexes:
    "claims_pk" PRIMARY KEY, btree (id)
    "claims_datavalue_entity" btree (datavalue_entity)
    "claims_entity_id" btree (entity_id)
    "claims_property" btree (property)
Foreign-key constraints:
    "entity_id_fk" FOREIGN KEY (entity_id) REFERENCES entities(id)
    "property_fk" FOREIGN KEY (property) REFERENCES entities(id)
Referenced by:
    TABLE "qualifiers" CONSTRAINT "claim_id_fk" FOREIGN KEY (claim_id) REFERENCES claims(id)
> \d qualifiers
       Column       |          Type          | Collation | Nullable | Default
--------------------+------------------------+-----------+----------+---------
 claim_id           | character varying(255) |           |          |
 property           | text                   |           |          |
 hash               | text                   |           |          |
 snaktype           | text                   |           |          |
 qualifier_property | text                   |           |          |
 datavalue_string   | text                   |           |          |
 datavalue_entity   | text                   |           |          |
 datavalue_date     | text                   |           |          |
 datavalue_type     | text                   |           |          |
 datatype           | text                   |           |          |
 order_hash         | text                   |           |          |
Indexes:
    "qualifiers_datavalue_entity" btree (datavalue_entity)
    "qualifiers_idx_claim_id" btree (claim_id)
    "qualifiers_property" btree (property)
Foreign-key constraints:
    "claim_id_fk" FOREIGN KEY (claim_id) REFERENCES claims(id)
    "property_fk" FOREIGN KEY (property) REFERENCES entities(id)
You can see how these tables are intended to be joined by looking at PK/FK conditions.


I also created a bunch of helper tables to reduce the speed of some queries. For instance, labels_en includes all labels in english only. claims_dt_is_item includes all claims such that the datatype of the object is
another wikidata entity.


If you're wondering (like me) what a snak is, go to snak's wikidata page :)


A note on qualifiers

Qualifiers are, in essence, properties of triplets. E.g, you can have something like
((Ernest Hemingway, residence, Key West), start-time, 1931), ((Ernest Hemingway, residence, Key West), end-time, 1939), etc.
Details on datatypes and datavalues

Now, depending on the datatype, we can have either a datavalue_string column populated, or datavalue_entity populated.
This applies to both claims and qualifiers.
This data model is slightly different from usual KG, in which all there is to a KG is just a set of triplets corresponding to relations between entities. One of the differences is that subject is always an entity, predicate is always a property, but object can be one of datatypes. The datatype is specified in property.value column. (I created a MV called properties to be a subset of entities consisting only of properties.) So a property can only have 1 datatype (i.e datatype of the object).
There are 17 datatypes:
> select value, count(*) from properties group by value order by count desc;

       value       | count
-------------------+-------
 external-id       |  7158
 wikibase-item     |  1548
 quantity          |   633
 string            |   322
 url               |    87
 commonsMedia      |    75
 time              |    62
 monolingualtext   |    59
 math              |    34
 wikibase-property |    19
 wikibase-sense    |    16
 wikibase-lexeme   |    14
 globe-coordinate  |    12
 musical-notation  |     6
 wikibase-form     |     6
 tabular-data      |     6
 geo-shape         |     3
(17 rows)
The way columns of claims table are populated depends on the datatype field of a particular claim.
Most of the properties have datatype external-id, which points to some external identifier (e.g, with Hemingway example from the top of the doc, about 2/3 of the web page is actually devoted to external ID properties, which are, for examples, Hemingway's id numbers in various libraries in the world, different catalogues etc.).
Below are a couple of examples on how the tuples in claims will look like for different datatypes.

Example 1: datatype = 'wikibase-item'. This is the only case in which datavalue_entity will be populated, and will
be equal to the id of the underlying item. E.g:

id = 'Q31$bad6229c-40b6-86fc-5a1c-774ba4651a25'
...
property = 'P361'
datavalue_string = ''
datavalue_entity = 'Q458'
datavalue_date = 'nodate'
datavalue_type = 'wikibase-entityid'
datatype = 'wikibase-item'

Example 2: datatype = 'quantity'. In this case, the claim structure will look like below. The datavalue_string field is essentially a dictionary of arbitrary fields,
depending on the datatype. Even within the datatype the fields can differ (explore the data for more). Note that, for instance, field unit can be either as simple as just string "1" (e.g when talking about population count), or refer to other Wikidate entities by link, e.g "http://www.wikidata.org/entity/Q24564698", which stands for "years old".

id = 'Q31$169679CA-2C2E-4FE5-83FD-6AC1E0F19E47'
...
property = 'P2046'
datavalue_string = '{"amount"=>"+30528", "unit"=>"http://www.wikidata.org/entity/Q712226", "upperBound"=>"+30529", "lowerBound"=>"+30527"}'
datavalue_entity = ''
datavalue_date = 'nodate'
datavalue_type = 'quantity'
datatype = 'quantity'

Example 3: datatype = 'url'.

id = 'Q2013$C3FCE8A5-D256-489F-8CEB-A9237975E787'
...
property = 'P2699'
datavalue_string = 'https://www.wikidata.org/wiki/Wikidata:Main_Page'
datavalue_entity = ''
datavalue_date = 'nodate'
datavalue_type = 'string'
datatype = 'url'
Example queries

Nepotism

I manually constructed and ran the nepotism query: find all recipients of awards that were given by an organization whose
president is/was a father/mother to the recipient. The query is given below -- and it finds 5 tuples!
A quick explanation of the query: I use 4 properties: P166 (award recieved), P1027 (conferred by), P488 (chairperson), P40 (child), and create a 4-cycle out of them. The remainder of the query is just getting the labels of the entities' ids for readability.
There's a slight problem though. Of course, we don't check that the timestamp when the award was given falls indeed
on the predicency of our particular person. In fact, this is not always the case. Once we have the specific query,
it is relatively easy to also add necessary qualifiers (they are present in the data) to check the validiy of the timestamps.
Since the qualifiers properties are not given as separate fields, but so far encoded as a string, it's not trivial to write
an efficient query. I'm thinking of ways around it.
WITH nepotism as 
	(SELECT c1.entity_id as recipient, c2.entity_id as award, c3.entity_id as org, c4.entity_id as chairsperson 
	FROM claims c1, claims c2, claims c3, claims c4
	WHERE c1.property='P166' and c2.property='P1027' and c3.property = 'P488' and c4.property='P40'
	AND c1.datavalue_entity = c2.entity_id AND c2.datavalue_entity = c3.entity_id 
	AND c3.datavalue_entity = c4.entity_id AND c4.datavalue_entity = c1.entity_id
	)
SELECT 
	n.*,
	l4.value as chairsperson_val ,
	l2.value as award_val,
	l3.value as org_val,
	l1.value as recipient_val
FROM nepotism n
LEFT JOIN labels_en l4 ON l4.id = n.chairsperson 
LEFT JOIN labels_en l2 ON l2.id = n.award
LEFT JOIN labels_en l3 ON l3.id = n.org
LEFT JOIN labels_en l1 ON l1.id = n.recipient;
Result:
 recipient  |   award    |   org    | chairsperson |  chairsperson_val   |                 award_val                 |             org_val              |    recipient_val
------------+------------+----------+--------------+---------------------+-------------------------------------------+----------------------------------+---------------------
 Q334030    | Q28835     | Q828490  | Q982534      | Jacques-Louis Lions | Fields medal                              | International Mathematical Union | Pierre-Louis Lions
 Q30320396  | Q65372649  | Q2822319 | Q35162257    | Marcel Gouron       | Associate Member of the Académie de Nîmes | Académie de Nîmes                | André Gouron
 Q2149844   | Q18600746  | Q641439  | Q2650268     | Alois Hundhammer    | Bayerische Verfassungsmedaille in silver  | Landtag of Bavaria               | Richard Hundhammer
 Q47005313  | Q50845443  | Q2822319 | Q40236025    | Robert Chamboredon  | prix Sydney-Forado                        | Académie de Nîmes                | David Chamboredon
 Q102291682 | Q102294596 | Q188771  | Q177709      | Hubert Curien       |                                           | French Academy of Sciences       | Pierre-Louis Curien
(5 rows)
References & acknowledgements


I used https://topicseed.com/blog/importing-wikidata-dumps/ for some inspiration/help on loading wikidata.
I used https://users.dcc.uchile.cl/~dhernand/wquery/#translate-the-data-to-the-sql-data-model, and in particular their github repo to parse the json dump and turn it into nice csvs. I mainly used script in postgresql-experiment-scripts/load-data called migrador.rb. The code is from 2016, but by looking at what the code does and at the format of the JSON, it looks like the format hasn't changed much since then. There might be better ways of doing it, but it'll require a separate time investment.

Some known oddities


There are sometimes strange claims that look like this. I.e, datatype is url, but datavalue_type is wikibase-entityid, and the entity is missing (novalue), as well as
datavalue_string is novalue. According to the website, there are "unknown values" sometimes (go to this entity and search for "official website", which is P856).

entity_id        | Q50376871
id               | Q50376871$341f9e2a-470a-ba01-4ddb-cb7e05de4873
type             | statement
rank             | normal
snaktype         | somevalue
property         | P856
datavalue_string | novalue
datavalue_entity |
datavalue_date   | nodate
datavalue_type   | wikibase-entityid
datatype         | url