Table of contents:
- Wikidata main page, and help page. I recommend getting familiar with at least a couple of sections from the help page before reading this document.
- Example entity: https://www.wikidata.org/wiki/Q23434. From here you can easily navigate to other entities, properties, and external pages to get a sense of what is going on.
- Wikidata has a great interface for querying the KG in SPARQL + a bunch of example queries which you can easily try: https://query.wikidata.org/. Wikidata is more than just a set of triplets, which is how we usually think of a KG. Thus, SPARQL seems like a convenient tool to use with the Wikidata KG. However, I think having a SQL interface does help too.
Note: there's a wikidata data model page on its data model. The page says it is outdated, but it is still quite relevant for most parts. However, our tables are a subset of what wikidata has to offer.
The database has the following tables: entities
, labels
, claims
, and qualifiers
. We are not using descriptions, aliases, and references for now.
-
Table
entities
consists of all items and properties. Two possible values in the type column are thus eitheritem
orproperty
. Value column contains distinct values for properties, and is empty for items. It is somewhat unnatural that items and properties are in the same relation here, but this agrees with wikidata data model. Take a look at all wikidata properties online. In the dump I used, there are 10060 properties, and about 93 million items. Below you'll see I created a separate MV for properties. -
Table
labels
consists of labels for each property and item. There might be multiple labels for a single item/property in multiple languages. There's at most one english label per entity (others are in aliases) Column labels.id rerefences entities.id. -
Table
claims
is the largest relation and stores all triplets (relations) for our KG. It has ~1.379 billion rows. -
Table
qualifiers
consists of all qualifiers to claims, i.e properties of triplets. There are ~313 million rows in this table.
The schema is shown below. Some columns in the claims
and qualifiers
columns will most likely be rarely used, but they're there for the sake of completeness.
\d entities
Column | Type | Collation | Nullable | Default
--------+------------------------+-----------+----------+---------
id | character varying(255) | | not null |
type | text | | |
value | text | | |
Indexes:
"entities_pk" PRIMARY KEY, btree (id)
Referenced by:
TABLE "labels" CONSTRAINT "entities_fk" FOREIGN KEY (id) REFERENCES entities(id)
TABLE "claims" CONSTRAINT "entity_id_fk" FOREIGN KEY (entity_id) REFERENCES entities(id)
TABLE "claims" CONSTRAINT "property_fk" FOREIGN KEY (property) REFERENCES entities(id)
TABLE "qualifiers" CONSTRAINT "property_fk" FOREIGN KEY (property) REFERENCES entities(id)
\d labels
Column | Type | Collation | Nullable | Default
----------+-----------------------+-----------+----------+---------
id | character varying(64) | | |
language | character varying(16) | | |
value | text | | |
Indexes:
"labels_id" btree (id)
"labels_language" btree (language)
Foreign-key constraints:
"entities_fk" FOREIGN KEY (id) REFERENCES entities(id)
> \d claims
Column | Type | Collation | Nullable | Default
------------------+------------------------+-----------+----------+---------
entity_id | character varying(64) | | |
id | character varying(255) | | not null |
type | text | | |
rank | text | | |
snaktype | text | | |
property | text | | |
datavalue_string | text | | |
datavalue_entity | text | | |
datavalue_date | text | | |
datavalue_type | text | | |
datatype | text | | |
Indexes:
"claims_pk" PRIMARY KEY, btree (id)
"claims_datavalue_entity" btree (datavalue_entity)
"claims_entity_id" btree (entity_id)
"claims_property" btree (property)
Foreign-key constraints:
"entity_id_fk" FOREIGN KEY (entity_id) REFERENCES entities(id)
"property_fk" FOREIGN KEY (property) REFERENCES entities(id)
Referenced by:
TABLE "qualifiers" CONSTRAINT "claim_id_fk" FOREIGN KEY (claim_id) REFERENCES claims(id)
> \d qualifiers
Column | Type | Collation | Nullable | Default
--------------------+------------------------+-----------+----------+---------
claim_id | character varying(255) | | |
property | text | | |
hash | text | | |
snaktype | text | | |
qualifier_property | text | | |
datavalue_string | text | | |
datavalue_entity | text | | |
datavalue_date | text | | |
datavalue_type | text | | |
datatype | text | | |
order_hash | text | | |
Indexes:
"qualifiers_datavalue_entity" btree (datavalue_entity)
"qualifiers_idx_claim_id" btree (claim_id)
"qualifiers_property" btree (property)
Foreign-key constraints:
"claim_id_fk" FOREIGN KEY (claim_id) REFERENCES claims(id)
"property_fk" FOREIGN KEY (property) REFERENCES entities(id)
You can see how these tables are intended to be joined by looking at PK/FK conditions.
-
I also created a bunch of helper tables to reduce the speed of some queries. For instance,
labels_en
includes all labels in english only.claims_dt_is_item
includes all claims such that the datatype of the object is another wikidata entity. -
If you're wondering (like me) what a snak is, go to snak's wikidata page :)
Qualifiers are, in essence, properties of triplets. E.g, you can have something like
((Ernest Hemingway, residence, Key West), start-time, 1931)
, ((Ernest Hemingway, residence, Key West), end-time, 1939)
, etc.
Now, depending on the datatype, we can have either a datavalue_string column populated, or datavalue_entity populated.
This applies to both claims
and qualifiers
.
This data model is slightly different from usual KG, in which all there is to a KG is just a set of triplets corresponding to relations between entities. One of the differences is that subject is always an entity
, predicate is always a property
, but object can be one of datatypes. The datatype is specified in property.value column. (I created a MV called properties
to be a subset of entities
consisting only of properties.) So a property can only have 1 datatype (i.e datatype of the object).
There are 17 datatypes:
> select value, count(*) from properties group by value order by count desc;
value | count
-------------------+-------
external-id | 7158
wikibase-item | 1548
quantity | 633
string | 322
url | 87
commonsMedia | 75
time | 62
monolingualtext | 59
math | 34
wikibase-property | 19
wikibase-sense | 16
wikibase-lexeme | 14
globe-coordinate | 12
musical-notation | 6
wikibase-form | 6
tabular-data | 6
geo-shape | 3
(17 rows)
The way columns of claims
table are populated depends on the datatype field of a particular claim.
Most of the properties have datatype external-id, which points to some external identifier (e.g, with Hemingway example from the top of the doc, about 2/3 of the web page is actually devoted to external ID properties, which are, for examples, Hemingway's id numbers in various libraries in the world, different catalogues etc.).
Below are a couple of examples on how the tuples in claims
will look like for different datatypes.
- Example 1: datatype = 'wikibase-item'. This is the only case in which datavalue_entity will be populated, and will be equal to the id of the underlying item. E.g:
id = 'Q31$bad6229c-40b6-86fc-5a1c-774ba4651a25'
...
property = 'P361'
datavalue_string = ''
datavalue_entity = 'Q458'
datavalue_date = 'nodate'
datavalue_type = 'wikibase-entityid'
datatype = 'wikibase-item'
- Example 2: datatype = 'quantity'. In this case, the claim structure will look like below. The datavalue_string field is essentially a dictionary of arbitrary fields, depending on the datatype. Even within the datatype the fields can differ (explore the data for more). Note that, for instance, field unit can be either as simple as just string "1" (e.g when talking about population count), or refer to other Wikidate entities by link, e.g "http://www.wikidata.org/entity/Q24564698", which stands for "years old".
id = 'Q31$169679CA-2C2E-4FE5-83FD-6AC1E0F19E47'
...
property = 'P2046'
datavalue_string = '{"amount"=>"+30528", "unit"=>"http://www.wikidata.org/entity/Q712226", "upperBound"=>"+30529", "lowerBound"=>"+30527"}'
datavalue_entity = ''
datavalue_date = 'nodate'
datavalue_type = 'quantity'
datatype = 'quantity'
- Example 3: datatype = 'url'.
id = 'Q2013$C3FCE8A5-D256-489F-8CEB-A9237975E787'
...
property = 'P2699'
datavalue_string = 'https://www.wikidata.org/wiki/Wikidata:Main_Page'
datavalue_entity = ''
datavalue_date = 'nodate'
datavalue_type = 'string'
datatype = 'url'
I manually constructed and ran the nepotism query: find all recipients of awards that were given by an organization whose president is/was a father/mother to the recipient. The query is given below -- and it finds 5 tuples!
A quick explanation of the query: I use 4 properties: P166 (award recieved), P1027 (conferred by), P488 (chairperson), P40 (child), and create a 4-cycle out of them. The remainder of the query is just getting the labels of the entities' ids for readability.
There's a slight problem though. Of course, we don't check that the timestamp when the award was given falls indeed on the predicency of our particular person. In fact, this is not always the case. Once we have the specific query, it is relatively easy to also add necessary qualifiers (they are present in the data) to check the validiy of the timestamps. Since the qualifiers properties are not given as separate fields, but so far encoded as a string, it's not trivial to write an efficient query. I'm thinking of ways around it.
WITH nepotism as
(SELECT c1.entity_id as recipient, c2.entity_id as award, c3.entity_id as org, c4.entity_id as chairsperson
FROM claims c1, claims c2, claims c3, claims c4
WHERE c1.property='P166' and c2.property='P1027' and c3.property = 'P488' and c4.property='P40'
AND c1.datavalue_entity = c2.entity_id AND c2.datavalue_entity = c3.entity_id
AND c3.datavalue_entity = c4.entity_id AND c4.datavalue_entity = c1.entity_id
)
SELECT
n.*,
l4.value as chairsperson_val ,
l2.value as award_val,
l3.value as org_val,
l1.value as recipient_val
FROM nepotism n
LEFT JOIN labels_en l4 ON l4.id = n.chairsperson
LEFT JOIN labels_en l2 ON l2.id = n.award
LEFT JOIN labels_en l3 ON l3.id = n.org
LEFT JOIN labels_en l1 ON l1.id = n.recipient;
Result:
recipient | award | org | chairsperson | chairsperson_val | award_val | org_val | recipient_val
------------+------------+----------+--------------+---------------------+-------------------------------------------+----------------------------------+---------------------
Q334030 | Q28835 | Q828490 | Q982534 | Jacques-Louis Lions | Fields medal | International Mathematical Union | Pierre-Louis Lions
Q30320396 | Q65372649 | Q2822319 | Q35162257 | Marcel Gouron | Associate Member of the Académie de Nîmes | Académie de Nîmes | André Gouron
Q2149844 | Q18600746 | Q641439 | Q2650268 | Alois Hundhammer | Bayerische Verfassungsmedaille in silver | Landtag of Bavaria | Richard Hundhammer
Q47005313 | Q50845443 | Q2822319 | Q40236025 | Robert Chamboredon | prix Sydney-Forado | Académie de Nîmes | David Chamboredon
Q102291682 | Q102294596 | Q188771 | Q177709 | Hubert Curien | | French Academy of Sciences | Pierre-Louis Curien
(5 rows)
- I used https://topicseed.com/blog/importing-wikidata-dumps/ for some inspiration/help on loading wikidata.
- I used https://users.dcc.uchile.cl/~dhernand/wquery/#translate-the-data-to-the-sql-data-model, and in particular their github repo to parse the json dump and turn it into nice csvs. I mainly used script in
postgresql-experiment-scripts/load-data
calledmigrador.rb
. The code is from 2016, but by looking at what the code does and at the format of the JSON, it looks like the format hasn't changed much since then. There might be better ways of doing it, but it'll require a separate time investment.
- There are sometimes strange claims that look like this. I.e, datatype is url, but datavalue_type is wikibase-entityid, and the entity is missing (novalue), as well as datavalue_string is novalue. According to the website, there are "unknown values" sometimes (go to this entity and search for "official website", which is P856).
entity_id | Q50376871
id | Q50376871$341f9e2a-470a-ba01-4ddb-cb7e05de4873
type | statement
rank | normal
snaktype | somevalue
property | P856
datavalue_string | novalue
datavalue_entity |
datavalue_date | nodate
datavalue_type | wikibase-entityid
datatype | url