@AbdealiLoKo
Created September 24, 2016 10:24
Wikimedia Hackathon - BITS Pilani Hyderabad Campus
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# SPARQL as a Wikidata Pywikibot generator"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In Wikidata, complex queries can be performed because the data is stored in a structured way. [**SPARQL**](https://en.wikipedia.org/wiki/SPARQL) is the query language used by the Wikibase technology (which drives Wikidata).\n",
"\n",
"**SPARQL** is meant for querying data stored as tuples, which is exactly how Wikidata stores its data: as (item, property, value) statements. More generally, it is a query language for RDF. [**RDF**](https://en.wikipedia.org/wiki/Resource_Description_Framework) (Resource Description Framework) is a W3C specification for modelling metadata as graphs, i.e. it specifies a way to describe relationships between resources.\n",
"\n",
"To run and test SPARQL queries on Wikidata, a query service is available at https://query.wikidata.org. Use it while going through this tutorial."
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# 1. Turtle\n",
"The basic building block of a SPARQL query is an RDF triple, written in a syntax based on [RDF/Turtle](<https://en.wikipedia.org/wiki/Turtle_(syntax)>). Turtle stands for \"Terse RDF Triple Language\". A triple is a 3-tuple whose items represent a subject, a predicate and an object. In Wikidata, we would say the three items are subject, property and value.\n",
"\n",
"For example, in Wikidata, we can write the following triples:\n",
" - [Python (Q28865)](<https://www.wikidata.org/wiki/Q28865>), [official site (P856)](<https://www.wikidata.org/wiki/Property:P856>), http://www.python.org\n",
" - [Douglas Adams (Q42)](<https://www.wikidata.org/wiki/Q42>), [instance of (P31)](<https://www.wikidata.org/wiki/Property:P31>), [human (Q5)](<https://www.wikidata.org/wiki/Q5>)\n",
"\n",
"In Wikidata, SPARQL queries come with some predefined prefixes. The two most common ones are `wdt:` and `wd:`:\n",
" - `wdt:` - The `wdt` prefix is used for a property. For example, `wdt:P856` refers to the P856 property in Wikidata.\n",
" - `wd:` - The `wd` prefix is used for entities or items. For example, `wd:Q42` refers to the Q42 item in Wikidata.\n",
"\n",
"Other prefixes can be defined using the `PREFIX` keyword. On the Wikidata query service, you can simply consider the following two lines to be added to every query by default:\n",
"\n",
"    PREFIX wd: <http://www.wikidata.org/entity/>\n",
"    PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n",
"\n",
"\n",
"Hence, the triples mentioned above are written as follows in SPARQL:\n",
" - [Python (Q28865)](<https://www.wikidata.org/wiki/Q28865>), [official site (P856)](<https://www.wikidata.org/wiki/Property:P856>), http://www.python.org -> `wd:Q28865 wdt:P856 <http://www.python.org> .`\n",
" - [Douglas Adams (Q42)](<https://www.wikidata.org/wiki/Q42>), [instance of (P31)](<https://www.wikidata.org/wiki/Property:P31>), [human (Q5)](<https://www.wikidata.org/wiki/Q5>) -> `wd:Q42 wdt:P31 wd:Q5 .`\n",
"\n",
"The standard prefixes used by Wikidata are:\n",
"\n",
"    PREFIX wd: <http://www.wikidata.org/entity/>\n",
"    PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n",
"    PREFIX wikibase: <http://wikiba.se/ontology#>\n",
"    PREFIX p: <http://www.wikidata.org/prop/>\n",
"    PREFIX ps: <http://www.wikidata.org/prop/statement/>\n",
"    PREFIX pq: <http://www.wikidata.org/prop/qualifier/>\n",
"    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>"
]
},
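{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the role of the prefixes concrete, here is a minimal sketch in plain Python of how a prefixed name expands to a full IRI. The `PREFIXES` dictionary and the `expand` helper are hypothetical names used only for illustration; they are not part of Pywikibot or of the query service.\n",
"\n",
"```python\n",
"# Illustration only: the two namespace IRIs are the ones listed above.\n",
"PREFIXES = {\n",
"    'wd': 'http://www.wikidata.org/entity/',\n",
"    'wdt': 'http://www.wikidata.org/prop/direct/',\n",
"}\n",
"\n",
"\n",
"def expand(term):\n",
"    # Split 'wd:Q42' into the prefix 'wd' and the local name 'Q42',\n",
"    # then prepend the namespace IRI registered for that prefix.\n",
"    prefix, _, local = term.partition(':')\n",
"    return PREFIXES[prefix] + local\n",
"\n",
"\n",
"# The Douglas Adams triple from the list above.\n",
"triple = ('wd:Q42', 'wdt:P31', 'wd:Q5')\n",
"for term in triple:\n",
"    print(term, '->', expand(term))\n",
"```\n",
"\n",
"Each prefixed name in the triple becomes a full IRI under the corresponding namespace, which is all a prefix declaration does."
]
},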
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. Writing a simple SPARQL query\n",
"Using triples, we can define a basic query which fetches all items with a specific property value. The syntax for this is:\n",
"\n",
"    SELECT ?item WHERE { ?item wdt:P31 wd:Q5 . } LIMIT 100\n",
"\n",
"Here `?item` is a variable. The query above means \"Return all items which are an instance of human, limited to 100 items\". Let us try fetching this data in Pywikibot:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pywikibot\n",
"import pywikibot.pagegenerators as pagegen\n",
"from pprint import pprint\n",
"\n",
"wikidata = pywikibot.Site(\"wikidata\", \"wikidata\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"human_list = list(pagegen.WikidataSPARQLPageGenerator(\"SELECT ?item WHERE { ?item wdt:P31 wd:Q5 . } LIMIT 5\", site=wikidata))\n",
"pprint(human_list)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"for human in human_list:\n",
"    print(human, human.get()['labels']['en'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `pywikibot.pagegenerators.WikidataSPARQLPageGenerator` function is restricted: it only accepts queries whose results are Wikidata items, which it returns as `ItemPage` objects. It also expects the variable name to be `?item`. SPARQL itself is considerably more flexible, as it can generate different types of output.\n",
"\n",
"For example, try the following query which should list all the places [Douglas Adams (Q42)](https://www.wikidata.org/wiki/Q42) was [educated at (P69)](https://www.wikidata.org/wiki/Property:P69):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pprint(list(pagegen.WikidataSPARQLPageGenerator(\"SELECT ?val WHERE { wd:Q42 wdt:P69 ?val . } LIMIT 5\", site=wikidata)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This raises a `KeyError` saying that `item` was not found, because the query names its variable `?val` rather than `?item`. Running the same query on https://query.wikidata.org gives the appropriate result. Run the next code block and click the \"Run\" button to see the query:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from IPython.display import IFrame\n",
"IFrame('https://query.wikidata.org/#SELECT ?val WHERE { wd:Q42 wdt:P69 ?val . } LIMIT 5', width=\"100%\", height=\"400px\")"
]
},
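{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the page generator looks for a variable called `?item`, one way around the `KeyError` is to give the unknown value that name. The sketch below only builds the query string; `item_query` is a hypothetical helper introduced for illustration, and it assumes the values of the chosen property are themselves Wikidata items.\n",
"\n",
"```python\n",
"def item_query(subject, prop, limit=5):\n",
"    # Name the unknown value ?item, the variable the generator expects.\n",
"    template = 'SELECT ?item WHERE { wd:%s wdt:%s ?item . } LIMIT %d'\n",
"    return template % (subject, prop, limit)\n",
"\n",
"\n",
"query = item_query('Q42', 'P69')\n",
"print(query)\n",
"\n",
"# With the objects created in the cells above, the query could be run as:\n",
"# pprint(list(pagegen.WikidataSPARQLPageGenerator(query, site=wikidata)))\n",
"```\n",
"\n",
"The triple pattern is unchanged; only the variable name differs from the failing query."
]
},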
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. Running generic SPARQL queries in Pywikibot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pywikibot can also be used to run any generic SPARQL query using the `SparqlQuery` class:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from pywikibot.data.sparql import SparqlQuery\n",
"\n",
"wikiquery = SparqlQuery()\n",
"wikiquery.query('SELECT ?val WHERE { wd:Q42 wdt:P69 ?val . } LIMIT 5')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result returned by `SparqlQuery.query()` is fairly raw: it is just the query result converted to JSON. Hence the Pywikibot API using `ItemPage` and `Claim` is normally an easier way to get data from the pages after creating the appropriate page generator.\n",
"\n",
"If you're sure that the query is a SELECT query, then the `.select()` function is a much cleaner way to get the data, as it parses the JSON and sanitizes it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"wikiquery.select('SELECT ?val WHERE { wd:Q42 wdt:P69 ?val . } LIMIT 5')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But the data here still contains the URL given by RDF rather than an `ItemPage`, hence it is rather limited in functionality."
]
},
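{
"cell_type": "markdown",
"metadata": {},
"source": [
"If an `ItemPage` is needed after all, the entity URL can be reduced to its item ID first. This is a minimal sketch, assuming each row returned by `.select()` maps a variable name to an entity URL string; `qid_from_uri` is a hypothetical helper, and the sample row is made up for illustration rather than taken from a real query result.\n",
"\n",
"```python\n",
"def qid_from_uri(uri):\n",
"    # Keep only the part after the last slash:\n",
"    # 'http://www.wikidata.org/entity/Q5' -> 'Q5'\n",
"    return uri.rsplit('/', 1)[-1]\n",
"\n",
"\n",
"# Made-up row in the assumed shape of a .select() result.\n",
"rows = [{'val': 'http://www.wikidata.org/entity/Q5'}]\n",
"\n",
"for row in rows:\n",
"    qid = qid_from_uri(row['val'])\n",
"    print(qid)\n",
"    # With the site object created earlier, an item page could be built as:\n",
"    # page = pywikibot.ItemPage(wikidata, qid)\n",
"```\n",
"\n",
"This only works for values that are Wikidata entities; literal values such as strings or dates have no item ID to extract."
]
},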
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Resources\n",
"For a more elaborate guide on SPARQL, check out https://commons.wikimedia.org/wiki/File:Wikidata%27s_SPARQL_introduction_presentation.pdf\n",
"\n",
"For the complete guide to Wikidata's SPARQL, check out https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries\n",
"\n",
"Also, check out the example queries in https://www.mediawiki.org/wiki/Wikibase/Indexing/SPARQL_Query_Examples#Cats and https://query.wikidata.org/ to understand more complex queries."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}