@psychemedia
Last active August 29, 2015 14:04
Quick test of British National Bibliography Linked Open Data Endpoint
{
"metadata": {
"name": "",
"signature": "sha256:6b3600fad6085de44a2ae5d36de4f0765f829ed6bfb2fa7b7322727d1ce98bd9"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"British National Bibliography - Linked Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The [British National Bibliography (BNB) Linked Data Platform](http://bnb.data.bl.uk/) provides access to the British National Bibliography, a comprehensive database detailing books and serials published in the UK since 1950.\n",
"\n",
"This notebook shows how to start constructing SPARQL queries over this service and how to parse the returned data into a *pandas* DataFrame.\n",
"\n",
"The SPARQL endpoint for the service can be found at: [http://bnb.data.bl.uk/sparql](http://bnb.data.bl.uk/sparql)\n",
"\n",
"A [bulk data download](http://www.bl.uk/bibliographic/download.html#lodbnb) of the Linked Data is also available.\n",
"\n",
"Example queries can be found here:\n",
"\n",
"- [Getting Started with the BNB as Linked Open Data](http://bnb.data.bl.uk/getting-started)\n",
"- [BNB example queries](http://ldodds.github.io/bnb-queries/index.html) [Leigh Dodds]"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Install a library to help us run some SPARQL queries if we haven't already installed it\n",
"#http://rdflib.github.io/sparqlwrapper/\n",
"!pip3 uninstall -y sparqlwrapper\n",
"!pip3 install sparqlwrapper\n",
"\n",
"#NOTE: if you find the SPARQL queries slowing down, or throwing an error message, try the following:\n",
"## 1) Save your notebook.\n",
"## 2) Close it.\n",
"## 3) Shut it down.\n",
"## 4) Restart the notebook.\n",
"#This should reset sparqlwrapper.\n",
"#You will need to run the cells again to load packages, reset state etc, because you will have started a new IPython process."
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Import the necessary packages\n",
"from SPARQLWrapper import SPARQLWrapper, JSON"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Declare the BNB endpoint\n",
"endpoint=\"http://bnb.data.bl.uk/sparql\"\n",
"sparql = SPARQLWrapper(endpoint)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#My experience of SPARQL is that things work then they don't and you have no idea which bit is broken\n",
"#This test should work. It really should. It has before. And it shouldn't take too long.\n",
"#It comes from http://bnb.data.bl.uk/getting-started\n",
"q='''PREFIX bibo: <http://purl.org/ontology/bibo/>\n",
"PREFIX bio: <http://purl.org/vocab/bio/0.1/>\n",
"PREFIX blt: <http://www.bl.uk/schemas/bibliographic/blterms#>\n",
"PREFIX dct: <http://purl.org/dc/terms/>\n",
"PREFIX event: <http://purl.org/NET/c4dm/event.owl#>\n",
"PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n",
"PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>\n",
"PREFIX isbd: <http://iflastandards.info/ns/isbd/elements/>\n",
"PREFIX org: <http://www.w3.org/ns/org#>\n",
"PREFIX owl: <http://www.w3.org/2002/07/owl#>\n",
"PREFIX rda: <http://rdvocab.info/ElementsGr2/>\n",
"PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n",
"PREFIX void: <http://rdfs.org/ns/void#>\n",
"PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\n",
"\n",
"SELECT ?book ?bnb ?title WHERE {\n",
" #Match the book by ISBN\n",
" ?book bibo:isbn13 \"9780729408745\";\n",
" #bind some variables to its other attributes\n",
" blt:bnb ?bnb;\n",
" dct:title ?title. }'''\n",
"sparql.setQuery(q)\n",
"sparql.setReturnFormat(JSON)\n",
"results = sparql.query().convert()\n",
"results"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Assuming that the above test - using a provided example query - works, we can start to construct our own queries.\n",
"\n",
"We'll follow the suggested approach of building up a comprehensive prefix statement that we can make use of in any query to the endpoint. If we come across further useful prefixes, we can add them in."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Declare a standard, if exhaustive, list of prefixes we can apply to each query\n",
"#Don't leave white space on the left hand side...\n",
"prefix='''\n",
"PREFIX bibo: <http://purl.org/ontology/bibo/>\n",
"PREFIX bio: <http://purl.org/vocab/bio/0.1/>\n",
"PREFIX blt: <http://www.bl.uk/schemas/bibliographic/blterms#>\n",
"PREFIX dct: <http://purl.org/dc/terms/>\n",
"PREFIX event: <http://purl.org/NET/c4dm/event.owl#>\n",
"PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n",
"PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>\n",
"PREFIX isbd: <http://iflastandards.info/ns/isbd/elements/>\n",
"PREFIX org: <http://www.w3.org/ns/org#>\n",
"PREFIX owl: <http://www.w3.org/2002/07/owl#>\n",
"PREFIX rda: <http://rdvocab.info/ElementsGr2/>\n",
"PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n",
"PREFIX void: <http://rdfs.org/ns/void#>\n",
"PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\n",
" \n",
"'''"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reviewing some of the other example queries, we can identify some useful query fragments.\n",
"\n",
"For example, we note that a book may have a creator, and a title; and that a creator may have a name. If we piece these together, we should be able to get the titles of books created by a particular person."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Let's just test a simple query\n",
"#Search for books by author name\n",
"q='''\n",
"SELECT DISTINCT ?book ?title WHERE {\n",
" ?book dct:creator ?author ;\n",
" dct:title ?title.\n",
" ?author foaf:name \"Iain Banks\".\n",
"} LIMIT 5\n",
"'''"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Run the query, parse the response as JSON, and get them into a variable\n",
"sparql.setQuery(prefix+q)\n",
"sparql.setReturnFormat(JSON)\n",
"results = sparql.query().convert()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The response to a `SELECT` query comes back as a set of variable bindings, one per result row, represented using JSON."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Here's what the response looks like\n",
"results"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Let's specify the response columns we want to display\n",
"answerCols=['book','title']"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#We can then iterate through these\n",
"for result in results[\"results\"][\"bindings\"]:\n",
"    for ans in answerCols:\n",
"        print(result[ans]['value'], end=\" \")\n",
"    print()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Let's make a function to handle that a little more tidily\n",
"def printResults(results,ansCols):\n",
"    ''' Print the required results column values from the SPARQL query '''\n",
"    for result in results[\"results\"][\"bindings\"]:\n",
"        #Use the ansCols argument rather than the global answerCols list\n",
"        for ans in ansCols:\n",
"            print(result[ans]['value'], end=\" \")\n",
"        print()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"printResults(results,answerCols)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Let's do a little more wrapping\n",
"def runQuery(endpoint,prefix,q):\n",
"    ''' Run a SPARQL query with a declared prefix over a specified endpoint '''\n",
"    sparql = SPARQLWrapper(endpoint)\n",
"    sparql.setQuery(prefix+q)\n",
"    sparql.setReturnFormat(JSON)\n",
"    return sparql.query().convert()\n",
"\n",
"def queryResults(endpoint,prefix,q,ansCols):\n",
"    ''' Run a SPARQL query with a declared prefix over a specified endpoint and print the required results columns '''\n",
"    results=runQuery(endpoint,prefix,q)\n",
"    printResults(results,ansCols)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"queryResults(endpoint,prefix,q,answerCols)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Let's see what the results look like\n",
"results"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Some endpoints will return data in other formats, for example flattened as a CSV data table\n",
"#We can flatten the data ourselves in an ad hoc way and get it into a pandas DataFrame\n",
"\n",
"import pandas as pd\n",
"\n",
"#pandas may have a better way of doing this?!\n",
"#Note that we lose the type information, which we could have used to type the columns in the final DataFrame\n",
"data=[]\n",
"for result in results[\"results\"][\"bindings\"]:\n",
"    tmp={}\n",
"    for el in result:\n",
"        tmp[el]=result[el]['value']\n",
"    data.append(tmp)\n",
"\n",
"df = pd.DataFrame(data)\n",
"df"
],
"language": "python",
"metadata": {},
"outputs": []
},
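{
"cell_type": "markdown",
"metadata": {},
"source": [
"The flattening step above keeps only the `value` of each binding and so discards any `datatype` the endpoint attached to a literal. As a rough sketch of how that information might be used, the hypothetical helpers `typedValue` and `typedRows` below cast a handful of XSD datatypes to Python types before the DataFrame is built. The list of datatypes is an assumption and far from complete, and the sample response is hand-made rather than real BNB data."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"\n",
"#A sketch only: typedValue and typedRows are hypothetical helpers, not part of SPARQLWrapper or pandas\n",
"XSD='http://www.w3.org/2001/XMLSchema#'\n",
"#Assumption: these few XSD datatypes are the only ones we want to cast\n",
"casts={XSD+'integer':int, XSD+'int':int, XSD+'decimal':float, XSD+'double':float}\n",
"\n",
"def typedValue(binding):\n",
"    ''' Cast the value of one binding if we recognise its datatype '''\n",
"    cast=casts.get(binding.get('datatype'))\n",
"    return cast(binding['value']) if cast else binding['value']\n",
"\n",
"def typedRows(results):\n",
"    ''' Flatten SPARQL JSON results into a list of dicts, casting typed literals '''\n",
"    return [{el:typedValue(row[el]) for el in row} for row in results['results']['bindings']]\n",
"\n",
"#Try it on a small hand-made response, so no endpoint is needed\n",
"sample={'results':{'bindings':[\n",
"    {'title':{'type':'literal','value':'Example'},\n",
"     'pages':{'type':'literal','datatype':XSD+'integer','value':'42'}}]}}\n",
"pd.DataFrame(typedRows(sample)).dtypes"
],
"language": "python",
"metadata": {},
"outputs": []
},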
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Let's wrap everything up\n",
"def dict2df(results):\n",
"    ''' Hack a function to flatten the SPARQL query results and return the column values '''\n",
"    data=[]\n",
"    for result in results[\"results\"][\"bindings\"]:\n",
"        tmp={}\n",
"        for el in result:\n",
"            tmp[el]=result[el]['value']\n",
"        data.append(tmp)\n",
"\n",
"    df = pd.DataFrame(data)\n",
"    return df\n",
"\n",
"def dfResults(endpoint,prefix,q):\n",
"    ''' Generate a data frame containing the results of running\n",
"    a SPARQL query with a declared prefix over a specified endpoint '''\n",
"    return dict2df( runQuery( endpoint, prefix, q ) )"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"dfResults(endpoint,prefix,q)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Learning More About a Resource"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A good way of inspecting the properties associated with a resource is to use the `DESCRIBE` command."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"q='DESCRIBE ?book WHERE { ?book bibo:isbn10 \"1857232356\" }'\n",
"ans=runQuery(endpoint,prefix,q)\n",
"ans"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Trying to run this query gives me the following warning:"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"Format requested was JSON, but RDF/XML (application/rdf+xml;charset=UTF-8) has been returned by the endpoint"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A `DESCRIBE` query returns an RDF graph rather than a table of variable bindings, and the endpoint serves that graph as RDF/XML rather than JSON - so we will have to handle the RDF response ourselves."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we try to serialise this response as a set of N-triples we are presented with a bytestream."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"ans.serialize(format=\"nt\")"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can get a clearer(?!) view by decoding the bytestream as a UTF-8 string and then printing the result."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print(ans.serialize(format=\"nt\").decode(\"utf-8\"))"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#For convenience, let's just bundle that up in case we need to call it again\n",
"def printDesc(endpoint,prefix,q):\n",
"    ans=runQuery(endpoint,prefix,q)\n",
"    print(ans.serialize(format=\"nt\").decode(\"utf-8\"))"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"q='DESCRIBE ?book WHERE { ?book bibo:isbn10 \"1857232356\" }'\n",
"printDesc(endpoint,prefix,q)"
],
"language": "python",
"metadata": {},
"outputs": []
},
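{
"cell_type": "markdown",
"metadata": {},
"source": [
"Printing N-triples is fine for eyeballing a description, but we can also walk over the graph directly. The sketch below assumes the *rdflib* package is installed and uses a tiny hand-made graph about a made-up `example.org` resource rather than a live response. The same loop should work on the object returned by `runQuery()` for a `DESCRIBE` query, provided that object is an rdflib graph."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from rdflib import Graph\n",
"\n",
"#A single made-up triple in N-triples syntax (not real BNB data)\n",
"nt='<http://example.org/book/1> <http://purl.org/dc/terms/title> \"Example title\" .'\n",
"\n",
"g=Graph()\n",
"g.parse(data=nt, format='nt')\n",
"\n",
"#Iterating over a graph yields (subject, predicate, object) triples\n",
"for s,p,o in g:\n",
"    print(p, o)"
],
"language": "python",
"metadata": {},
"outputs": []
},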
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's pull back some information about a book with a given ISBN:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"q='''\n",
"SELECT ?book ?bnb ?publicationEvent ?title ?creator WHERE {\n",
" #Match the book by ISBN\n",
" ?book bibo:isbn10 \"1857232356\";\n",
" \n",
" #bind some variables to other attributes of the work\n",
" \n",
" #Get the British National Bibliography number\n",
" blt:bnb ?bnb;\n",
" \n",
" #Identify the publication event associated with this work\n",
" blt:publication ?publicationEvent;\n",
" \n",
" #Identify the title of the work\n",
" dct:title ?title;\n",
" \n",
" #Identify the creator of the work\n",
" dct:creator ?creator.\n",
" }\n",
"'''\n",
"runQuery(endpoint,prefix,q)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What other sorts of thing are we able to find out about the creator?"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"q='''\n",
"SELECT DISTINCT ?property \n",
"where {\n",
" ?book bibo:isbn10 \"1857232356\";\n",
" dct:creator ?creator.\n",
" ?creator ?property ?x\n",
"}\n",
"'''\n",
"runQuery(endpoint,prefix,q)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Do you think you might be able to use any of this information to add the author's name into the response?"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"q='''\n",
"SELECT ?book ?isbn10 ?bnb ?title ?author WHERE {\n",
" #Match the book by ISBN\n",
" ?book bibo:isbn10 \"1857232356\";\n",
" \n",
" #bind some variables to its other attributes\n",
" blt:bnb ?bnb;\n",
" dct:title ?title;\n",
" bibo:isbn10 ?isbn10;\n",
" \n",
" dct:creator ?creator.\n",
" \n",
" ?creator foaf:name ?author.\n",
" }\n",
"'''\n",
"dfResults(endpoint,prefix,q)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now see what you can find out about the publication event."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#YOUR INVESTIGATION HERE"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's what I found:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"q='''\n",
"SELECT DISTINCT ?a ?b WHERE {\n",
" <http://bnb.data.bl.uk/id/resource/012701972/publicationevent/LondonOrbit1994> ?a ?b.\n",
"}\n",
"'''\n",
"dfResults(endpoint,prefix,q)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What can we learn about the date?"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"q='''\n",
"SELECT DISTINCT ?a ?b WHERE {\n",
" <http://reference.data.gov.uk/id/year/1994> ?a ?b.\n",
"}\n",
"'''\n",
"dfResults(endpoint,prefix,q)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Putting these various pieces together, I should now be able to search for books by Iain Banks published between two dates, using a `FILTER` expression to prune the results.\n",
"\n",
"We can further tidy the presentation by sorting the results by publication date using an `ORDER BY` clause."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"q='''\n",
"SELECT DISTINCT ?book ?title ?date WHERE {\n",
" #Find books by 'Iain Banks':\n",
" ?book dct:creator ?author ;\n",
" dct:title ?title.\n",
" ?author foaf:name \"Iain Banks\".\n",
" \n",
" #Find when they were published:\n",
" ?book blt:publication ?publicationEvent.\n",
" ?publicationEvent event:time ?eventTime.\n",
" ?eventTime rdfs:label ?date.\n",
" \n",
" #Look for books published between 1985 and 1990\n",
" FILTER (?date>=\"1985\" && ?date<\"1990\")\n",
"} ORDER BY ?date \n",
"'''\n",
"dfResults(endpoint,prefix,q)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When you have constructed a useful query, you might consider wrapping it within a reusable function. For example:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def getBooksByAuthorBetweenDates(author,fromDate,toDate):\n",
"    q='''\n",
"    SELECT DISTINCT ?book ?title ?date WHERE {{\n",
"        #Find books by name:\n",
"        ?book dct:creator ?author ;\n",
"            dct:title ?title.\n",
"        ?author foaf:name \"{0}\".\n",
"\n",
"        #Find when they were published:\n",
"        ?book blt:publication ?publicationEvent.\n",
"        ?publicationEvent event:time ?eventTime.\n",
"        ?eventTime rdfs:label ?date.\n",
"\n",
"        #Look for books published between dates\n",
"        FILTER (?date>=\"{1}\" && ?date<=\"{2}\")\n",
"    }} ORDER BY ?date\n",
"    '''.format(author,fromDate,toDate)\n",
"    return dfResults(endpoint,prefix,q)\n",
"\n",
"getBooksByAuthorBetweenDates(\"Terry Pratchett\",\"1985\",\"1987\")"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"What Next?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook has shown how to start working with the British National Bibliography Linked Data platform."
]
}
],
"metadata": {}
}
]
}