For this example I’ve used DBpedia data about European cities, the political parties in their local governments and their ideologies. So big disclaimer, this data is not meant to be complete or more accurate than what can be expected from DBpedia/Wikipedia data. The DBpedia stores data using the RDF model so in order to query it we’ll have to use the SPARQL query language.
Here is the SPARQL query I’ve used to extract the cities and countries and the political parties currently in the local government. The query can be tested on the DBPedia SPARQL endpoint.
select distinct ?party ?city ?cityName ?ctr ?ctrName ?pop
where {
?city a dbo:Location ;
rdfs:label ?cityName ;
dbo:country ?ctr ;
dbo:leaderParty ?party .
FILTER(langMatches(lang(?cityName), "EN"))
?ctr dct:subject dbc:Countries_in_Europe ;
rdfs:label ?ctrName .
FILTER(langMatches(lang(?ctrName), "EN"))
optional { ?city dbo:populationTotal ?pop }
}
And a second SPARQL query that returns for each party, the ideologies they are associated with. Again, according to DBpedia/Wikipedia of the different political parties.
select distinct ?party ?partyName ?ideology ?ideologyName where {
?city a dbo:Location ;
dbo:country ?ctr ;
dbo:leaderParty ?party .
?ctr dct:subject dbc:Countries_in_Europe .
?party dbo:ideology ?ideology ;
rdfs:label ?partyName .
FILTER(langMatches(lang(?partyName), "EN"))
?ideology rdfs:label ?ideologyName
FILTER(langMatches(lang(?ideologyName), "EN"))
}
SPARQL queries of type SELECT return a set of variable bindings, or in other words, a tabular structure that the endpoint can serialise as CSV. This is quite convenient because it can be used directly by the LOAD CSV instruction in Cypher.
Here are the two Cypher statements that create the model in Neo4j. You may also want to create an index on city nodes first to get better performance:
CREATE INDEX ON :City(city_uri))
WITH "http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=select+distinct+%3Fparty+%3Fcity+%3FcityName+%3Fctr+%3FctrName+%3Fpop+where+%7B%0D%0A%3Fcity+a+dbo%3ALocation+%3B%0D%0A++++++dbo%3Acountry+%3Fctr+%3B%0D%0A++++++dbo%3AleaderParty+%3Fparty+.%0D%0A%0D%0A%3Fctr+dct%3Asubject+dbc%3ACountries_in_Europe+.%0D%0A%0D%0A%3Fparty+dbo%3Aideology+%3Fideology+.%0D%0A%0D%0Aoptional+%7B+%3Fcity+dbo%3ApopulationTotal+%3Fpop+%7D%0D%0A%0D%0A%3Fcity+rdfs%3Alabel+%3FcityName%0D%0AFILTER%28langMatches%28lang%28%3FcityName%29%2C+%22EN%22%29%29%0D%0A%0D%0A%3Fctr+rdfs%3Alabel+%3FctrName%0D%0AFILTER%28langMatches%28lang%28%3FctrName%29%2C+%22EN%22%29%29%0D%0A%0D%0A%7D&format=csv" AS queryOnSPARQLEndpoint
LOAD CSV WITH HEADERS FROM queryOnSPARQLEndpoint AS row
WITH row LIMIT 500
MERGE (cty:City {city_uri: row.city}) SET cty.city_pop= coalesce(row.pop,0), cty.city_name = row.cityName
MERGE (ctr:Country {ctry_uri: row.ctr}) SET ctr.ctry_name = row.ctrName
MERGE (pty:Party {party_uri: row.party})
MERGE (cty)-[:CTY_IN_COUNTRY]->(ctr)
MERGE (cty)-[:GOVERNING_PARTY]->(pty)
Note that the query runs directly off the DBpedia SPARQL endpoint (same for the next one) so if the endpoint is down for maintenance or any other reason, this graphgist won’t do much :)
Another point worth mentioning is the 'LIMIT 500' in the query. This is just to avoid hitting the limit when running the query as a GraphGist, but of course feel free to remove it if you run it on your own Neo4j instance.
WITH "http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=select+distinct+%3Fparty+%3FpartyName+%3Fideology+%3FideologyName+where+%7B%0D%0A%0D%0A%3Fcity+a+dbo%3ALocation+%3B%0D%0A++++++dbo%3Acountry+%3Fctr+%3B%0D%0A++++++dbo%3AleaderParty+%3Fparty+.%0D%0A%0D%0A%3Fctr+dct%3Asubject+dbc%3ACountries_in_Europe+.%0D%0A%0D%0A%3Fparty+dbo%3Aideology+%3Fideology+%3B%0D%0A+++++++rdfs%3Alabel+%3FpartyName+.%0D%0AFILTER%28langMatches%28lang%28%3FpartyName%29%2C+%22EN%22%29%29%0D%0A%0D%0A%3Fideology+rdfs%3Alabel+%3FideologyName%0D%0AFILTER%28langMatches%28lang%28%3FideologyName%29%2C+%22EN%22%29%29%0D%0A%0D%0A%7D&format=csv" AS queryOnSPARQLEndpoint
LOAD CSV WITH HEADERS FROM queryOnSPARQLEndpoint AS row
WITH row
MERGE (pty:Party {party_uri: row.party}) SET pty.party_name = row.partyName
MERGE (ide:Ideology {ideology_uri: row.ideology}) SET ide.ideology_name = row.ideologyName
MERGE (pty)-[:HAS_IDEOLOGY]->(ide)
And that’s pretty much it. The data in your graph should look something like this for a given city. Bern in Switzerland:
MATCH (ctr:Country)<-[cic:CTY_IN_COUNTRY]->(cty:City {city_name:'Bern'})-[gp:GOVERNING_PARTY]->(party:Party)-[hi:HAS_IDEOLOGY]->(id:Ideology) return *
Or like this for a few cities in Spain:
MATCH (ctr:Country {ctry_name: 'Spain'})<-[cic:CTY_IN_COUNTRY]->(cty:City) WITH ctr, cic, cty LIMIT 10
MATCH (cty)-[gp:GOVERNING_PARTY]->(party:Party)-[hi:HAS_IDEOLOGY]->(id:Ideology) return *
Now a couple of interesting queries:
MATCH (ctr:Country)<-[:CTY_IN_COUNTRY]-(city)-[:GOVERNING_PARTY]->(party)-[:HAS_IDEOLOGY]->(i:Ideology)
RETURN i.ideology_name, COUNT(DISTINCT(ctr)) AS presentInCountries, COLLECT(DISTINCT(ctr.ctry_name)) AS CountryList
ORDER BY presentInCountries DESC LIMIT 10
MATCH (p1:Party)-[:HAS_IDEOLOGY]->(i)<-[:HAS_IDEOLOGY]-(p2:Party)
WHERE (ID(p1) < ID(p2))
WITH p1, p2, COUNT(DISTINCT(i)) AS sharedIdeologyCount, COLLECT(DISTINCT(i.ideology_name)) AS sharedIdeologies
WHERE sharedIdeologyCount > 2
MATCH (p1)<-[:GOVERNING_PARTY]-()-[:CTY_IN_COUNTRY]->(ctr1), (p2)<-[:GOVERNING_PARTY]-()-[:CTY_IN_COUNTRY]->(ctr2)
WHERE ctr1 <> ctr2
RETURN DISTINCT p1.party_name, ctr1.ctry_name, p2.party_name, ctr2.ctry_name, sharedIdeologyCount, sharedIdeologies ORDER BY sharedIdeologyCount DESC LIMIT 10