What do you do when a new colleague starts to talk to you about how they would love to experiment with getting a dataset about Romeo & Juliet into a graph? Yes, that's right, you get your graph boots on, and you start looking out for a great dataset that you could play around with. And as usual, one thing leads to another (it's all connected, remember!), and you end up with this incredible experiment that twists, turns and meanders into something fascinating. That's what happened here too.
Finding one was easy. I very quickly located a dataset on Kaggle that I thought would be really interesting. It's a comma-separated file, about 110k lines long and 10MB in size, that holds all the lines that Shakespeare wrote for his plays. It's just an amazing dataset - not too complicated, but terribly interesting.
The file has the following column headers:
Dataline | Play | PlayerLinenumber | ActSceneLine | Player | PlayerLine |
---|---|---|---|---|---|
abc | def | ghi | jkl | mno | pqr |
Of course you can find the dataset on Kaggle yourself, but I also imported it into a Google Sheet that you can access as well. This sheet is shared publicly on the internet, and can be downloaded as a CSV at any time. That download URL is what we will use for importing this data into Neo4j.
So let's see how we can do that.
Assuming that you are using one of the latest versions of Neo4j, which supports multiple databases, you should start by creating the database for this exercise:
:use system;
create or replace database shakespeare;
:use shakespeare;
Once that is done, you should also create some indexes on the database, as that will help with the data import and querying later on:
create index on :Play(name);
create index on :Player(name);
create index on :Scene(name);
create index on :Act(name);
create index on :Line(PlayerLine);
create index on :Line(Dataline);
create index on :Line(Play);
create index on :Line(Act);
create index on :Line(Scene);
Because the data is already in .csv format, and available on the web via the Google Sheet link above, importing the data as is into Neo4j is a no-brainer. All you need to do is use the `LOAD CSV` command.
Note that I needed to do one specific trick, and that is to convert the `Dataline` and `PlayerLinenumber` fields to integers, so that we can sort/sequence them later on. Otherwise the `create (l:Line)` statement could have just been followed by `set l = line`, but we can't do that now.
Here's the import statement:
load csv with headers from "https://docs.google.com/spreadsheets/d/15c6eUbRMNDrPa0RTuzdrY46OAr2FzKH8tD0KZNoaG8c/export?format=csv&gid=1470339152" as line
create (l:Line)
set l.Dataline = toInteger(line.Dataline)
set l.Play = line.Play
set l.PlayerLinenumber = toInteger(line.PlayerLinenumber)
set l.ActSceneLine = line.ActSceneLine
set l.Player = line.Player
set l.PlayerLine = line.PlayerLine;
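To see why the integer conversion matters, here is a minimal Python sketch that is independent of Neo4j. It assumes a few made-up rows with the same headers as the file; the sample rows and the `to_int` and `parse_rows` helpers are illustrative names, not part of the dataset or of the import itself.

```python
import csv
import io

# Made-up sample rows that mimic the file's headers (not real dataset content).
SAMPLE_CSV = """Dataline,Play,PlayerLinenumber,ActSceneLine,Player,PlayerLine
9,Some Play,1,1.1.1,FIRST PLAYER,first made-up line
10,Some Play,1,1.1.2,SECOND PLAYER,second made-up line
100,Some Play,2,1.1.3,FIRST PLAYER,third made-up line
"""

def to_int(value):
    """Return int(value), or None for an empty or missing cell."""
    return int(value) if value not in (None, "") else None

def parse_rows(text):
    """Read CSV text and convert the two numeric columns to integers."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["Dataline"] = to_int(row["Dataline"])
        row["PlayerLinenumber"] = to_int(row["PlayerLinenumber"])
        rows.append(row)
    return rows

rows = parse_rows(SAMPLE_CSV)
# Sorted as strings, "100" comes before "9"; sorted as integers the order is right.
string_order = sorted(["9", "10", "100"])
int_order = sorted(row["Dataline"] for row in rows)
print(string_order, int_order)
```

Left as strings, the `Dataline` values would sort lexicographically, which would break any attempt to sequence the lines later on.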
Now that the data is in Neo4j, we can start wrangling it into a much more graphy data structure. Here's how we do that.
We already have `Player` in the `Line` nodes. So let's extract that first and make them into separate nodes.
We will use a `MERGE` operation for this to create the node and make sure that it does not get created twice. Next we add the relationship between the player and the line.
match (l:Line)
where l.Player is not null
with l
merge (pl:Player {name: l.Player})
create (pl)-[:ARTICULATES]->(l);
Next, we are going to look at
- where the `Line` fits into the `Scene`,
- where the `Scene` fits into the `Act`,
- and where the `Act` fits into the `Play`.
Here's how we do that.
Every `Line` has an `ActSceneLine` property that holds the act, scene and line number, separated by a `.`. Let's first split this composite property into separate properties. Note that we have to account for some `Line` nodes that don't have an `ActSceneLine` property, as the original dataset did not have it.
match (l:Line)
where l.ActSceneLine is not null
with l, split(l.ActSceneLine,".") as Array
set l.Act = Array[0]
set l.Scene = Array[1]
set l.Line = Array[2];
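The same split can be sketched in a few lines of Python. This is only an illustration of the logic, assuming values of the form `act.scene.line`; the function name `split_act_scene_line` is hypothetical, and the check for exactly three parts is an extra guard that the Cypher above does not perform.

```python
def split_act_scene_line(value):
    """Split 'act.scene.line' into three strings; return None if missing or malformed."""
    if not value:
        return None  # mirrors the "is not null" filter in the query
    parts = value.split(".")
    if len(parts) != 3:
        return None  # extra guard, an assumption of this sketch
    act, scene, line = parts
    return act, scene, line

print(split_act_scene_line("1.2.3"))
print(split_act_scene_line(None))
```

As in the Cypher version, the three parts stay strings; they would need an explicit integer conversion before being sorted numerically.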
So now we can proceed with creating a hierarchy (Play>>Act>>Scene>>Line) for every `Play`.
Here's how we create the `Scene` nodes, and link them to the `Line`s.
match (l:Line)
where l.ActSceneLine is not null
merge (sc:Scene {name: l.Play+" - Act "+l.Act+" - Scene "+l.Scene})
create (l)-[:PART_OF]->(sc);
Next we can link the `Scene`s to the `Act`s:
match (l:Line)-->(sc:Scene)
where l.ActSceneLine is not null
merge (a:Act {name: l.Play+" - Act "+l.Act})
merge (sc)-[:PART_OF]->(a);
And finally we can link the `Act`s to the `Play`s:
match (l:Line)-->(sc:Scene)-->(a:Act)
where l.ActSceneLine is not null
merge (p:Play {name: l.Play})
merge (a)-[:PART_OF]->(p);
One last thing to clean up is the fact that there are some `Line` nodes that don't have an `ActSceneLine` property, and therefore don't have a `Scene` or an `Act`, but that do need to be linked to the `Play`:
match (l:Line)
where l.ActSceneLine is null
merge (p:Play {name: l.Play})
merge (l)-[:PART_OF]->(p);
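To make the resulting hierarchy concrete, here is a small Python sketch of the same naming and linking logic. It is an illustration only: the sample lines are made up, the helper names are hypothetical, and plain dictionaries stand in for the de-duplication that `MERGE` provides in the graph.

```python
def act_name(play, act):
    """Mirror the Cypher string: Play + ' - Act ' + Act."""
    return f"{play} - Act {act}"

def scene_name(play, act, scene):
    """Mirror the Cypher string: Play + ' - Act ' + Act + ' - Scene ' + Scene."""
    return f"{act_name(play, act)} - Scene {scene}"

def build_hierarchy(lines):
    """Return three child -> parent dicts; dict keys de-duplicate names."""
    line_parent, scene_parent, act_parent = {}, {}, {}
    for line in lines:
        play = line["Play"]
        if line.get("Act") is None:
            # Lines without an ActSceneLine hang directly off the play.
            line_parent[line["Dataline"]] = play
            continue
        act = act_name(play, line["Act"])
        scene = scene_name(play, line["Act"], line["Scene"])
        line_parent[line["Dataline"]] = scene
        scene_parent[scene] = act
        act_parent[act] = play
    return line_parent, scene_parent, act_parent

# Made-up sample lines (hypothetical values, not taken from the dataset).
sample = [
    {"Dataline": 1, "Play": "Some Play", "Act": None, "Scene": None},
    {"Dataline": 2, "Play": "Some Play", "Act": "1", "Scene": "1"},
    {"Dataline": 3, "Play": "Some Play", "Act": "1", "Scene": "2"},
]
line_parent, scene_parent, act_parent = build_hierarchy(sample)
print(scene_parent)
```

Because the play name is part of every act and scene name, "Act 1" of one play never collides with "Act 1" of another.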
Next, we will start making the model a bit more understandable.
Currently, the model basically has every `Line` connected to a `Scene` that is in an `Act` of a `Play`. That works fine, but it does not give us a lot of clues as to how the play would unfold. That's why I wanted to create a sequential chain of Lines for every Scene: every Line in the Scene connects to the next one, and then the next one, and so on. Here's how we do that.
We start by linking the lines in a chain.
We will use the `Dataline` property of every `Line` for this.
match (l1:Line), (l2:Line)
where id(l1)>id(l2)
and l1.Play = l2.Play
and l1.Dataline = l2.Dataline + 1
create (l2)-[:FOLLOWED_BY]->(l1);
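The chaining rule itself is simple enough to sketch in Python: two lines are linked when they belong to the same play and their `Dataline` values differ by exactly one. The sample data and the `followed_by_pairs` name below are made up for illustration, and the sketch leaves out the `id()` comparison used in the Cypher.

```python
def followed_by_pairs(lines):
    """Return sorted (earlier, later) Dataline pairs for consecutive lines of one play."""
    by_dataline = {line["Dataline"]: line for line in lines}
    pairs = []
    for earlier in lines:
        later = by_dataline.get(earlier["Dataline"] + 1)
        if later is not None and later["Play"] == earlier["Play"]:
            pairs.append((earlier["Dataline"], later["Dataline"]))
    return sorted(pairs)

# Made-up lines from two hypothetical plays.
chain_sample = [
    {"Dataline": 1, "Play": "Play A"},
    {"Dataline": 2, "Play": "Play A"},
    {"Dataline": 3, "Play": "Play B"},
    {"Dataline": 4, "Play": "Play B"},
]
chain = followed_by_pairs(chain_sample)
print(chain)
```

Note how lines 2 and 3 are not linked, even though their numbers are consecutive, because they belong to different plays.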
Then we proceed by connecting the first and last line to the scene with a specific `STARTS_WITH` and `ENDS` relationship.
We find the first `Dataline` element and start with that:
match (l:Line)-->(s:Scene)
with s, min(l.Dataline) as startline
match (l:Line)
where l.Dataline = startline
create (s)-[:STARTS_WITH]->(l);
And then we find the last `Dataline` element and end with that:
match (l:Line)-->(s:Scene)
with s, max(l.Dataline) as endline
match (l:Line)
where l.Dataline = endline
create (s)<-[:ENDS]-(l);
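Both queries boil down to a per-scene minimum and maximum of `Dataline`. A minimal Python sketch of that aggregation, using made-up scene names and a hypothetical `scene_boundaries` helper, could look like this:

```python
def scene_boundaries(lines):
    """Return {scene: (first_dataline, last_dataline)} using a running min and max."""
    bounds = {}
    for line in lines:
        scene, dataline = line["Scene"], line["Dataline"]
        first, last = bounds.get(scene, (dataline, dataline))
        bounds[scene] = (min(first, dataline), max(last, dataline))
    return bounds

# Made-up lines in two hypothetical scenes, deliberately out of order.
boundary_sample = [
    {"Scene": "Scene X", "Dataline": 6},
    {"Scene": "Scene X", "Dataline": 5},
    {"Scene": "Scene X", "Dataline": 7},
    {"Scene": "Scene Y", "Dataline": 9},
    {"Scene": "Scene Y", "Dataline": 8},
]
bounds = scene_boundaries(boundary_sample)
print(bounds)
```

The first value of each pair corresponds to the line that gets the `STARTS_WITH` relationship, the second to the line that gets `ENDS`.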
Now we can also remove the link between `Line` and `Scene`:
match (l:Line)-[pao:PART_OF]-(sc:Scene)
delete pao;
So how do we use this? We'll take a look at that later when we start querying the data.
Let's now explore some more advanced, data science style use cases for this dataset.
One thing that I was trying to figure out is whether the graph could help me understand which characters/players in the graph are more important than others. There are different ways of doing that for sure, and I will just explore two in this article.
The first is the number of lines that a Player has. Sounds like a simple enough proxy for importance, right? If a Player has more lines, there's a likelihood that they will have a more important role in the story. So let's go there.
First we need to connect the Players to the Plays for this. That's easy enough, as the indirect connection is of course already there. Here's an easy way to achieve what we need:
match (pl:Player)-->(l:Line)
with pl, l
match (p:Play {name:l.Play})
merge (pl)-[pi:PLAYS_IN]->(p)
on create set pi.nroflines=1
on match set pi.nroflines= pi.nroflines+1;
Note that the `PLAYS_IN` relationship now also keeps a running count in its `nroflines` property, i.e. the number of lines that a Player has spoken in a particular play.
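The `on create` / `on match` pair is effectively a counter keyed on the (Player, Play) combination. As a rough Python analogy, with made-up lines and a hypothetical `count_lines` helper:

```python
from collections import Counter

def count_lines(lines):
    """Count spoken lines per (player, play), skipping lines without a player."""
    counts = Counter()
    for line in lines:
        if line.get("Player"):
            counts[(line["Player"], line["Play"])] += 1
    return counts

# Made-up lines (hypothetical players and play).
count_sample = [
    {"Player": "PLAYER ONE", "Play": "Some Play"},
    {"Player": "PLAYER TWO", "Play": "Some Play"},
    {"Player": "PLAYER ONE", "Play": "Some Play"},
    {"Player": None, "Play": "Some Play"},
]
line_counts = count_lines(count_sample)
print(line_counts)
```

The first time a pair is seen the count starts at 1, and every later occurrence adds 1, which is what the two `set` clauses do on the relationship.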
Next, I wanted to write a query that would find the top 3 Players in every Play. We use a query with a subquery for that: the first part finds all the Plays, and then for every Play I look for the Players and the number of lines that I have stored on the relationship.
match (p:Play)
call {
with p
match (pl:Player)-[pi:PLAYS_IN]->(p)
return p.name as Play, pl.name as Players, pi.nroflines as NrOfLines
order by NrOfLines desc
limit 3
}
return Play, collect(Players) as TopPlayers, collect(NrOfLines) as TopPlayersLines
order by Play;
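The "top 3 per group" pattern of the subquery can be sketched in Python as a group, sort and slice. The counts below are made up, `top_players` is a hypothetical name, and breaking ties by player name is an assumption of this sketch that the Cypher query does not make.

```python
from collections import defaultdict

def top_players(line_counts, n=3):
    """Return {play: [(player, count), ...]} with at most n players per play."""
    per_play = defaultdict(list)
    for (player, play), count in line_counts.items():
        per_play[play].append((player, count))
    result = {}
    for play in sorted(per_play):
        # Highest count first; ties broken by name (an assumption of this sketch).
        ranked = sorted(per_play[play], key=lambda pc: (-pc[1], pc[0]))
        result[play] = ranked[:n]
    return result

# Made-up counts keyed by (player, play).
top_sample = {
    ("A", "Play 1"): 5,
    ("B", "Play 1"): 3,
    ("C", "Play 1"): 9,
    ("D", "Play 1"): 1,
    ("E", "Play 2"): 2,
}
top = top_players(top_sample)
print(top)
```

The limit applies inside each group, which is exactly why the Cypher version needs the `call { ... }` subquery rather than a single global `limit`.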
This already gives us a nice little indication of the importance of the Players, but I would like to suggest a more advanced approach.
Here's what I want to do: I would like to infer a new kind of relationship in our graph, called `RELATED_TO`. This relationship is introduced between two Player nodes if the Players appear together at one of 3 levels:
- appearing together in the Play, i.e. level 1
- appearing together in an Act of a Play, i.e. level 2
- appearing together in a Scene of an Act of a Play, i.e. level 3
This new relationship will create a mono-partite subgraph of `(Players)-[:RELATED_TO]->(OtherPlayers)`, which will be very useful for graph data science work later on. So let's create this.
Here's the query for level 1, Players appearing together in the same Play:
match (pl1:Player)-->(p:Play)<--(pl2:Player)
where id(pl1)>id(pl2)
merge (pl1)-[r:RELATED_TO]->(pl2)
set r.level=1;
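Level 2, appearing together in an Act, requires a two-step process: first we link the Players to the Acts, then we relate the Players that share an Act.
Linking Players to Acts is very similar to how we linked Players to Plays: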
match (pl:Player)-->(l:Line)
with pl, l
match (a:Act {name:l.Play+" - Act "+l.Act})
merge (pl)-[pi:PLAYS_IN]->(a)
on create set pi.nroflines=1
on match set pi.nroflines= pi.nroflines+1;
Here's how we can create the relationships between players based on being in the same act:
match (pl1:Player)-->(a:Act)<--(pl2:Player)
where id(pl1)>id(pl2)
merge (pl1)-[r:RELATED_TO]->(pl2)
set r.level=2;
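Again, level 3 needs two steps: linking the Players to the Scenes, and then relating the Players that share a Scene.
We go about the first step in a very similar way: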
match (pl:Player)-->(l:Line)
with pl, l
match (s:Scene {name:l.Play+" - Act "+l.Act+" - Scene "+l.Scene})
merge (pl)-[pi:PLAYS_IN]->(s)
on create set pi.nroflines=1
on match set pi.nroflines= pi.nroflines+1;
Again, very similar to the above:
match (pl1:Player)-->(s:Scene)<--(pl2:Player)
where id(pl1)>id(pl2)
merge (pl1)-[r:RELATED_TO]->(pl2)
set r.level=3;
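Because the three queries run in order and each one overwrites `r.level`, every pair of Players ends up with the deepest level at which they appear together. Here is a minimal Python sketch of that behaviour, assuming made-up appearance tuples and a hypothetical `related_to` helper:

```python
from collections import defaultdict
from itertools import combinations

def related_to(appearances):
    """appearances: (player, play, act, scene) tuples. Returns {(p1, p2): level}."""
    levels = {}
    for level in (1, 2, 3):
        # Level 1 groups by play, level 2 by (play, act), level 3 by (play, act, scene).
        members = defaultdict(set)
        for player, *location in appearances:
            key = tuple(location[:level])
            if None not in key:  # lines without act/scene only count at play level
                members[key].add(player)
        for players in members.values():
            for pair in combinations(sorted(players), 2):
                levels[pair] = level  # deeper levels overwrite shallower ones
    return levels

# Made-up appearances in one hypothetical play.
appearance_sample = [
    ("A", "Some Play", "1", "1"),
    ("B", "Some Play", "1", "1"),
    ("C", "Some Play", "1", "2"),
    ("D", "Some Play", "2", "1"),
]
relations = related_to(appearance_sample)
print(relations)
```

Sorting the players before pairing them plays the same role as the `id(pl1)>id(pl2)` filter: each pair is produced once, in one fixed direction.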
That sets us up nicely for a couple of interesting explorations. Let's get into that.
Of course there are some great ways to now start working with the data. First we will do some simple queries in the Neo4j Browser.
Let's look at this in two ways: in the Neo4j Browser and in Neo4j Bloom.
Here's a fairly simple Cypher query, that would look at one entire scene. We are taking a scene from Romeo and Juliet in this case.
match entirescene = (p:Play)--(a:Act)--(s:Scene)-[:STARTS_WITH]->(firstline:Line)-[:FOLLOWED_BY*]-(lastline:Line)-[:ENDS]-(s)
where p.name contains "Romeo"
with entirescene, nodes(entirescene) as nodes
limit 1
unwind nodes as node
match (node)-[r]-(conn)
return entirescene, node, r, conn;
Obviously that's not the greatest visualisation. So let's improve that.
In Neo4j Bloom
In Neo4j Bloom, we can customize the query above by making it into a search phrase. Essentially we parametrise the `Play` name in the search phrase, using a `$param`.
The result is clearly a lot easier to look at.
Let's look at another query pattern.
Based on the `RELATED_TO` relationship that we created, we can now look at the players and their "network" of interactions during the play.
Here's a simple view of the network of Player relations, based on the relations above, for the Romeo and Juliet Play. If we run this query:
// Reach the Players of the Play via PLAYS_IN, as Lines are no longer linked to Scenes by PART_OF
match (pl1:Player)-[:PLAYS_IN]->(p:Play {name: "Romeo and Juliet"})
with pl1
match playerrelations = (pl1)-[:RELATED_TO]-(pl2:Player)
return playerrelations;
The result very quickly becomes a bit of a hairball:
But luckily, we can also parametrise this as a search phrase in Bloom.
In Neo4j Bloom
Applying the search phrase gives a much more interesting picture, which allows me to very quickly zoom in on the more important Player nodes.
The point here is of course that, without reading a single line of the text, the graph is telling me which Players are likely to be more important than others. I just love that. I think this is why we can also apply this to so many other domains. The graph structure is immediately giving us insights.
Now let's see how we can enhance this even further, by applying graph algorithms from the Graph Data Science Library to this structure. Should be fun!
Now that we have that `RELATED_TO` relationship, we can actually do some very interesting graph data science work, as this is now a mono-partite subgraph, containing only `Player` nodes and `RELATED_TO` relationships.
I am a big fan of using NEuler for doing some of this simple graph data science work. It's just a few clicks away, and it generates the code for the most interesting algorithms. I have picked two in this case: PageRank and Betweenness, both of which are centrality algorithms.
Calculating Pagerank centrality
Here's how we do that:
With a few clicks we can actually configure the algorithm in NEuler.
The code that is actually being run for this looks like this:
:param limit => ( 42);
:param config => ({
  nodeProjection: 'Player',
  relationshipProjection: {
    relType: {
      type: 'RELATED_TO',
      orientation: 'UNDIRECTED',
      properties: {
        level: {
          property: 'level',
          defaultValue: 1
        }
      }
    }
  },
  relationshipWeightProperty: 'level',
  dampingFactor: 0.85,
  maxIterations: 20,
  writeProperty: 'pagerank'
});
:param communityNodeLimit => ( 10);
CALL gds.pageRank.write($config);
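For intuition about what the algorithm computes, here is a textbook weighted PageRank in plain Python, reusing the damping factor and iteration count from the config. This is a sketch under stated assumptions, not the Graph Data Science library's implementation: it normalises scores so that they sum to 1, so absolute values will differ from what GDS writes. The `pagerank` function and the star-shaped sample graph are made up for illustration.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """edges: {(a, b): weight}, treated as undirected. Returns {node: score}."""
    neighbours = {}
    for (a, b), weight in edges.items():
        neighbours.setdefault(a, {})[b] = weight
        neighbours.setdefault(b, {})[a] = weight
    nodes = sorted(neighbours)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    # Total edge weight per node, used to split its rank among its neighbours.
    strength = {v: sum(neighbours[v].values()) for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1 - damping) / n for v in nodes}
        for v in nodes:
            for u, weight in neighbours[v].items():
                new_rank[u] += damping * rank[v] * weight / strength[v]
        rank = new_rank
    return rank

# Made-up star graph: one hub related to three other nodes, all with weight 1.
star = {("hub", "a"): 1, ("hub", "b"): 1, ("hub", "c"): 1}
ranks = pagerank(star)
print(max(ranks, key=ranks.get))
```

A node scores highly when it is connected, through heavily weighted relationships, to other nodes that score highly themselves; using `level` as the weight means that sharing a Scene counts for more than merely sharing a Play.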
Once that's done, we can run a very simple Cypher query to show the Pagerank property of all the Players:
// Reach the Players of the Play via PLAYS_IN, as Lines are no longer linked to Scenes by PART_OF
match (pl:Player)-[:PLAYS_IN]->(p:Play {name: "Romeo and Juliet"})
return distinct pl.name, pl.pagerank, pl.betweenness
order by pl.pagerank desc
limit 10;
Then we can also run another interesting centrality metric. Here's how we do that:
Calculating Betweenness centrality
With a few clicks we can actually configure the algorithm in NEuler.
:param limit => ( 42);
:param config => ({
  nodeProjection: 'Player',
  relationshipProjection: {
    relType: {
      type: 'RELATED_TO',
      orientation: 'UNDIRECTED',
      properties: {}
    }
  },
  writeProperty: 'betweenness'
});
:param communityNodeLimit => ( 10);
CALL gds.betweenness.write($config);
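Betweenness measures how often a node lies on the shortest paths between other nodes. The sketch below uses Brandes' algorithm for an unweighted, undirected graph in plain Python. It is an illustration of the idea rather than the library's implementation, and the decision to halve the totals (so each pair of endpoints counts once) is a convention of this sketch; the scores GDS writes may be scaled differently. The `betweenness` function and the sample path graph are made up.

```python
from collections import deque

def betweenness(adjacency):
    """Brandes' algorithm. adjacency: {node: set of neighbours}. Returns {node: score}."""
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        order = []                                 # nodes in order of discovery
        preds = {v: [] for v in adjacency}         # predecessors on shortest paths
        sigma = {v: 0 for v in adjacency}          # number of shortest paths from source
        dist = {v: -1 for v in adjacency}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:                               # breadth-first search
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adjacency}
        for w in reversed(order):                  # accumulate dependencies backwards
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every undirected pair was visited from both endpoints, so halve the totals.
    return {v: s / 2 for v, s in score.items()}

# Made-up path graph a - b - c: only b sits between two other nodes.
path_graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
between = betweenness(path_graph)
print(between)
```

Unlike PageRank, this measure ignores the `level` weight and rewards Players who bridge otherwise separate groups of characters.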
Once that's done, we can run a very simple Cypher query to show the Betweenness of the Players:
// Reach the Players of the Play via PLAYS_IN, as Lines are no longer linked to Scenes by PART_OF
match (pl:Player)-[:PLAYS_IN]->(p:Play {name: "Romeo and Juliet"})
return distinct pl.name, pl.pagerank, pl.betweenness
order by pl.betweenness desc
limit 10;
No doubt there are tons of additional things we could do with this dataset, but here's where my exercise will end. I am hoping that this was a useful story for you - it definitely was for me.
All the best
Rik Van Bruggen