All the Dadjokes are in this Twitter list. It's a very funny list.
It's also interesting to work directly with the Twitter API, if you want to:
:param listofaccounts => [["baddadjokes",268948349],["dadsaysjokes",905028905026846720],["dadsjokes",106358414],["ThePunnyWorld",2472108787],["DadJokeMan",766659901242413057],["GroanBot",1065122605483474944],["RandomJokesIO",954397214004031489]];
:param twitterbearer: <<TWITTERBEARER>>;
UNWIND $listofaccounts AS account
CALL apoc.load.jsonParams(
"https://api.twitter.com/2/users/" + account[1] + "/tweets?exclude=retweets,replies",
{Authorization: "Bearer " + $twitterbearer},
null
)
YIELD value
UNWIND value.data AS status
CREATE (t:Tweet {id: status.id, handle: account[0], text: status.text});
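For reference, here is what the same request looks like outside of Cypher - a minimal, hypothetical Python sketch of the call that apoc.load.jsonParams makes above (the bearer token is a placeholder, not a real credential):

```python
import urllib.request

def build_tweets_request(user_id, bearer_token):
    """Build the request for the Twitter v2 user-tweets endpoint,
    excluding retweets and replies, as in the Cypher above."""
    url = ("https://api.twitter.com/2/users/%d/tweets"
           "?exclude=retweets,replies" % user_id)
    headers = {"Authorization": "Bearer " + bearer_token}
    return urllib.request.Request(url, headers=headers)

# Opening this request with urllib.request.urlopen would return a JSON
# payload with the tweets under the "data" key.
req = build_tweets_request(268948349, "<<TWITTERBEARER>>")
print(req.full_url)
```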
Ended up using vicinitas.io: this tool allows you to download a really good chunk of tweets from specific Twitter handles - but the free version is rate-limited to 3200 tweets per account. For what I want to do here, though, that is plenty. I downloaded the tweets for the accounts above, and put everything together into one gsheet.
Now, you can either:
- create a .csv version of it (example over here) and put that in the ./import directory of your Neo4j server,
- or download the .csv version of the Google Sheet straight from the export URL.
Now that we have a nice little dataset to play with, let's proceed to importing that into our database.
First things first: let's set up the indexes that we will need later on in this process:
CREATE INDEX tweet_index FOR (t:Tweet) ON (t.Text);
CREATE INDEX dadjoke_index FOR (d:Dadjoke) ON (d.Text);
Let's create the graph straight from the Google Sheet:
LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1MwHX5hM-Vda5o4ZQVnCv4upKepL5rHxcUrqfT69u5Ro/export?format=csv&gid=1582640786" AS csv
CREATE (t:Tweet)
SET t = csv;
Then, we will proceed to do some cleanup:
Here's how we use DETACH DELETE
to weed out some of the tweets that will not yield interesting dadjokes:
MATCH (t:Tweet)
WHERE t.Text STARTS WITH "http"
OR t.Text STARTS WITH "RT"
OR t.Text STARTS WITH "@"
DETACH DELETE t;
The Tweet nodes will be connected to their Handle nodes, using the MERGE command:
MATCH (t:Tweet)
MERGE (h:Handle {name: t.`Screen Name`})
CREATE (t)<-[:TWEETS {datetime: datetime(t.UTC)}]-(h);
Now comes the interesting part - creating Dadjokes out of the Tweets. The MERGE statement will make sure that identical jokes don't get created twice, thereby already deduplicating some jokes.
MATCH (t:Tweet)
MERGE (dj:Dadjoke {Text: t.Text})
CREATE (t)-[:REFERENCES_DADJOKE]->(dj);
We went from 18185 Tweets in the database, to 14313 Dadjokes.
We can see that by running this query:
MATCH (t1:Tweet), (t2:Tweet)
WHERE t1 <> t2
AND t1.Text = t2.Text
RETURN t1.`Screen Name`, t2.`Screen Name`, t1.Text;
So there is clearly a disambiguation task at hand here! Let's pursue it! But before we do that, we should also do some housekeeping: aggregating the favorite and retweet counts of the Tweets onto their Dadjoke nodes.
MATCH (t:Tweet)--(dj:Dadjoke)
WITH dj, sum(toInteger(t.Favorites)) as sumoffavorites, sum(toInteger(t.Retweets)) as sumofretweets
SET dj.SumOfFavorites = sumoffavorites
SET dj.SumOfRetweets = sumofretweets;
Look at the "Amazon" example:
MATCH path = (dj:Dadjoke)-[*..2]-(conn)
WHERE dj.Text CONTAINS "pyjamazon"
RETURN path;
The first strategy is all about comparing the Text property on the Dadjoke nodes. Turns out there is a second strategy based on really interesting algorithms that calculate "string metrics": they compare two different strings with one another, and express how similar they are in a numerical score. I found this guide really useful to understand it a bit better.
We will use two of these similarity coefficients, and calculate them using apoc.text.levenshteinSimilarity and apoc.text.sorensenDiceSimilarity. Both of these will yield a value between 0 and 1 - where 0 is very low similarity, and 1 means the strings are identical.
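To get a feel for what these two metrics actually compute, here is a small illustrative Python sketch of both. These are simplified re-implementations for intuition only, not APOC's actual code - APOC's implementations may differ in details such as tokenisation:

```python
def levenshtein_similarity(a, b):
    """1 - (edit distance / length of the longer string)."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))

def sorensen_dice_similarity(a, b):
    """2 * |shared character bigrams| / (total bigrams in a and b)."""
    bigrams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    ba, bb = bigrams(a.lower()), bigrams(b.lower())
    if not ba and not bb:
        return 1.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

print(levenshtein_similarity("night", "nacht"))    # 0.6  (2 edits over 5 chars)
print(sorensen_dice_similarity("night", "nacht"))  # 0.25 (1 shared bigram "ht")
```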
Let's take a look:
MATCH (dj1:Dadjoke), (dj2:Dadjoke)
WHERE id(dj1)<id(dj2)
AND dj1.Text <> dj2.Text
AND left(dj1.Text,30) = left(dj2.Text,30)
WITH dj1.Text AS dj1text, dj2.Text AS dj2text, apoc.text.levenshteinSimilarity(dj1.Text, dj2.Text) AS LevenshteinSimilarity,
apoc.text.sorensenDiceSimilarity(dj1.Text, dj2.Text) AS SorensenDiceSimilarity
WHERE LevenshteinSimilarity < 0.65
RETURN left(dj1text,60) as `First 60 chars of dadjoke1`,left(dj2text,60) as `First 60 chars of dadjoke2`,LevenshteinSimilarity,SorensenDiceSimilarity
ORDER BY LevenshteinSimilarity DESC;
Then we can decide, by looking at the data, where we should be cutting off the similarity scores. When I looked at this, I judged that if the similarity was greater than 0.7, the two texts of the Dadjoke nodes were likely to be the same. Lower than that, and the texts started to diverge quite significantly. So in the next phase I will be merging the jokes that have similarity scores above my cutoff point.
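One simple way to eyeball such a cutoff - not part of the original workflow, just an illustration with made-up scores - is to bucket the candidate pairs by similarity and look for where the distribution thins out:

```python
from collections import Counter

# Hypothetical similarity scores for candidate Dadjoke pairs (made-up data).
scores = [0.96, 0.91, 0.88, 0.72, 0.71, 0.56, 0.41]

# Bucket the scores to one decimal and print a tiny text histogram.
buckets = Counter(round(s, 1) for s in scores)
for b in sorted(buckets, reverse=True):
    print("%.1f: %s" % (b, "#" * buckets[b]))
```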
In case the Levenshtein similarity is larger than 0.7, we would want the Dadjoke nodes to be merged (and the links to the Tweets to be moved as well). That would be an interesting use case for one of the graph refactoring procedures in apoc: apoc.refactor.mergeNodes will do that trick.
:auto MATCH (dj1:Dadjoke), (dj2:Dadjoke)
WHERE id(dj1)<id(dj2)
AND dj1.Text <> dj2.Text
AND left(dj1.Text,20) = left(dj2.Text,20)
WITH [dj1,dj2] AS nodes, apoc.text.levenshteinSimilarity(dj1.Text, dj2.Text) AS LevenshteinSimilarity
WHERE LevenshteinSimilarity > 0.7
CALL {
WITH nodes
CALL apoc.refactor.mergeNodes(nodes, {properties: {Text: "discard", SumOfFavorites: "combine", SumOfRetweets: "combine"}, mergeRels: true}) YIELD node
RETURN count(node) AS mergednodes
} IN TRANSACTIONS OF 1 ROWS
RETURN sum(mergednodes) AS countofmergednodes;
Note that we have to make these changes one by one - in transactions of 1 row - because merging multiple overlapping node pairs in a single transaction could otherwise lead to a deadlock situation.
Before ending this section, we do need to re-aggregate the Tweet stats onto the Dadjoke nodes. Same procedure as before:
MATCH (t:Tweet)--(dj:Dadjoke)
WITH dj, sum(toInteger(t.Favorites)) AS sumoffavorites, sum(toInteger(t.Retweets)) AS sumofretweets
SET dj.SumOfFavorites = sumoffavorites
SET dj.SumOfRetweets = sumofretweets;
That wraps up the disambiguation based on string metrics. Now, let's get to the true power of the graph, by using NLP and Entity Extraction first, and graph analytics second.
Now we will take it a final step further that will provide a wonderfully powerful example of how and why graphs are so good at analysing the structural characteristics of data, and make interesting and amazing recommendations on the back of that.
Here's what we are going to do:
- we are going to use Natural Language Processing to extract the entities that are mentioned in our Dadjokes. To do that, we are going to use the amazing Google Cloud NLP Service, and call it from APOC. This will yield a connected structure that will tell us exactly which entities are mentioned in every joke.
- then we are going to use that graph of dadjokes connected to entities to figure out if the structure of the links can help us with further disambiguation of the jokes.
So let's start with the start.
I have actually experimented with this technique before - and found it to be extremely powerful. See some posts that I wrote before: see over here and over here if you want to see other examples. Using the Natural Language Processing (NLP) procedures in the apoc libraries, we can now call the NLP services of GCP, AWS and Azure really easily, and use them to automatically extract the entities from the text of the Dadjokes.
So, after having installed the required NLP .jar file in the plugins directory of the Neo4j server, we can start analysing the descriptions using the Google Cloud NLP service. Here's how that works:
:param apiKey =>("SOMEAPIKEY");
:auto MATCH (dj:Dadjoke)
WHERE NOT((dj)-[:HAS_ENTITY]->())
WITH dj
CALL {
WITH dj
CALL apoc.nlp.gcp.entities.graph(dj, {
key: $apiKey,
nodeProperty: "Text",
scoreCutoff: 0.01,
writeRelationshipType: "HAS_ENTITY",
writeRelationshipProperty: "gcpEntityScore",
write: true
})
YIELD graph AS g
RETURN g
} IN TRANSACTIONS OF 10 ROWS
RETURN "Success!";
Two more comments on the NLP topic:
- There seemed to be some types of text that the Google NLP engine struggled with, and could not extract entities from. Look at this query:
MATCH (dj:Dadjoke)
WHERE NOT((dj)-[:HAS_ENTITY]->())
RETURN dj.Text
and notice that the results have a lot of plays on words related to numbers. There could be ways to try to fix that, for example by replacing numbers with words - but I did not go into that.
- If you don't have the time, or the $$$, to run the NLP on all these nodes, you could also just do the NLP on a limited subset of nodes:
:auto MATCH (dj:Dadjoke)
WHERE dj.Text CONTAINS "amazon"
WITH dj
CALL {
WITH dj
CALL apoc.nlp.gcp.entities.graph(dj, {
key: $apiKey,
nodeProperty: "Text",
scoreCutoff: 0.01,
writeRelationshipType: "HAS_ENTITY",
writeRelationshipProperty: "gcpEntityScore",
write: true
})
YIELD graph AS g
RETURN g
} IN TRANSACTIONS OF 10 ROWS
RETURN "Success!";
That really only has a handful of nodes to process, so it would return quite quickly, and the financial cost would be negligible.
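The "replacing numbers with words" idea mentioned above was not pursued in this guide, but a naive pre-processing sketch could look like this (digit-by-digit only - it would not render multi-digit numbers like "42" as "forty-two"):

```python
import re

# Map single digits to words (a deliberately naive, illustrative fix).
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def spell_out_digits(text):
    """Replace every digit with its English word before sending to NLP."""
    return re.sub(r"\d", lambda m: DIGITS[m.group(0)], text)

print(spell_out_digits("Why was 6 afraid of 7? Because 7 8 9!"))
# Why was six afraid of seven? Because seven eight nine!
```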
Let's also add indexes for the new Entity nodes and HAS_ENTITY relationships:
CREATE INDEX entity_index FOR (e:Entity) ON (e.text);
CREATE INDEX entity_rel_index FOR ()-[he:HAS_ENTITY]-() ON (he.gcpEntityScore);
Refactoring after NLP-based entity extraction: consolidating the Entity nodes with different capitalisation
After running the NLP procedures, we quickly notice that there is some duplication in the Entity nodes that have been detected. Specifically, the capitalisation of the Entity nodes can be quite confusing, and we should take some care to resolve this - which is very easily done.
:auto MATCH (e1:Entity), (e2:Entity)
WHERE id(e1) < id(e2)
AND toLower(e1.text) = toLower(e2.text)
CALL {
WITH e1, e2
MATCH (e2)<-[:HAS_ENTITY]-(dj:Dadjoke)
CREATE (dj)-[:HAS_ENTITY]->(e1)
WITH e1, e2
CALL apoc.create.setLabels(e1,labels(e2))
YIELD node
SET e1.text = toLower(e1.text)
DETACH DELETE e2
} IN TRANSACTIONS OF 10 ROWS
RETURN "Success!";
We can now quickly explore a specific part of the graph that we have looked at before:
MATCH path = (e:Entity)--(dj:Dadjoke)-[:REFERENCES_DADJOKE]-(t:Tweet)--(h:Handle)
WHERE dj.Text CONTAINS "amazon"
RETURN path;
Note that we now see some notable differences:
- 2 dadjokes out of 3 mention Jeff's pyjamas, and 1 of the dadjokes mentions pajamas
- 2 dadjokes out of 3 mention Jeff Bezos, and 1 of the dadjokes mentions eff Bezos
- there is no overlap between those sets of 2 dadjokes - the only thing all of them have in common is the Bed entity
This will be of importance when we calculate our next set of similarity scores.
To explore this, we will be using the Jaccard similarity coefficient, which is part of the Neo4j Graph Data Science library that we have installed on our server. More about this coefficient can be found on Wikipedia.
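As a quick intuition for what gds.nodeSimilarity will compute: the Jaccard coefficient of two jokes is the size of the intersection of their entity sets divided by the size of the union. A minimal sketch (the entity sets here are made up for illustration):

```python
def jaccard(a, b):
    """|A intersect B| / |A union B| for two sets; 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical entity sets extracted from two variants of the same joke.
joke1 = {"jeff bezos", "pyjamas", "bed"}
joke2 = {"jeff bezos", "pajamas", "bed"}
print(jaccard(joke1, joke2))  # 0.5 -> 2 shared entities out of 4 distinct
```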
:param limit => ( 500);
:param graphConfig => ({
nodeProjection: '*',
relationshipProjection: {
relType: {
type: 'HAS_ENTITY',
orientation: 'NATURAL',
properties: {}
}
}
});
:param config => ({
similarityMetric: 'Jaccard',
similarityCutoff: 0,
degreeCutoff: 2,
writeProperty: 'score',
writeRelationshipType: 'JACCARD_SIMILAR'
});
:param communityNodeLimit => ( 10);
:param generatedName => ('in-memory-graph-1663777188212');
CALL gds.graph.project($generatedName, $graphConfig.nodeProjection, $graphConfig.relationshipProjection, {});
CALL gds.nodeSimilarity.write($generatedName, $config);
MATCH (from)-[rel:`JACCARD_SIMILAR`]-(to)
WHERE rel.`score` IS NOT NULL
RETURN from, to, rel.`score` AS similarity
ORDER BY similarity DESC
LIMIT toInteger($limit);
CALL gds.graph.drop($generatedName);
This will generate a set of relationships between Dadjoke nodes that will indicate how similar they are, based on a JACCARD_SIMILAR relationship that will have a score weight property on it. We can easily add an index on that relationship:
CREATE INDEX jaccard_similarity_index FOR ()-[s:JACCARD_SIMILAR]-() ON (s.score);
We can explore this similarity with a few queries.
MATCH p=(h:Handle)--(t:Tweet)--(d:Dadjoke)-[r:JACCARD_SIMILAR]->()
WHERE d.Text CONTAINS "pyjamazon"
RETURN p;
Returning to our objective of disambiguation of the jokes: there seem to be hundreds of additional disambiguation results that we could eliminate using the Jaccard metric.
MATCH (d1:Dadjoke)-[r:JACCARD_SIMILAR]->(d2:Dadjoke)
WHERE r.score > 0.9
AND id(d1) > id(d2)
AND d1.Text CONTAINS "Doctor"
RETURN d1.Text, d2.Text, r.score;
Turns out there are about 844 of these types of eliminations that we could do.
MATCH p=()-[r:JACCARD_SIMILAR]->()
WHERE r.score >0.9
WITH count(p) AS count
RETURN count;
We could then of course also actually perform the disambiguation now and remove the duplicate dadjokes based on the JACCARD_SIMILAR
score.
Now that we have a disambiguated graph of dadjokes, let's have some fun and explore it.
MATCH ()-[r:REFERENCES_DADJOKE]->(dj:Dadjoke)
WITH dj.Text AS Joke, count(r) AS NrOfTimesTweeted
RETURN Joke, NrOfTimesTweeted
ORDER BY NrOfTimesTweeted DESC
LIMIT 10;
MATCH ()-[r:REFERENCES_DADJOKE]->(dj:Dadjoke)
RETURN dj.Text AS Joke, dj.SumOfFavorites AS NrOfTimesFavorited, dj.SumOfRetweets AS NrOfTimesRetweeted
ORDER BY NrOfTimesFavorited DESC
LIMIT 10;
Let's explore 3 alternative ways to find jokes about cars.
MATCH (dj:Dadjoke) WHERE dj.Text CONTAINS "car" RETURN dj.Text LIMIT 10;
MATCH (e:Entity)--(dj:Dadjoke) WHERE e.text CONTAINS "car" RETURN dj.Text LIMIT 10;
MATCH (e:Entity)--(dj:Dadjoke) WHERE e.text = "car" RETURN dj.Text LIMIT 10;
This was another great example:
MATCH p=(h:Handle)--(t:Tweet)--(dj:Dadjoke)-[r:JACCARD_SIMILAR]->()
WHERE dj.Text CONTAINS "spaghetti"
AND (dj.Text CONTAINS "bike" OR dj.Text CONTAINS "car")
RETURN p;
It's amazing to see how the same conceptual joke is being reused in different ways!
WITH ["car","spaghetti","water","boil","hell"] AS entities
MATCH p = (h:Handle)--(t:Tweet)--(dj:Dadjoke)--(e:Entity)
WHERE e.text IN entities
RETURN p;
MATCH (h:Handle)--(t:Tweet)--(d:Dadjoke)
WHERE d.Text CONTAINS "hell"
AND d.Text CONTAINS "water"
RETURN d;
MATCH path = (h1:Handle)-[*2..2]->(dj:Dadjoke)<-[*2..2]-(h2:Handle)
WHERE id(h1)<id(h2)
RETURN path;
This takes a while to load, but you can clearly see a few cliques in this picture.
Let's see how many such paths are actually there:
MATCH path = (h1:Handle)-[*2..2]-(dj:Dadjoke)-[*2..2]-(h2:Handle)
WHERE id(h1)<id(h2)
WITH h1.name AS FirstHandle, h2.name AS SecondHandle, count(path) AS NrOfSharedJokes
RETURN FirstHandle, SecondHandle,NrOfSharedJokes
ORDER BY NrOfSharedJokes DESC;
The result is quite enlightening: GroanBot and RandomJokesIO are clearly reinforcing one another. My personal guess is that they are truly just bots.
We already have the Favorite/Retweet scores of all the dadjokes summed up, so we can also look at which Entity
nodes have the highest scores that way:
MATCH (e:Entity)--(dj:Dadjoke)
WITH e, sum(toInteger(dj.SumOfFavorites)) AS sumofsumoffavorites, sum(toInteger(dj.SumOfRetweets)) AS sumofsumofretweets
SET e.SumOfSumOfFavorites = sumofsumoffavorites
SET e.SumOfSumOfRetweets = sumofsumofretweets;
This operation finishes very quickly, and so then we can do the exploration quite easily, and figure out what the entities are that our dadjokers are mostly joking about:
MATCH (e:Entity)
RETURN e.text, e.SumOfSumOfFavorites AS EntityFavoriteScore, e.SumOfSumOfRetweets AS EntityRetweetScore
ORDER BY EntityFavoriteScore DESC
LIMIT 10;
Surprise: it's about wives and bosses. Right!
You can find the full blogpost of this guide on this page and the full markdown file over here.
Cheers
Rik