@rvanbruggen
Last active October 20, 2022 13:04
A Graph Database and a Dadjoke walk into a bar...

Part 1/6: Building a Dadjoke database - from nothing

All the Dadjokes are in this Twitter list. It's a very funny list.

It's interesting to work directly with the Twitter API if you want to:

:param listofaccounts => [["baddadjokes",268948349],["dadsaysjokes",905028905026846720],["dadsjokes",106358414],["ThePunnyWorld",2472108787],["DadJokeMan",766659901242413057],["GroanBot",1065122605483474944],["RandomJokesIO",954397214004031489]];

:param twitterbearer => "<<TWITTERBEARER>>";

UNWIND $listofaccounts AS account
CALL apoc.load.jsonParams(
  "https://api.twitter.com/2/users/" + account[1] + "/tweets",
  {Authorization: "Bearer "+$twitterbearer, exclude: "retweets,replies"},
  null
)
YIELD value
UNWIND value.data AS status
CREATE (t:Tweet {id: status.id, handle: account[0], text: status.text});

Ended up using vicinitas.io: this tool allows you to download a really good chunk of tweets from specific Twitter handles - but it is rate-limited to 3200 tweets per account, for the free version. For what I want to do here, though, that is plenty. Downloaded it for the accounts above, and put everything together into one gsheet.

Now, you can either

  • create a .csv version of it (example over here) and put that in the ./import directory of your Neo4j server,
  • or download the .csv version of the Google Sheet straight from the export URL.

Now that we have a nice little dataset to play with, let's proceed to importing that into our database.


Part 2/6: Importing the Dadjokes into the Dadjoke Graph

First things first: let's set up the indexes that we will need later on in this process:

CREATE INDEX tweet_index FOR (t:Tweet) ON (t.Text);
CREATE INDEX dadjoke_index FOR (d:Dadjoke) ON (d.Text);

Let's create the graph straight from the Google Sheet:

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1MwHX5hM-Vda5o4ZQVnCv4upKepL5rHxcUrqfT69u5Ro/export?format=csv&gid=1582640786" AS csv
CREATE (t:Tweet)
SET t = csv;

Then, we will proceed to do some cleanup:

1. Remove some of the redundant Tweets

Here's how we use DETACH DELETE to weed out some of the tweets that will not yield interesting dadjokes:

MATCH (t:Tweet)
WHERE t.Text STARTS WITH "http"
    OR t.Text STARTS WITH "RT"
    OR t.Text STARTS WITH "@"
    DETACH DELETE t;

2. Connect the Tweets to the Handles

The Tweet nodes will be connected to their Handle nodes, using the MERGE command:

MATCH (t:Tweet)
MERGE (h:Handle {name: t.`Screen Name`})
CREATE (t)<-[:TWEETS {datetime: datetime(t.UTC)}]-(h);

Now comes the interesting part - creating Dadjokes out of the Tweets.


From tweet to dadjoke!

Create Jokes out of Tweets

The MERGE statement will make sure that identical jokes don't get created twice, thereby already deduplicating some jokes.

MATCH (t:Tweet)
MERGE (dj:Dadjoke {Text: t.Text})
CREATE (t)-[:REFERENCES_DADJOKE]->(dj);
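The deduplicating effect of that MERGE can be mimicked outside the database - a minimal Python sketch, using made-up tweet texts rather than the real dataset:

```python
# Tweets with identical text collapse onto a single dadjoke entry,
# just like MERGE on {Text: t.Text} collapses them onto one Dadjoke node.
tweets = [
    {"handle": "dadsaysjokes", "text": "I used to hate facial hair, but then it grew on me."},
    {"handle": "baddadjokes", "text": "I used to hate facial hair, but then it grew on me."},
    {"handle": "dadsjokes", "text": "What do you call a fish with no eyes? Fsh."},
]

dadjokes = {}  # joke text -> handles of the tweets referencing it
for t in tweets:
    dadjokes.setdefault(t["text"], []).append(t["handle"])

print(len(tweets), "tweets ->", len(dadjokes), "dadjokes")  # → 3 tweets -> 2 dadjokes
```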

We went from 18185 Tweets in the database, to 14313 Dadjokes.

We can see that by running this query:

MATCH (t1:Tweet), (t2:Tweet)
WHERE t1 <> t2
AND t1.Text = t2.Text
RETURN t1.`Screen Name`, t2.`Screen Name`, t1.Text;

So there is clearly a disambiguation task at hand here! Let's pursue it! But before we do that, we should also think about some housekeeping.

Aggregate the Favorite/Retweet scores of the tweets into the jokes

MATCH (t:Tweet)--(dj:Dadjoke)
WITH dj, sum(toInteger(t.Favorites)) as sumoffavorites, sum(toInteger(t.Retweets)) as sumofretweets
SET dj.SumOfFavorites = sumoffavorites
SET dj.SumOfRetweets = sumofretweets;
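The aggregation above can be sketched in plain Python, with hypothetical Favorite/Retweet counts - each Dadjoke simply sums the stats of all the Tweets that reference it:

```python
from collections import defaultdict

# (joke text, favorites, retweets) per tweet - the numbers are made up
tweets = [
    ("Why did the scarecrow win an award? He was outstanding in his field.", 10, 3),
    ("Why did the scarecrow win an award? He was outstanding in his field.", 7, 2),
    ("I only know 25 letters of the alphabet. I don't know y.", 5, 1),
]

sums = defaultdict(lambda: [0, 0])  # joke text -> [SumOfFavorites, SumOfRetweets]
for text, favorites, retweets in tweets:
    sums[text][0] += favorites
    sums[text][1] += retweets

print(sums["Why did the scarecrow win an award? He was outstanding in his field."])  # → [17, 5]
```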

Part 3/6: Taking on the real disambiguation of the jokes

Look at the "Amazon" example:

MATCH path = (dj:Dadjoke)-[*..2]-(conn)
WHERE dj.Text CONTAINS "pyjamazon"
    RETURN path;

Disambiguation based on String Metrics

The first strategy is all about comparing the Text property on the Dadjoke nodes. It turns out there are some really interesting algorithms that calculate "string metrics": they compare two different strings with one another, and express how similar they are as a numerical score. I found this guide really useful for understanding it a bit better.

Comparing jokes using Levenshtein and Sørensen-Dice similarities

We will use two similarity coefficients: the Levenshtein similarity and the Sørensen-Dice similarity. So let's calculate these using apoc.text.levenshteinSimilarity and apoc.text.sorensenDiceSimilarity. Both of these will yield a value between 0 and 1 - where 0 is very low similarity, and 1 is where the strings are identical.

Let's take a look:

MATCH (dj1:Dadjoke), (dj2:Dadjoke)
  WHERE id(dj1)<id(dj2)
  AND dj1.Text <> dj2.Text
  AND left(dj1.Text,30) = left(dj2.Text,30)
WITH dj1.Text AS dj1text, dj2.Text AS dj2text, apoc.text.levenshteinSimilarity(dj1.Text, dj2.Text) AS LevenshteinSimilarity,
apoc.text.sorensenDiceSimilarity(dj1.Text, dj2.Text) AS SorensenDiceSimilarity
  WHERE LevenshteinSimilarity < 0.65
RETURN left(dj1text,60) as `First 60 chars of dadjoke1`,left(dj2text,60) as `First 60 chars of dadjoke2`,LevenshteinSimilarity,SorensenDiceSimilarity
ORDER BY LevenshteinSimilarity DESC;
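Both metrics are easy to reproduce in plain Python. Here is a minimal sketch of the two similarity scores, shown on the classic "kitten"/"sitting" and "night"/"nacht" textbook examples (APOC's implementations may differ in details, e.g. whether bigrams are counted as a set or a multiset):

```python
def levenshtein_similarity(a: str, b: str) -> float:
    # Dynamic-programming edit distance, normalised to [0, 1].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    longest = max(len(a), len(b)) or 1
    return (longest - prev[-1]) / longest

def sorensen_dice_similarity(a: str, b: str) -> float:
    # Overlap of character bigrams: 2 * |A ∩ B| / (|A| + |B|).
    bigrams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    ba, bb = bigrams(a), bigrams(b)
    if not ba and not bb:
        return 1.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

print(round(levenshtein_similarity("kitten", "sitting"), 3))  # → 0.571
print(round(sorensen_dice_similarity("night", "nacht"), 2))   # → 0.25
```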

Then we can decide, by looking at the data, where we should be cutting off the similarity scores. When I looked at this, I judged that if the similarity was greater than 0.7, the two texts of the Dadjoke nodes were likely to be the same. Lower than that and the texts started to diverge quite significantly. So in the next phase I will be merging the jokes that have similarity scores above my cutoff point.

Merge the nodes that have a Levenshtein similarity > 0.7

In case the Levenshtein similarity is larger than 0.7, we would want the Dadjoke nodes to be merged (and the links to the Tweets to be moved as well). That would be an interesting use case for one of the graph refactoring procedures in apoc: apoc.refactor.mergeNodes will do that trick.

:auto MATCH (dj1:Dadjoke), (dj2:Dadjoke)
WHERE id(dj1)<id(dj2)
AND dj1.Text <> dj2.Text
AND left(dj1.Text,20) = left(dj2.Text,20)
WITH [dj1,dj2] AS nodes, apoc.text.levenshteinSimilarity(dj1.Text, dj2.Text) AS LevenshteinSimilarity
WHERE LevenshteinSimilarity > 0.7
CALL {
    WITH nodes
    CALL apoc.refactor.mergeNodes(nodes, {properties: {Text: "discard", SumOfFavorites: "combine", SumOfRetweets: "combine"}, mergeRels: true}) YIELD node
    RETURN count(node) AS mergednodes
    } IN TRANSACTIONS OF 1 ROWS
RETURN sum(mergednodes) AS countofmergednodes;

Note that we have to make these changes one at a time - processing them in larger batches could otherwise lead to deadlocks.
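The property handling of apoc.refactor.mergeNodes can be sketched as a plain dictionary merge. This is a simplified assumption of its semantics, with hypothetical joke data: "discard" keeps the first node's value, while "combine" collects differing values into a list - which is why the Favorite/Retweet sums need to be re-aggregated afterwards:

```python
def merge_nodes(n1: dict, n2: dict, properties: dict) -> dict:
    # Simplified sketch of apoc.refactor.mergeNodes property strategies:
    # "discard" keeps the first node's value, "combine" gathers both values.
    merged = {}
    for key, strategy in properties.items():
        v1, v2 = n1.get(key), n2.get(key)
        if strategy == "discard":
            merged[key] = v1 if v1 is not None else v2
        elif strategy == "combine":
            merged[key] = v1 if v1 == v2 else [v1, v2]
    return merged

dj1 = {"Text": "What do you call a fake noodle? An impasta.", "SumOfFavorites": 10}
dj2 = {"Text": "What do you call a fake noodle? An impasta!", "SumOfFavorites": 7}
merged = merge_nodes(dj1, dj2, {"Text": "discard", "SumOfFavorites": "combine"})
print(merged["Text"], merged["SumOfFavorites"])
```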

Before ending this section, we do need to re-aggregate the Tweet stats onto the Dadjoke nodes. Same procedure as before:

Sum up the Retweet/Favorite stats

MATCH (t:Tweet)--(dj:Dadjoke)
WITH dj, sum(toInteger(t.Favorites)) AS sumoffavorites, sum(toInteger(t.Retweets)) AS sumofretweets
SET dj.SumOfFavorites = sumoffavorites
SET dj.SumOfRetweets = sumofretweets;

That wraps up the disambiguation based on string metrics. Now, let's get to the true power of the graph, by using NLP and Entity Extraction first, and graph analytics second.


Part 4/6: Adding NLP and Entity Extraction to prepare for further disambiguation

Now we will take it a final step further that will provide a wonderfully powerful example of how and why graphs are so good at analysing the structural characteristics of data, and make interesting and amazing recommendations on the back of that.

Here's what we are going to do:

  1. we are going to use Natural Language Processing to extract the entities that are mentioned in our Dadjokes. To do that, we are going to use the amazing Google Cloud NLP Service, and call it from APOC. This will yield a connected structure that will tell us exactly which entities are mentioned in every joke.
  2. then we are going to use that graph of dadjokes connected to entities to figure out if the structure of the links can help us with further disambiguation of the jokes.

So let's start with the start.

1. NLP on the Dadjokes

I have actually experimented with this technique before - and found it to be extremely powerful. See the posts that I wrote over here and over here if you want to see other examples. Using the Natural Language Processing (NLP) procedures in the apoc libraries, we can now call the NLP services of GCP, AWS and Azure really easily, and use them to automatically extract the entities from the text of the Dadjokes.

So, after having installed the required NLP .jar file in the Plugin directory of the Neo4j server, we can start analysing the descriptions using the Google Cloud NLP service. Here's how that works:

:param apiKey =>("SOMEAPIKEY");

:auto MATCH (dj:Dadjoke)
WHERE NOT((dj)-[:HAS_ENTITY]->())
WITH dj
CALL {
    WITH dj
    CALL apoc.nlp.gcp.entities.graph(dj, {
        key: $apiKey,
        nodeProperty: "Text",
        scoreCutoff: 0.01,
        writeRelationshipType: "HAS_ENTITY",
        writeRelationshipProperty: "gcpEntityScore",
        write: true
        })
    YIELD graph AS g
    RETURN g
    } IN TRANSACTIONS OF 10 ROWS
RETURN "Success!";

Two more comments on the NLP topic:

  1. There seem to be some types of text that the Google NLP engine struggles with, and cannot extract entities from. Look at this query:
MATCH (dj:Dadjoke)
WHERE NOT((dj)-[:HAS_ENTITY]->())
RETURN dj.Text

and notice that the results have a lot of plays on words related to numbers. There could be ways to try to fix that, for example by replacing numbers with words - but I did not go into that.

  2. If you don't have the time, or the $$$, to run the NLP on all these nodes, you could also just do the NLP on a limited subset of nodes:
:auto MATCH (dj:Dadjoke)
WHERE dj.Text CONTAINS "amazon"
WITH dj
CALL {
    WITH dj
    CALL apoc.nlp.gcp.entities.graph(dj, {
        key: $apiKey,
        nodeProperty: "Text",
        scoreCutoff: 0.01,
        writeRelationshipType: "HAS_ENTITY",
        writeRelationshipProperty: "gcpEntityScore",
        write: true
        })
    YIELD graph AS g
    RETURN g
    } IN TRANSACTIONS OF 10 ROWS
RETURN "Success!";

That really only has a handful of nodes to process, so it returns quite quickly, and the financial cost would be negligible.

Indexing the Entities that we extracted

CREATE INDEX entity_index FOR (e:Entity) ON (e.text);
CREATE INDEX entity_rel_index FOR ()-[he:HAS_ENTITY]-() ON (he.gcpEntityScore);

Refactoring after NLP-based entity extraction: consolidating the Entity nodes with different capitalisation

After running the NLP procedures, we quickly notice that there is some duplication in the Entity nodes that have been detected. Specifically, the capitalisation of the Entity nodes can be quite confusing and we should take some care to resolve this - which is very easily done.

:auto MATCH (e1:Entity), (e2:Entity)
WHERE id(e1) < id(e2)
AND toLower(e1.text) = toLower(e2.text)
CALL {
  WITH e1, e2
  MATCH (e2)<-[:HAS_ENTITY]-(dj:Dadjoke)
  CREATE (dj)-[:HAS_ENTITY]->(e1)
  WITH e1, e2
  CALL apoc.create.setLabels(e1,labels(e2))
    YIELD node
  SET e1.text = toLower(e1.text)
  DETACH DELETE e2
} IN TRANSACTIONS OF 10 ROWS
RETURN "Success!";
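The consolidation boils down to grouping Entity nodes case-insensitively; a small Python sketch with hypothetical entity names:

```python
from collections import defaultdict

entities = ["Amazon", "amazon", "AMAZON", "Bed", "bed", "Pyjamas"]

groups = defaultdict(int)  # lowercased entity text -> number of variants merged
for e in entities:
    groups[e.lower()] += 1  # every capitalisation collapses onto one entity

print(sorted(groups))  # → ['amazon', 'bed', 'pyjamas']
```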

We can now quickly revisit a specific part of the graph that we explored before:

MATCH path = (e:Entity)--(dj:Dadjoke)-[:REFERENCES_DADJOKE]-(t:Tweet)--(h:Handle)
WHERE dj.Text CONTAINS "amazon"
    RETURN path;

Note that we now see some notable differences:

  • 2 dadjokes out of 3 mention Jeff's pyjamas, and 1 of the dadjokes mentions pajamas
  • 2 dadjokes out of 3 mention Jeff Bezos, and 1 of the dadjokes mentions eff Bezos
  • there is no overlap between those sets of 2 dadjokes
  • therefore, all they have in common is the Bed entity

This will be of importance when we calculate our next set of similarity scores.


Part 5/6: Disambiguation using Graph Data Science on the NLP-based Entities

Calculating the Jaccard similarity metric using GDS

To explore this, we will be using the Jaccard similarity coefficient, which is part of the Neo4j Graph Data Science library that we have installed on our server. More about this coefficient can be found on Wikipedia.
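The coefficient itself is simple: for two jokes, it divides the number of entities they share by the number of entities they mention in total, |A ∩ B| / |A ∪ B|. A minimal Python sketch, with hypothetical entity sets inspired by the pyjamazon example:

```python
def jaccard(a: set, b: set) -> float:
    # Jaccard similarity: size of the intersection over size of the union.
    return len(a & b) / len(a | b) if a | b else 0.0

joke1 = {"jeff bezos", "pyjamas", "bed"}
joke2 = {"jeff bezos", "pajamas", "bed"}
print(round(jaccard(joke1, joke2), 2))  # → 0.5
```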

:param limit => ( 500);
:param graphConfig => ({
  nodeProjection: '*',
  relationshipProjection: {
    relType: {
      type: 'HAS_ENTITY',
      orientation: 'NATURAL',
      properties: {}
    }
  }
});
:param config => ({
  similarityMetric: 'Jaccard',
  similarityCutoff: 0,
  degreeCutoff: 2,
  writeProperty: 'score',
  writeRelationshipType: 'JACCARD_SIMILAR'
});
:param communityNodeLimit => ( 10);
:param generatedName => ('in-memory-graph-1663777188212');

CALL gds.graph.project($generatedName, $graphConfig.nodeProjection, $graphConfig.relationshipProjection, {});

CALL gds.nodeSimilarity.write($generatedName, $config);

MATCH (from)-[rel:`JACCARD_SIMILAR`]-(to)
WHERE exists(rel.`score`)
RETURN from, to, rel.`score` AS similarity
ORDER BY similarity DESC
LIMIT toInteger($limit);

CALL gds.graph.drop($generatedName);

This will generate a set of JACCARD_SIMILAR relationships between Dadjoke nodes, each carrying a score weight property that indicates how similar the two jokes are. We can easily add an index on that relationship property:

CREATE INDEX jaccard_similarity_index FOR ()-[s:JACCARD_SIMILAR]-() ON (s.score);

We can explore this similarity with a few queries.

MATCH p=(h:Handle)--(t:Tweet)--(d:Dadjoke)-[r:JACCARD_SIMILAR]->()
WHERE d.Text CONTAINS "pyjamazon"
RETURN p;

Using Jaccard similarity for disambiguation

Returning to our objective of disambiguating the jokes: there seem to be hundreds of additional duplicates that we could eliminate using the Jaccard metric.

MATCH (d1:Dadjoke)-[r:JACCARD_SIMILAR]->(d2:Dadjoke) 
WHERE r.score >0.9
and id(d1)>id(d2)
AND d1.Text contains "Doctor"
RETURN d1.Text, d2.Text, r.score;

Turns out there are about 844 of these types of eliminations that we could do.

MATCH p=()-[r:JACCARD_SIMILAR]->() 
WHERE r.score >0.9
WITH count(p) AS count
RETURN count; 

We could then of course also actually perform the disambiguation now and remove the duplicate dadjokes based on the JACCARD_SIMILAR score.


Part 6/6: Closing: some cool Dadjoke Queries

Now that we have a disambiguated graph of dadjokes, let's have some fun and explore it.

How many times does a joke get tweeted?

MATCH ()-[r:REFERENCES_DADJOKE]->(dj:Dadjoke)
WITH dj.Text AS Joke, count(r) AS NrOfTimesTweeted
RETURN Joke, NrOfTimesTweeted
ORDER BY NrOfTimesTweeted DESC
LIMIT 10;

How many times does a joke get favorited?

MATCH ()-[r:REFERENCES_DADJOKE]->(dj:Dadjoke)
RETURN dj.Text AS Joke, dj.SumOfFavorites AS NrOfTimesFavorited, dj.SumOfRetweets AS NrOfTimesRetweeted
ORDER BY NrOfTimesFavorited DESC
LIMIT 10;

Different ways of finding jokes about cars

Let's explore 3 alternative ways to find jokes about cars.

1. Matching the text of the Dadjoke for the word "car"

MATCH (dj:Dadjoke) WHERE dj.Text CONTAINS "car" RETURN dj.Text LIMIT 10;

2. Checking if the Entity contains the word "car"

MATCH (e:Entity)--(dj:Dadjoke) WHERE e.text CONTAINS "car" RETURN dj.Text LIMIT 10;

3. Checking if the Entity equals the word "car"

MATCH (e:Entity)--(dj:Dadjoke) WHERE e.text = "car" RETURN dj.Text LIMIT 10;

Finding jokes about cars and wives - and spaghetti

This was another great example:

MATCH p=(h:Handle)--(t:Tweet)--(dj:Dadjoke)-[r:JACCARD_SIMILAR]->() 
WHERE dj.Text CONTAINS "spaghetti" 
    AND (dj.Text CONTAINS "bike" OR dj.Text CONTAINS "car")
    RETURN p;

It's amazing to see how the same conceptual joke is being reused in different ways!

Finding jokes about a list of entities

WITH ["car","spaghetti","water","boil","hell"] AS entities
MATCH p = (h:Handle)--(t:Tweet)--(dj:Dadjoke)--(e:Entity)
WHERE e.text IN entities
RETURN p;

Finding jokes about hell and water

MATCH (h:Handle)--(t:Tweet)--(d:Dadjoke) 
WHERE d.Text contains "hell" 
  AND d.Text CONTAINS "water" 
RETURN d;

Some interesting structural characteristics about the #dadjoke twitterspace

How many jokes are tweeted identically by different tweeters

MATCH path = (h1:Handle)-[*2..2]->(dj:Dadjoke)<-[*2..2]-(h2:Handle)
WHERE id(h1)<id(h2)
RETURN path;

This takes a while to load, but you can clearly see a few cliques in this picture.

Let's see how many such paths are actually there:

MATCH path = (h1:Handle)-[*2..2]-(dj:Dadjoke)-[*2..2]-(h2:Handle)
WHERE id(h1)<id(h2)
WITH h1.name AS FirstHandle, h2.name AS SecondHandle, count(path) AS NrOfSharedJokes
RETURN FirstHandle, SecondHandle,NrOfSharedJokes
ORDER BY NrOfSharedJokes DESC;

The result is quite enlightening: GroanBot and RandomJokesIO are clearly reinforcing one another. My personal guess is that they are truly just bots.
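The path count per pair of handles is essentially the size of the overlap between their joke sets (assuming each handle tweets a given joke once), which can be sketched in Python with made-up handles and jokes:

```python
from itertools import combinations

# handle -> set of dadjokes it tweeted (made-up data)
tweeted = {
    "GroanBot": {"joke-a", "joke-b", "joke-c"},
    "RandomJokesIO": {"joke-b", "joke-c", "joke-d"},
    "dadsaysjokes": {"joke-d"},
}

# count the shared jokes for every pair of handles
shared = {
    (h1, h2): len(tweeted[h1] & tweeted[h2])
    for h1, h2 in combinations(sorted(tweeted), 2)
}

for pair, n in sorted(shared.items(), key=lambda kv: -kv[1]):
    print(pair, n)
```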

What are the most frequent entities

We already have the Favorite/Retweet scores of all the dadjokes summed up, so we can also look at which Entity nodes have the highest scores that way:

MATCH (e:Entity)--(dj:Dadjoke)
WITH e, sum(toInteger(dj.SumOfFavorites)) AS sumofsumoffavorites, sum(toInteger(dj.SumOfRetweets)) AS sumofsumofretweets
SET e.SumOfSumOfFavorites = sumofsumoffavorites
SET e.SumOfSumOfRetweets = sumofsumofretweets;

This operation finishes very quickly, and so then we can do the exploration quite easily, and figure out what the entities are that our dadjokers are mostly joking about:

MATCH (e:Entity)
RETURN e.text, e.SumOfSumOfFavorites AS EntityFavoriteScore, e.SumOfSumOfRetweets AS EntityRetweetScore
ORDER BY EntityFavoriteScore DESC
LIMIT 10;

Surprise: it's about wives and bosses. Right!


Wrapping up

You can find the full blogpost of this guide on this page and the full markdown file over here.

Cheers

Rik

A Graph Database and a Dadjoke walk into a bar...

Part 1/6: Building a Dadjoke database - from nothing

I am a dad. Happily married and 3 wonderful kids of 13, 17 and 19 years old. So all teenagers - with all the positive and not so positive experiences that can come with that. And in our family, I have been accused of using dadjokes - preferably at very awkward or inconvenient times. I actually like dadjokes. A lot.

So I follow a few accounts on social media that post these jokes. Like for example baddadjokes, dadjokeman, dadsaysjokes, dadsjokes, groanbot, punsandoneliner, randomjokesio, and thepunnyworld and there are many others. These are all in this list, should you be interest. It's a very funny list.

Dadjokers List on Twitter

Working at Neo4j, I have always been looking at different datasets that would be a great way to demonstrate the wonderful power of graphs in a fun and engaging way. Over the years, this has pulled me into a variety of different ratholes - and I have always loved that and learned so much by doing that. So... here we go again: I have been looking at different ways of building a dadjokes database. Wouldn't that be a ton of fun? Not just because of the jokes, but also as a means to learn more about the twitterspace, the topics of the jokes, the frequency and similarities of these jokes. Etc Etc. So: like always, I figured why not just give it a try. Here goes.

Here are the techniques that I tried to actually build the database:

  • scraping the data off of different websites - very cumbersome and really I gave up and did not even try very hard.
  • exporting stuff from instagram with instaloader and OCR-ing it with Tesseract. Was actually pleasantly surprised with how easy this was - but still felt quite clunky. While the OCR engine would get a LOT right, without any tuning, it still got a lot of stuff wrong. So I figured there had to be a better way.
  • using the Twitter API: if you have a Twitter Developer Account, you can get access to the API and pull information from there. For example like this:

Twitter API integration

:param listofaccounts => [["baddadjokes",268948349],["dadsaysjokes",905028905026846720],["dadsjokes",106358414],["ThePunnyWorld",2472108787],["DadJokeMan",766659901242413057],["GroanBot",1065122605483474944],["RandomJokesIO",954397214004031489]];

:param twitterbearer => "<somerandomstringsthatserveasbearer>";

UNWIND $listofaccounts AS account
CALL apoc.load.jsonParams(
  "https://api.twitter.com/2/users/" + account[1] + "/tweets",
  {Authorization: "Bearer "+$twitterbearer, exclude: "retweets,replies"},
  null
)
YIELD value
UNWIND value.data AS status
CREATE (t:Tweet {id: status.id, handle: account[0], text: status.text});
  • a set of tweets that I have started to gather using IFTTT, which adds tweets to a spreadsheet. This will take a long time to build up but should make for a fun follow up.

IFTTT sheet

IFTTT Tweets by Dadjokers Google Sheet

  • and FINALLY, LAST BUT NOT LEAST: using vicinitas.io: This allows you to dowload a really good chunk of tweets from specific Twitter handles - but it is rate-limited to 3200 tweets per account, for the free version. For what I want to do here, though, that is plenty. Downloaded it for the accounts above, and put everything together into one gsheet.

Vicinitas Google Sheet

Now, you can either

  • created a .csv version of it (example over here) and put that in the ./import directory of your Neo4j server,
  • or download the .csv version of the Google Sheet straight from the export URL.

Now that we have a nice little dataset to play with, let's proceed to importing that into our database.


A Graph Database and a Dadjoke walk into a bar...

Part 2/6: Importing the Dadjokes into the Dadjoke Graph

This means that we want to convert the spreadsheet that we created before, or the .csv version of it, into a Neo4j Database.

Here's how we go about this. First things first: let's set up the indexes that we will need later on in this process:

CREATE INDEX tweet_index FOR (t:Tweet) ON t.Text;
CREATE INDEX dadjoke_index for (d:Dadjoke) ON d.Text;

Assuming the .csv file mentioned above is in the import directory of the Neo4j server, we can use load csv to create the initial dataset:

LOAD CSV WITH HEADERS FROM "file:/vicinitas_alldadjokes_user_tweets.csv" AS csv
CREATE (t:Tweet)
SET t = csv;

Import the Vicinitas .csv file

Or: if you want to create the graph straight from the Google Sheet:

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1MwHX5hM-Vda5o4ZQVnCv4upKepL5rHxcUrqfT69u5Ro/export?format=csv&gid=1582640786" AS csv
CREATE (t:Tweet)
SET t = csv;

Then, we will proceed to

  1. discard some of the obviously useless tweets that will not give us any good dadjokes, ie. the ones that
    • just contain a link,
    • are retweets of previous tweets (starting with the RT marker)
    • are just tagging someone else (starting with the @ marker)
  2. extract the Twitter handles (aka screen names) from the Tweets, and put them into seperate Handle nodes.

This is how we do both these things:

1. Remove some of the redundant Tweets

Here's how we use DETACH DELETE to weed out some of the tweets that will not yield interesting dadjokes:

MATCH (t:Tweet)
WHERE t.Text STARTS WITH "http"
    OR t.Text STARTS WITH "RT"
    OR t.Text STARTS WITH "@"
    DETACH DELETE t;

Remove redundant Tweets

2. Connect the Tweets to the Handles

The Tweet nodes will be connected to their Handle nodes, using the MERGE command:

MATCH (t:Tweet)
MERGE (h:Handle {name: t.`Screen Name`})
CREATE (t)<-[:TWEETS {datetime: datetime(t.UTC)}]-(h);

Now comes the interesting part - creating Dadjokes out of the Tweets.

Extract the Handles from the Tweets

Now we have the database full of Tweets and Handles, let's proceed to make that into a true Dadjoke database.


From tweet to dadjoke!

In the next few sections, we are going to explore how we can convert the Tweet nodes that we currently have in the database into a set of Dadjokes that we can work with later on. The complexity here is, obviously, that there will be quite a few tweets that will reference the same dadjokes, and we would like to actually make sure that we disambiguate and deduplicate these jokes. That will make for an interesting journey, and this is where Neo4j will clearly show its value.

Create Jokes out of Tweets

The MERGE statement will make sure that identical jokes don't get created twice, thereby already deduplicating some jokes.

MATCH (t:Tweet)
MERGE (dj:Dadjoke {Text: t.Text})
CREATE (t)-[:REFERENCES_DADJOKE]->(dj);

Extracting the dadjokes from the tweets

This is already interesting: we went from 18185 Tweets in the database, to 14313 Dadjokes - meaning that we have 3872 Tweets that had exactly the same text, and therefore the same Dadjoke. That alone is quite crazy.

Counts of Tweets, Handles, Dadjokes

We can see that by running this query:

MATCH (t1:Tweet), (t2:Tweet)
WHERE t1 <> t2
AND t1.Text = t2.Text
RETURN t1.`Screen Name`, t2.`Screen Name`, t1.Text;

2 Handles, 1 dadjoke

So there is clearly a disambiguation task at hand here! Let's pursue it! But before we do that, we should also think about some housekeeping, and aggregate the Favorite/Retweet scores of the tweets into the dadjokes.

Aggregate the Favorite/Retweet scores of the tweets into the jokes

As we will have created some jokes now that have more than 1 tweet referencing it, it is important to aggregate the Favorites and Retweets into the Dadjoke nodes, so that we can still have an understanding of the popularity of the joke (based on the popularity of the referring tweets).

MATCH (t:Tweet)--(dj:Dadjoke)
WITH dj, sum(toInteger(t.Favorites)) as sumoffavorites, sum(toInteger(t.Retweets)) as sumofretweets
SET dj.SumOfFavorites = sumoffavorites
SET dj.SumOfRetweets = sumofretweets;

Aggregating Favorites/Retweets for Dadjokes

We will come back to this later on as well as we further disambiguate the jokes (and will need to sum it all up again).


A Graph Database and a Dadjoke walk into a bar...

Part 3/6: Taking on the real disambiguation of the jokes

We noticed that many of the Tweet nodes referred to the same jokes - and resolved that already above. But this query makes us understand that we actually still have some work to do:

MATCH path = (dj:Dadjoke)-[*..2]-(conn)
WHERE dj.Text CONTAINS "pyjamazon"
    RETURN path;

The Amazon Dadjokes

We will come back to that example below.

We now notice that there are quite a few Dadjoke nodes that are a bit different, but very similar. We would like to disambiguate these too. We will use a couple of different strategies for this, but start with a strategy that is based on String Metrics.

Disambiguation based on String Metrics

The first strategy is all about comparing the Text property on the Dadjoke nodes. We know they are not identical (see the MERGE that we did previously - that would have captured the identical ones!), but we actually know that some of these jokes are not the same but similar. So how to proceed with that? Well, turns out there are a lot of really interesting algorithms that calculate "string metrics" that compare two different strings with one another, and express how similar they are in a numerical score. I found this guide really useful to understand it a bit better. The article explains that, like with so many things, there are different ways of assessing that similarity, and none of them are per definition perfect. They all make trade-offs in terms of efficacy and accuracy, and largely complement each other instead of competing with one another. Let's explore a few of these metrics, as they are readily available in our apoc library.

Comparing jokes using Levenshtein and Sørensen-Dice similarities

The first interesting metric that we will use is based on the Levenshtein distance and use that to calculate how similar the strings are. The Levenshtein distance has some simple lower/upper bounds:

  • It is at least the difference of the sizes of the two strings.
  • It is at most the length of the longer string.
  • It is zero if and only if the strings are equal. We will use that distance to create a similarity indicator (between 0 and 1) that is calculated as

(length of longest string - Levenshtein distance between the strings) / length of the longest string. which will yield a value between 0 and 1.

The second interesting metric that we will use is based on the Sørensen-dice coefficient, which also yields a score between 0 and 1.

So let's calculate this using (apoc.text.levenshteinsimilarity)[https://neo4j.com/labs/apoc/4.3/overview/apoc.text/apoc.text.levenshteinSimilarity/] and (apoc.text.sorencenDiceSimilarity)[https://neo4j.com/labs/apoc/4.3/overview/apoc.text/apoc.text.sorensenDiceSimilarity/]. Both of these will yield a value between 0 and 1 - where 0 is very low similarity, and 1 is where the strings are identical.

First let's see what we find, and figure out what would be a useful cutoff point for our dataset:

MATCH (dj1:Dadjoke), (dj2:Dadjoke)
  WHERE id(dj1)<id(dj2)
  AND dj1.Text <> dj2.Text
  AND left(dj1.Text,30) = left(dj2.Text,30)
WITH dj1.Text AS dj1text, dj2.Text AS dj2text, apoc.text.levenshteinSimilarity(dj1.Text, dj2.Text) AS LevenshteinSimilarity,
apoc.text.sorensenDiceSimilarity(dj1.Text, dj2.Text) AS SorensenDiceSimilarity
  WHERE LevenshteinSimilarity < 0.65
RETURN left(dj1text,60) as `First 60 chars of dadjoke1`,left(dj2text,60) as `First 60 chars of dadjoke2`,LevenshteinSimilarity,SorensenDiceSimilarity
ORDER BY LevenshteinSimilarity DESC;

String Metrix Levenshtein- and SorensenDice-similarities

Note that I made a conscious decision to compare Dadjokes that

  • had identical first 30 characters
  • had different "complete" text strings We then calculated the string metrics on the entirity of the Dadjoke text, and got the above results.

Then we can decide, by looking at the data, where we should be cutting off the similarity scores. When I looked at this, I judged that if the similarity was greater than 0.7, the two texts of the Dadjoke nodes were likely to be the same. Lower than that and the texts started to diverge quite significantly. So in the next phase I will be merging the jokes that have similarity scores above my cutoff point.

Merge the nodes that have a Levenshtein similarity > 0.7

In case the Levenshtein similarity is larger than 0.7, we would want the Dadjoke nodes to be merged (and the links to the Tweets to be moved as well). That would be an interesting use case for one of the graph refactoring procedures in apoc: apoc.refactor.mergeNodes will do that trick.

:auto MATCH (dj1:Dadjoke), (dj2:Dadjoke)
WHERE id(dj1)<id(dj2)
AND dj1.Text <> dj2.Text
AND left(dj1.Text,20) = left(dj2.Text,20)
WITH [dj1,dj2] AS nodes, apoc.text.levenshteinSimilarity(dj1.Text, dj2.Text) AS LevenshteinSimilarity
WHERE LevenshteinSimilarity > 0.7
CALL {
    WITH nodes
    CALL apoc.refactor.mergeNodes(nodes, {properties: {Text: "discard", SumOfFavorites: "combine", SumOfRetweets: "combine"}, mergeRels:True}) YIELD node
    RETURN count(node) AS mergednodes
    } IN TRANSACTIONS OF 1 ROWS
RETURN sum(mergednodes) as countofmergedbodes;

Refactor the nodes based on LevenShtein similarity

Note that we have to make these changes one by one, based on the fact that it could otherwise lead to a deadlock situation.

Before ending this section, we do need to re-aggregate the Tweet stats onto the Dadjoke nodes. Same procedure as before:

Sum up the Retweet/Favorite stats

MATCH (t:Tweet)--(dj:Dadjoke)
WITH dj, sum(toInteger(t.Favorites)) AS sumoffavorites, sum(toInteger(t.Retweets)) AS sumofretweets
SET dj.SumOfFavorites = sumoffavorites
SET dj.SumOfRetweets = sumofretweets;

That wraps up the disambiguation based on string metrics. Now, let's get to the true power of the graph, but using NLP and Entity Extraction first, and graph analytics second.


A Graph Database and a Dadjoke walk into a bar...

Part 4/6: Adding NLP and Entity Extraction to prepare for further disambiguation

As we can see in the pyjamazon example from before, the disambiguation of our Dadjokes has come a long way - but it is not yet complete. Hence we call the graph to the rescue here, and take it a final step further that will provide a wonderfully powerful example of how and why graphs are so good at analysing the structural characteristics of data, and at making interesting and amazing recommendations on the back of that.

Here's what we are going to do:

  1. we are going to use Natural Language Processing to extract the entities that are mentioned in our Dadjokes. To do that, we are going to use the amazing Google Cloud NLP Service, and call it from APOC. This will yield a connected structure that will tell us exactly which entities are mentioned in every joke.
  2. then we are going to use that graph of dadjokes connected to entities to figure out if the structure of the links can help us with further disambiguation of the jokes.

So let's start at the start.

1. NLP on the Dadjokes

I have actually experimented with this technique before - and found it to be extremely powerful. See the posts that I wrote over here and over here if you want other examples. Using the Natural Language Processing (NLP) procedures in the apoc libraries, we can call the NLP services of GCP, AWS and Azure really easily, and use them to automatically extract the entities from the text of the Dadjokes.

So, after having installed the required NLP .jar file in the Plugin directory of the Neo4j server, we can start analysing the descriptions using the Google Cloud NLP service. Here's how that works:

:param apiKey => "XYZ-XYZABCXYZABC";

:auto 
MATCH (dj:Dadjoke)
WHERE NOT((dj)-[:HAS_ENTITY]->())
WITH dj
CALL {
    WITH dj
    CALL apoc.nlp.gcp.entities.graph(dj, {
        key: $apiKey,
        nodeProperty: "Text",
        scoreCutoff: 0.01,
        writeRelationshipType: "HAS_ENTITY",
        writeRelationshipProperty: "gcpEntityScore",
        write: true
        })
    YIELD graph AS g
    RETURN g
    } IN TRANSACTIONS OF 10 ROWS
RETURN "Success!";

Note that the apiKey is of course to be replaced with your own personal key.

Run NLP using GCP

Note that we do these NLP operations in transactions of 10 rows: the apoc NLP procedure chunks things into groups of 25, so to reduce the number of (expensive) calls to the GCP services, it's a good idea to use multiples of 25 for the throttling of the calls.

Note that I actually added the

WHERE NOT ((dj)-[:HAS_ENTITY]->())
WITH dj

part because I have had some issues with timeouts recently, where apoc.nlp.gcp.* would just take too long to return and the connection would get reset. I would then have to relaunch the operation, but only for the Dadjoke nodes that had not been processed yet, i.e. that did not have Entity nodes connected to them yet. The result then looks something like this for every Dadjoke:

Tweet, Dadjoke and Entities

Two more comments on the NLP topic:

  1. There seem to be some types of text that the Google NLP engine was struggling with, and could not extract entities from. Look at this query:
MATCH (dj:Dadjoke)
WHERE NOT((dj)-[:HAS_ENTITY]->())
RETURN dj.Text

and notice that the results have a lot of plays on words related to numbers:

Dadjokes that GCP NLP struggles with

There could be ways to try to fix that, for example by replacing numbers with words - but I did not go into that.

  2. If you don't have the time, or the $$$, to run the NLP on all these nodes, you could also just do the NLP on a limited subset of nodes:
:auto
MATCH (dj:Dadjoke)
WHERE dj.Text CONTAINS "amazon"
WITH dj
CALL {
    WITH dj
    CALL apoc.nlp.gcp.entities.graph(dj, {
        key: $apiKey,
        nodeProperty: "Text",
        scoreCutoff: 0.01,
        writeRelationshipType: "HAS_ENTITY",
        writeRelationshipProperty: "gcpEntityScore",
        write: true
        })
    YIELD graph AS g
    RETURN g
    } IN TRANSACTIONS OF 10 ROWS
RETURN "Success!";

That query really only has a handful of nodes to process, so it returns quite quickly, and the financial cost would be negligible.
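As an aside on the number-heavy jokes from point 1 above: one hypothetical preprocessing step would be to spell out digits before calling the NLP service. A minimal Python sketch (I did not run this here; a fuller version might use a library like num2words, and this one only handles single digits):

```python
import re

# Map single digits to words; "2" becomes "two", etc.
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def spell_out_digits(text: str) -> str:
    # Replace only standalone single digits; multi-digit numbers like "365"
    # are left untouched by the word-boundary anchors.
    return re.sub(r"\b(\d)\b", lambda m: DIGIT_WORDS[m.group(1)], text)

print(spell_out_digits("I have 2 dogs and 1 cat"))  # I have two dogs and one cat
```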

Indexing the Entities that we extracted

CREATE INDEX entity_index FOR (e:Entity) ON (e.text);
CREATE INDEX entity_rel_index FOR ()-[he:HAS_ENTITY]-() ON (he.gcpEntityScore);

Refactoring after NLP-based entity extraction: consolidating the Entity nodes with different capitalisation

After running the NLP procedures, we quickly notice that there is some duplication in the Entity nodes that have been detected. Specifically, the capitalisation of the Entity nodes can be quite confusing and we should take some care to resolve this - which is very easily done.

:auto MATCH (e1:Entity), (e2:Entity)
WHERE id(e1) < id(e2)
AND toLower(e1.text) = toLower(e2.text)
CALL {
  WITH e1, e2
  MATCH (e2)<-[:HAS_ENTITY]-(dj:Dadjoke)
  CREATE (dj)-[:HAS_ENTITY]->(e1)
  WITH e1, e2
  CALL apoc.create.setLabels(e1, labels(e2))
    YIELD node
  SET e1.text = toLower(e1.text)
  DETACH DELETE e2
} IN TRANSACTIONS OF 10 ROWS
RETURN "Success!";

Refactor the entities to lowercase

We can just quickly explore a specific part of the graph that we have explored before now:

MATCH path = (e:Entity)--(dj:Dadjoke)-[:REFERENCES_DADJOKE]-(t:Tweet)--(h:Handle)
WHERE dj.Text CONTAINS "amazon"
    RETURN path;

Dadjokes about Amazon

Note that we now see some notable differences:

  • 2 dadjokes out of 3 mention Jeff's pyjamas, and 1 of the dadjokes mentions pajamas
  • 2 dadjokes out of 3 mention Jeff Bezos, and 1 of the dadjokes mentions eff Bezos
  • there is no overlap between those sets of 2 dadjokes
  • the only Entity they have in common, therefore, is Bed

This will be of importance when we calculate our next set of similarity scores.


A Graph Database and a Dadjoke walk into a bar...

Part 5/6: Disambiguation using Graph Data Science on the NLP-based Entities

The next, and final, step in our dadjoke journey is going to be taking the disambiguation to the next level by applying Graph Data Science metrics to the new, enriched (using NLP) structure that we have in our graph database. The basic idea is that, while the text similarity of these jokes may be quite far apart, their structural similarity may still be quite high, based on the connectivity between a joke and its (NLP-based) entities.

Calculating the Jaccard similarity metric using GDS

To explore this, we will be using the Jaccard similarity coefficient, which is part of the Neo4j Graph Data Science library that we have installed on our server. More about this coefficient can be found on Wikipedia. The coefficient is defined as the size of the intersection divided by the size of the union of the sample sets, which is very well illustrated on that Wikipedia page. I have used NEuler (the no-code graph data science playground that you can easily add to your Neo4j Desktop installation) to generate the code below - but you can just as easily run it in the Neo4j Browser.

:param limit => ( 500);
:param graphConfig => ({
  nodeProjection: '*',
  relationshipProjection: {
    relType: {
      type: 'HAS_ENTITY',
      orientation: 'NATURAL',
      properties: {}
    }
  }
});
:param config => ({
  similarityMetric: 'Jaccard',
  similarityCutoff: 0,
  degreeCutoff: 2,
  writeProperty: 'score',
  writeRelationshipType: 'JACCARD_SIMILAR'
});
:param communityNodeLimit => ( 10);
:param generatedName => ('in-memory-graph-1663777188212');

CALL gds.graph.project($generatedName, $graphConfig.nodeProjection, $graphConfig.relationshipProjection, {});

CALL gds.nodeSimilarity.write($generatedName, $config);

MATCH (from)-[rel:`JACCARD_SIMILAR`]-(to)
WHERE exists(rel.`score`)
RETURN from, to, rel.`score` AS similarity
ORDER BY similarity DESC
LIMIT toInteger($limit);

CALL gds.graph.drop($generatedName);

Run Jaccard Similarity with GDS
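To make the coefficient concrete, here is a tiny Python sketch of the per-pair computation that gds.nodeSimilarity performs over each Dadjoke's set of HAS_ENTITY neighbours - the entity sets below are made up for illustration:

```python
# Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|.
def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical entity sets for two variants of the pyjamazon joke:
joke1 = {"jeff bezos", "pyjamas", "bed"}
joke2 = {"jeff bezos", "pyjamas", "bed", "amazon"}
print(jaccard(joke1, joke2))  # 0.75 - 3 shared entities out of 4 in total
```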

This will generate a set of JACCARD_SIMILAR relationships between Dadjoke nodes, each carrying a score property that indicates how similar the two jokes are. We can easily add an index on that relationship property:

CREATE INDEX jaccard_similarity_index FOR ()-[s:JACCARD_SIMILAR]-() ON (s.score);

We can explore this similarity with a few queries.

MATCH p=(h:Handle)--(t:Tweet)--(d:Dadjoke)-[r:JACCARD_SIMILAR]->()
WHERE d.Text CONTAINS "pyjamazon"
RETURN p;

Amazon jokes with Jaccard similarities

Using Jaccard similarity for disambiguation

Returning to our objective of disambiguating the jokes: there seem to be hundreds of additional duplicates that we could eliminate using the Jaccard metric.

MATCH (d1:Dadjoke)-[r:JACCARD_SIMILAR]->(d2:Dadjoke)
WHERE r.score > 0.9
AND id(d1) > id(d2)
AND d1.Text CONTAINS "Doctor"
RETURN d1.Text, d2.Text, r.score;

Dadjokes with different text but a very high Jaccard Similarity

It turns out there are about 844 of these candidate eliminations that we could do.

MATCH p=()-[r:JACCARD_SIMILAR]->() 
WHERE r.score >0.9
WITH count(p) AS count
RETURN count; 

How many jokes with a JACCARD-similarity score over 0.9?

We could then of course also actually perform the disambiguation now, and remove the duplicate dadjokes based on the JACCARD_SIMILAR score. I have not done that in this case, as I think it is interesting to see how this structural analysis yields its insights. But clearly that is what you would consider doing as your last disambiguation step, using Neo4j.


A Graph Database and a Dadjoke walk into a bar...

Part 6/6: Closing: some cool Dadjoke Queries

Now that we have a disambiguated graph of dadjokes, let's have some fun and explore it.

How many times does a joke get tweeted?

MATCH ()-[r:REFERENCES_DADJOKE]->(dj:Dadjoke)
WITH dj.Text AS Joke, count(r) AS NrOfTimesTweeted
RETURN Joke, NrOfTimesTweeted
ORDER BY NrOfTimesTweeted DESC
LIMIT 10;

How many times does a joke get favorited?

MATCH ()-[r:REFERENCES_DADJOKE]->(dj:Dadjoke)
RETURN dj.Text AS Joke, dj.SumOfFavorites AS NrOfTimesFavorited, dj.SumOfRetweets AS NrOfTimesRetweeted
ORDER BY NrOfTimesFavorited DESC
LIMIT 10;

Different ways of finding jokes about cars

Let's explore 3 alternative ways to find jokes about cars.

1. Matching the text of the Dadjoke for the word "car"

MATCH (dj:Dadjoke) WHERE dj.Text CONTAINS "car" RETURN dj.Text LIMIT 10;

2. Checking if the Entity contains the word "car"

MATCH (e:Entity)--(dj:Dadjoke) WHERE e.text CONTAINS "car" RETURN dj.Text LIMIT 10;

3. Checking if the Entity equals the word "car"

MATCH (e:Entity)--(dj:Dadjoke) WHERE e.text = "car" RETURN dj.Text LIMIT 10;

Finding jokes about bikes and cars made from spaghetti

This was another great example:

MATCH p=(h:Handle)--(t:Tweet)--(dj:Dadjoke)-[r:JACCARD_SIMILAR]->() 
WHERE dj.Text CONTAINS "spaghetti" 
    AND (dj.Text CONTAINS "bike" OR dj.Text CONTAINS "car")
    RETURN p;

Jokes about Bikes & Cars made from Spaghetti

It's amazing to see how the same conceptual joke is being reused in different ways!

Some interesting structural characteristics about the #dadjoke twitterspace

Now we can of course also start to look at some of the structural characteristics of this part of the Twitterspace. Just from looking at some of the subgraph results of our queries, it becomes obvious that

  • lots of jokes are being repeated, time and time again
  • different Twitter handles actually borrow each others jokes - all the time

So let's explore that a little more.

How many jokes are tweeted identically by different tweeters

MATCH path = (h1:Handle)-[*2..2]->(dj:Dadjoke)<-[*2..2]-(h2:Handle)
WHERE id(h1)<id(h2)
RETURN path;

This takes a while to load, but you can clearly see a few cliques in this picture.

Paths between twitter Handles

Let's see how many such paths are actually there:

MATCH path = (h1:Handle)-[*2..2]-(dj:Dadjoke)-[*2..2]-(h2:Handle)
WHERE id(h1)<id(h2)
WITH h1.name AS FirstHandle, h2.name AS SecondHandle, count(path) AS NrOfSharedJokes
RETURN FirstHandle, SecondHandle,NrOfSharedJokes
ORDER BY NrOfSharedJokes DESC;

The result is quite enlightening: GroanBot and RandomJokesIO are clearly reinforcing one another. My personal guess is that they truly are just bots.

Count of the paths between twitter Handles

What are the most frequent entities

We already have the Favorite/Retweet scores of all the dadjokes summed up, so we can also look at which Entity nodes have the highest scores that way:

MATCH (e:Entity)--(dj:Dadjoke)
WITH e, sum(toInteger(dj.SumOfFavorites)) AS sumofsumoffavorites, sum(toInteger(dj.SumOfRetweets)) AS sumofsumofretweets
SET e.SumOfSumOfFavorites = sumofsumoffavorites
SET e.SumOfSumOfRetweets = sumofsumofretweets;

This operation finishes very quickly, and then we can do the exploration quite easily, and figure out which entities our dadjokers are mostly joking about:

MATCH (e:Entity)
RETURN e.text, e.SumOfSumOfFavorites AS EntityFavoriteScore, e.SumOfSumOfRetweets AS EntityRetweetScore
ORDER BY EntityFavoriteScore DESC
LIMIT 10;

What entities are dads joking about?

Surprise: it's about wives and bosses. Right!

Wrapping up

What a crazy ride this has been. I could actually think of many different things that I would want to do with this dataset - but I will leave it at this for now. I do think that this has been one of the best (and most FUN) examples that I have come across recently that combines data import, data wrangling, NLP, text analysis, graph data science and disambiguation in one exercise. I really loved it - and hope it will inspire others to explore this or other datasets in the same graphy way.

Cheers

Rik

{
"title": "Dadjokes",
"version": "2.1",
"settings": {
"pagenumber": 1,
"editable": true,
"fullscreenEnabled": false,
"parameters": {
"neodash_entity_text": "CHARTS"
},
"downloadImageEnabled": true,
"resizing": "all"
},
"pages": [
{
"title": "Start",
"reports": [
{
"title": "Let's start this crazy trip down the dadjoke rathole!",
"query": "![Logo](https://drive.google.com/uc?export=view&id=1Fmv5Ap0IJrUEjpHiYa7p1JxXWgGBvlf0)\n\n\n",
"width": 12,
"height": 2,
"x": 0,
"y": 0,
"type": "text",
"selection": {},
"settings": {
"nodePositions": {}
}
}
]
},
{
"title": "Tweets and Handles ",
"reports": [
{
"title": "Handles tweeting the same text",
"query": "MATCH (t1:Tweet), (t2:Tweet)\nWHERE t1 <> t2\nAND t1.Text = t2.Text\nRETURN t1.`Screen Name`, t2.`Screen Name`, t1.Text;\n",
"width": 12,
"height": 2,
"x": 0,
"y": 0,
"type": "table",
"selection": {},
"settings": {}
}
]
},
{
"title": "Jeff Bezos and his Pijamas",
"reports": [
{
"title": "",
"query": "MATCH path = (dj:Dadjoke)-[*..2]-(conn)\nWHERE dj.Text CONTAINS \"pyjamazon\"\n RETURN path\nlimit 10;\n\n\n",
"width": 12,
"height": 3,
"x": 0,
"y": 0,
"type": "graph",
"selection": {
"Dadjoke": "Text"
},
"settings": {
"nodePositions": {}
}
},
{
"title": "With Entities (after NLP)",
"query": "\nMATCH path = (e:Entity)--(dj:Dadjoke)-[REFERENCES_DADJOKE]-(t:Tweet)--(h:Handle)\nWHERE dj.Text CONTAINS \"amazon\"\n RETURN path;\n\n",
"width": 12,
"height": 2,
"x": 0,
"y": 3,
"type": "graph",
"selection": {
"Entity": "text",
"Person": "text",
"Dadjoke": "Text",
"Tweet": "Text",
"Handle": "name",
"Other": "text"
},
"settings": {
"nodePositions": {}
}
}
]
},
{
"title": "String Metrics",
"reports": [
{
"title": "Levenshtein and Sørensen-Dice Similarity",
"query": "\nMATCH (dj1:Dadjoke), (dj2:Dadjoke)\n WHERE id(dj1)<id(dj2)\n AND dj1.Text <> dj2.Text\n AND left(dj1.Text,30) = left(dj2.Text,30)\nWITH dj1.Text AS dj1text, dj2.Text AS dj2text\nLIMIT 100\nwith dj1text, dj2text, apoc.text.levenshteinSimilarity(dj1text, dj2text) AS LevenshteinSimilarity,\napoc.text.sorensenDiceSimilarity(dj1text, dj2text) AS SorensenDiceSimilarity\n WHERE LevenshteinSimilarity < 0.65\nRETURN left(dj1text,60) as `First 60 chars of dadjoke1`,left(dj2text,60) as `First 60 chars of dadjoke2`,LevenshteinSimilarity,SorensenDiceSimilarity\nORDER BY LevenshteinSimilarity DESC;\n\n",
"width": 12,
"height": 4,
"x": 0,
"y": 0,
"type": "table",
"selection": {},
"settings": {
"nodePositions": {},
"autorun": true
}
}
]
},
{
"title": "Jokes about cars",
"reports": [
{
"title": "Jokes with \"cars\" in the text",
"query": "MATCH (dj:Dadjoke) WHERE dj.Text CONTAINS \"car\" RETURN dj.Text as Dadjoke LIMIT 10;\n\n\n",
"width": 6,
"height": 2,
"x": 0,
"y": 0,
"type": "table",
"selection": {},
"settings": {
"nodePositions": {}
}
},
{
"title": "Jokes with \"car\" in the associated entity",
"query": "MATCH (e:Entity)--(dj:Dadjoke) WHERE e.text CONTAINS \"car\" RETURN e.text as Entity, dj.Text as Dadjoke LIMIT 10;",
"width": 6,
"height": 2,
"x": 6,
"y": 0,
"type": "table",
"selection": {},
"settings": {}
},
{
"title": "Dadjoke with entity equal to \"car\"",
"query": "MATCH p = (e:Entity)--(dj:Dadjoke) WHERE e.text = \"car\" RETURN dj.Text as Dadjoke LIMIT 10;\n\n",
"width": 12,
"height": 2,
"x": 0,
"y": 3,
"type": "table",
"selection": {},
"settings": {}
}
]
},
{
"title": "Jokes about spaghetti",
"reports": [
{
"title": "Spaghetti jokes",
"query": "\nMATCH p=(h:Handle)--(t:Tweet)--(dj:Dadjoke)-[r:JACCARD_SIMILAR]->() \nWHERE dj.Text CONTAINS \"spaghetti\" \n AND (dj.Text CONTAINS \"bike\" OR dj.Text CONTAINS \"car\")\n RETURN p;\n\n\n\n\n\n",
"width": 12,
"height": 3,
"x": 0,
"y": 0,
"type": "graph",
"selection": {
"Handle": "name",
"Tweet": "Tweet Id",
"Dadjoke": "(label)"
},
"settings": {
"nodePositions": {}
}
}
]
},
{
"title": "Jokes about entities",
"reports": [
{
"title": "Select the Entity",
"query": "MATCH (n:`Entity`) \nWHERE toLower(toString(n.`text`)) CONTAINS toLower($input) \nRETURN DISTINCT n.`text` as value ORDER BY size(toString(value)) ASC LIMIT 5",
"width": 3,
"height": 3,
"x": 0,
"y": 0,
"type": "select",
"selection": {},
"settings": {
"nodePositions": {},
"type": "Node Property",
"entityType": "Entity",
"propertyType": "text",
"parameterName": "neodash_entity_text"
}
},
{
"title": "",
"query": "WITH [\"car\",\"spaghetti\",\"water\",\"boil\",\"hell\"] AS entities\nMATCH p = (h:Handle)--(t:Tweet)--(dj:Dadjoke)--(e:Entity)\nWHERE e.text IN entities\nRETURN p;\n\n\n\n",
"width": 12,
"height": 3,
"x": 0,
"y": 3,
"type": "graph",
"selection": {
"Handle": "name",
"Tweet": "(label)",
"Dadjoke": "(label)",
"Entity": "(label)",
"ConsumerGood": "(label)",
"Other": "(label)"
},
"settings": {
"nodePositions": {}
}
},
{
"title": "Graph of jokes about the Entity selected",
"query": "MATCH p = (h:Handle)--(t:Tweet)--(dj:Dadjoke)--(e:Entity)\nWHERE e.text = $neodash_entity_text\nRETURN p;\n\n\n\n\n\n",
"width": 9,
"height": 3,
"x": 3,
"y": 0,
"type": "graph",
"selection": {
"Handle": "name",
"Tweet": "Tweet Id",
"Dadjoke": "(label)",
"Entity": "text",
"Other": "text"
},
"settings": {
"nodePositions": {}
}
}
]
},
{
"title": "Dadjoke Twitterverse",
"reports": [
{
"title": "Twitterverse cliques",
"query": "\nMATCH path = (h1:Handle)-[*2..2]->(dj:Dadjoke)<-[*2..2]-(h2:Handle)\nWHERE id(h1)<id(h2)\nRETURN path;\n\n",
"width": 12,
"height": 2,
"x": 0,
"y": 0,
"type": "graph",
"selection": {
"Handle": "name",
"Tweet": "(label)",
"Dadjoke": "(label)"
},
"settings": {
"nodePositions": {}
}
},
{
"title": "Twitterverse stats",
"query": "\nMATCH path = (h1:Handle)-[*2..2]-(dj:Dadjoke)-[*2..2]-(h2:Handle)\nWHERE id(h1)<id(h2)\nWITH h1.name AS FirstHandle, h2.name AS SecondHandle, count(path) AS NrOfSharedJokes\nRETURN FirstHandle, SecondHandle,NrOfSharedJokes\nORDER BY NrOfSharedJokes DESC;\n\n",
"width": 6,
"height": 2,
"x": 0,
"y": 2,
"type": "table",
"selection": {},
"settings": {
"nodePositions": {}
}
},
{
"title": "It's all about...",
"query": "\nMATCH (e:Entity)\nRETURN e.text as Entity, e.SumOfSumOfFavorites AS EntityFavoriteScore, e.SumOfSumOfRetweets AS EntityRetweetScore\nORDER BY EntityFavoriteScore DESC\nLIMIT 10;\n\n\n",
"width": 6,
"height": 2,
"x": 6,
"y": 2,
"type": "table",
"selection": {},
"settings": {
"nodePositions": {}
}
}
]
}
],
"parameters": {}
}