Github Events Analysis with Neo4j

On July 22, Github announced the 3rd Annual Github Data Challenge, presenting multiple sources of available data.

This sounded to me like a good opportunity to import their available data into Neo4j and have a lot of fun analyzing data that fits naturally in a graph.

As I work mainly offline or behind military proxies that do not permit me to use the ReST API, I decided to go for the Github Archive, available here; from it you can download JSON files representing Github Events on a daily or hourly basis.

Be aware that each hourly file represents approximately 8,000 events!

Each file contains one Github Event per line, encoded as JSON. There are a bunch of EventTypes, such as ForkEvent, IssueCommentEvent, ReleaseEvent, DeleteEvent, WatchEvent, and so on.

You can find the complete list of EventTypes and Payloads here.

For my part, I will use 4 EventTypes: ForkEvent, PushEvent, PullRequestEvent and IssueCommentEvent.

This may not sound like much, but if, like me, you import these 4 event types for a period of 1 month, you'll end up with about 2 million nodes and roughly 4 times as many relationships.

Importing the data

I've created a simple library in PHP that parses these data files and imports the Events into the database.

Each Event is handled by a dedicated EventTypeLoader that produces the Cypher query for the import.

The library is available here: https://github.com/kwattro/gh4j

To be clear, the way events are imported reflects my own data-manipulation needs; you can always modify the code to reflect yours.

Also, be aware that it is not compatible with the ReST API events, as the payloads are totally different.

Information common to all Events

There is a common payload for all Events: the User creating the event and the EventType. Each Event will create a node having the name of the event as a Neo4j label.
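
As an illustration, here is a minimal sketch of the kind of Cypher an EventTypeLoader could emit for this common payload (the parameter names actor_login and created_at are hypothetical, not necessarily the ones the library uses):

// hypothetical sketch: MERGE the acting user, create the event node
// with the EventType as its label, and link the two with a DO relationship
MERGE (u:User {name: {actor_login}})
CREATE (e:PushEvent {time: {created_at}})
CREATE (u)-[:DO]->(e)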

The PullRequestEvent

By examining the PullRequestEvent payload, you can recreate the complete schema, starting from the user creating the PullRequest and ending with the owner of the Repository on which the PR is opened.

You can also tell, when the PR is closed, whether it was a merge-and-close or a plain close by looking at the merged_at key.
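
For example, a possible Cypher fragment for deriving the PR state from that key (the id property and parameter names here are illustrative only):

// illustrative sketch: a null merged_at means a plain close,
// a non-null one means the PR was merged
MATCH (pr:PullRequest {id: {pr_id}})
SET pr.state = CASE WHEN {merged_at} IS NULL THEN 'closed' ELSE 'merged' END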

The PullRequestEventLoader will import the payload into the following schema:

PullRequestSchemaImage

The ForkEvent

Same here: the ForkEvent payload gives you some information, but not much; for example, it does not give you the id of the newly created Repository (a Fork is also a repository).

So, in order to have a unique reference for matching/creating/retrieving forks, I decided to use the html urls of the Fork/Repository as identifiers; this information is available in the payload.
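
A possible sketch of the resulting MERGE, with hypothetical url properties and parameter names:

// sketch: MERGE on the html url guarantees one Fork node per fork,
// even though the payload carries no repository id
MERGE (fork:Fork {url: {fork_html_url}})
MERGE (repo:Repository {url: {repo_html_url}})
MERGE (fork)-[:FORK_OF]->(repo)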

Loading the ForkEvent will produce the following schema :

ForkEventSchema

The PushEvent

Like the ForkEvent, the PushEvent does not provide much useful information, but you can still build a little schema.

This will produce the following Neo4j graph schema:

PushEventSchema

The IssueCommentEvent

And the last one: the IssueCommentEvent. There is a check in the code of the IssueCommentEventLoader that detects whether the Issue on which the comment was made is related to a PullRequest; when true, this creates a BOUND_TO_PR relationship from the Issue to the PR.

Giving the following schema:

IssueCommentEventSchema
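
A possible sketch of that conditional binding, with hypothetical identifying properties and parameters:

// sketch: run only when the issue payload references a pull request
MATCH (issue:Issue {id: {issue_id}})
MATCH (pr:PullRequest {url: {pr_url}})
MERGE (issue)-[:BOUND_TO_PR]->(pr)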

Let's play

OK, the boring part of explaining how the data is imported is done. If you've loaded a day or two of data, you already end up with a really interesting graph.

Now it is time to leverage the Neo4j goodness called Cypher and start having fun (meaning: querying the graph) :)

Basic queries

Who did the most events?

MATCH (u:User)-[r:DO]->()
RETURN u.name, count(r) as events
ORDER BY events DESC
LIMIT 1
---------
u.name	events
imclab	273
Returned 1 row in 2755 ms

Which repository has been touched the most?

MATCH (repo:Repository)<-[r]-()
RETURN repo.name, count(r) as touchs
ORDER BY touchs DESC
LIMIT 1
---------
repo.name	touchs
dotfiles	624
Returned 1 row in 3105 ms

Which repository has been forked the most?

MATCH (repo:Repository)<-[:FORK_OF]-(fork:Fork)<-[:FORK]-(event:ForkEvent)
RETURN repo.name, count(event) as forks
ORDER BY forks DESC
LIMIT 1
---------
repo.name	forks
malloclab	81
Returned 1 row in 558 ms

Which repository has the most merged PRs?

MATCH (repo:Repository)<-[:PR_ON_REPO]-(pr:PullRequest)<-[merge:PR_MERGE]-()
RETURN repo.name, count(merge) as merges
ORDER BY merges DESC
LIMIT 1
---------
repo.name	merges
homebrew-cask	19
Returned 1 row in 329 ms

As you can see, it is simple with Cypher: you just need to draw on paper or a whiteboard the path you want to return.

OK, let's go a bit further now:

Less-basic queries

(At least for my level ;-) )

How many Forks result in an opened PR?

You just need to analyze the nodes and connections between the ForkEvent and the PullRequestEvent and translate them into Cypher.

Here is a visual representation of the complete Fork -> PR_OPEN scheme:

ForkToPRScheme

MATCH p=(u:User)-[:DO]->(fe:ForkEvent)-[:FORK]->(fork:Fork)
-[:FORK_OF]->(repo:Repository)<-[:PR_ON_REPO]-(pr:PullRequest)
-[:PR_OPEN]-(pre:PullRequestEvent)<-[:DO]-(u2:User)<-[:OWNED_BY]-
(f2:Fork)<-[:BRANCH_OF]-(br:Branch)<-[:FROM_BRANCH]-(pr2:PullRequest)
WHERE u = u2 AND fork = f2 AND pr = pr2
RETURN count(p)
----------
count(p)
1410
Returned 1 row in 1484 ms

What is the average time, in seconds, between a Fork being made and a PR being opened?

Not much changes from the previous query; we just use the avg function on the difference between the PR opening time and the Fork time.

MATCH p=(u:User)-[:DO]->(fe:ForkEvent)-[:FORK]->(fork:Fork)-[:FORK_OF]
->(repo:Repository)<-[:PR_ON_REPO]-(pr:PullRequest)-[:PR_OPEN]-(pre:PullRequestEvent)
<-[:DO]-(u2:User)<-[:OWNED_BY]-(f2:Fork)<-[:BRANCH_OF]-(br:Branch)<-[:FROM_BRANCH]-(pr2:PullRequest)
WHERE u = u2 AND fork = f2 AND pr = pr2
RETURN count(p), avg(pre.time - fe.time) as offsetTime
---------
count(p)	offsetTime
1410	    3593.6758865248225
Returned 1 row in 1114 ms

We can extend the query to find the average time until the PR is merged.

What is the average time, in seconds, between a Fork and the PR being merged?

MATCH p=(u:User)-[:DO]->(fe:ForkEvent)-[:FORK]->(fork:Fork)-[:FORK_OF]->
(repo:Repository)<-[:PR_ON_REPO]-(pr:PullRequest {state:'merged'})-[:PR_OPEN]-(pre:PullRequestEvent)
<-[:DO]-(u2:User)<-[:OWNED_BY]-(f2:Fork)<-[:BRANCH_OF]-(br:Branch)<-[:FROM_BRANCH]-(pr2:PullRequest)
<-[:PR_CLOSE]-(pr3:PullRequestEvent)
WHERE u = u2 AND fork = f2 AND pr = pr2
RETURN count(p) as matchedPaths, avg(pr3.time - fe.time) as offsetTime
---------
matchedPaths	offsetTime
408	          8177.544117647059
Returned 1 row in 928 ms

What is the average number of comments on a PR before the PR is merged?

MATCH p=(ice:IssueCommentEvent)-[:ISSUE_COMMENT]->(comment:IssueComment)
-[:COMMENT_ON]->(issue:Issue)-[:BOUND_TO_PR]->(pr:PullRequest)
<-[:PR_MERGE]-(pre:PullRequestEvent)
WHERE ice.time <= pre.time
WITH pr, count(comment) as comments
RETURN avg(comments)
---------
avg(comments)
2.0440251572327046
Returned 1 row in 713 ms

Which User has the most merged PRs on Repositories they do not own?

MATCH (u:User)-[r:DO]->(fe:PullRequestEvent)-[:PR_OPEN]->(pr:PullRequest {state:'merged'})
-[:PR_ON_REPO]-(repo:Repository)-[:OWNED_BY]->(u2:User)
WHERE NOT u = u2
RETURN u.name, count(r) as prs
ORDER BY prs DESC
LIMIT 1
--------
u.name	prs
adamralph	14
Returned 1 row in 365 ms

Relate Users having merged PRs on the same repositories; this could serve as a follow-recommendation engine.

This will also create a weight property on the relationship, incremented each time a relation between the two users is found.

MATCH p=(u:User)-[:DO]-(e:PullRequestEvent)-->(pr:PullRequest {state:'merged'})
-[:PR_ON_REPO]->(r:Repository)<-[:PR_ON_REPO]-(pr2:PullRequest {state:'merged'})
--(e2:PullRequestEvent)<-[:DO]-(u2:User)
WHERE NOT u = u2
WITH nodes(p) as coll
WITH head(coll) as st, last(coll) as end
MERGE (st)-[r:HAVE_WORKED_ON_SAME_REPO]-(end)
ON MATCH SET r.w = (r.w) + 1
ON CREATE SET r.w = 1
---------
Created 1122 relationships, returned 0 rows in 2334 ms

RecomUsers
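
Once these HAVE_WORKED_ON_SAME_REPO relationships exist, surfacing recommendations for a given user is a simple lookup; for example (the user name here is just a sample value):

MATCH (u:User {name: 'adamralph'})-[r:HAVE_WORKED_ON_SAME_REPO]-(other:User)
RETURN other.name, r.w as weight
ORDER BY weight DESC
LIMIT 5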

@kbastani

Very impressive. Thanks for writing this up so thoroughly.

For this query:

Which repository has been forked the most?

MATCH (repo:Repository)<-[:FORK_OF]-(fork:Fork)<-[:FORK]-(event:ForkEvent)
RETURN repo.name, count(event) as forks
ORDER BY forks DESC
LIMIT 1

You can remove the middle identifier.

MATCH (repo:Repository)<-[:FORK_OF]-()<-[:FORK]-(event:ForkEvent)
RETURN repo.name, count(event) as forks
ORDER BY forks DESC
LIMIT 1

This assumes that the FORK_OF and FORK relationship types only connect to nodes with the label Fork. Trivial feedback here, as the change doesn't provide a performance benefit, but it reads nicer.

For this query:

MATCH p=(u:User)-[:DO]->(fe:ForkEvent)-[:FORK]->(fork:Fork)
-[:FORK_OF]->(repo:Repository)<-[:PR_ON_REPO]-(pr:PullRequest)
-[:PR_OPEN]-(pre:PullRequestEvent)<-[:DO]-(u2:User)<-[:OWNED_BY]-
(f2:Fork)<-[:BRANCH_OF]-(br:Branch)<-[:FROM_BRANCH]-(pr2:PullRequest)
WHERE u = u2 AND fork = f2 AND pr = pr2
RETURN count(p)

You can separate the long pattern into a set of anchored patterns like this:

MATCH p=(u:User)-[:DO]->(fe:ForkEvent),
      (fe)-[:FORK]->(fork:Fork),
      (fork)-[:FORK_OF]->(repo:Repository),
      (repo)<-[:PR_ON_REPO]-(pr:PullRequest),
      (pr)-[:PR_OPEN]-(pre:PullRequestEvent),
      (pre)<-[:DO]-(u2:User),
      (u2)<-[:OWNED_BY]-(f2:Fork),
      (f2)<-[:BRANCH_OF]-(br:Branch),
      (br)<-[:FROM_BRANCH]-(pr2:PullRequest)
WHERE u = u2 AND fork = f2 AND pr = pr2
RETURN count(p)

There are multiple ways to optimize the performance of this query, for example by first matching on users who both forked and opened a pull request. This provides a limited set of starting points and prevents a global graph pattern matching operation. Check out my blog post on Neo4j Cypher query evaluation: http://www.kennybastani.com/2014/07/understanding-how-neo4j-cypher-queries.html
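
For illustration, one way such an anchored start could look, given the schema above (a sketch, not a benchmarked rewrite):

// anchor first on users who did both a ForkEvent and a PullRequestEvent
MATCH (u:User)-[:DO]->(fe:ForkEvent), (u)-[:DO]->(pre:PullRequestEvent)
WITH u, fe, pre
// then expand the rest of the pattern from those anchored nodes
MATCH (fe)-[:FORK]->(fork:Fork)-[:FORK_OF]->(repo:Repository),
      (pre)-[:PR_OPEN]->(pr:PullRequest)-[:PR_ON_REPO]->(repo),
      (pr)-[:FROM_BRANCH]->(:Branch)-[:BRANCH_OF]->(fork),
      (fork)-[:OWNED_BY]->(u)
RETURN count(pr)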

Amazing work here again. I'd love to see this put into a user interface. Take a look at a project I built to visualize behavioral analytics using charts and time series: http://meetup-analytics-dashboard.herokuapp.com/ and https://github.com/kbastani/meetup-analytics

Cheers!

Kenny

@ikwattro
Author

Thanks for your input, Kenny, and for the link about Cypher queries. I also love the anchored-patterns approach; so much more readable!

I really want to go further with the current setup, like building a time-tree graph and relating the events to it, and also adding actor locations on maps.
I'll also try to upload the database to one of my servers in order to provide data visualisation online.

Thanks again for following this.
