Skip to content

Instantly share code, notes, and snippets.

@aranm
Last active August 29, 2015 14:03
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save aranm/8d55c23a84d9c94dd4f2 to your computer and use it in GitHub Desktop.
Save aranm/8d55c23a84d9c94dd4f2 to your computer and use it in GitHub Desktop.
= The Feed is King (or Queen)
When creating a social network type application the feed is king, the most common query that will be run against your service will be a request to get the feed. There are of course many ways to model the feed and a couple of components that are required.
To do a very simple feed one might structure the data as follows:
Users follow other users
[cypher]
----
(n:USER)-[:FOLLOWS]->(m:USER)
----
//hide
[source,cypher]
----
CREATE
(_0:USER { id:1,name:"Bob" }),
(_1:USER { id:2,name:"John" }),
(_2:USER { id:3,name:"Jill" }),
(_3:USER { id:4,name:"Jack" }),
_0-[:FOLLOWS { time:200 }]->_1,
_0-[:FOLLOWS { time:201 }]->_2,
_0-[:FOLLOWS { time:202 }]->_3,
_0-[:FOLLOWS { time:203 }]->_0
----
//graph
Users can then post stuff
[source,cypher]
----
MATCH (n:USER)
WHERE n.id = 4
CREATE n-[:POSTS { time: TIMESTAMP()}]->(p:POST { content: "Hello world", time:1000})
----
//graph
Which after a few posts can end up with a graph like so:
//hide
[source,cypher]
----
MATCH (u:USER)
WHERE u.id = 1
CREATE (u)-[:POSTS]->(:POST { content:"Had no coffee today",time:1001 });
MATCH (u:USER)
WHERE u.id = 2
CREATE (u)-[:POSTS]->(:POST { content:"Have you seen the preview yet?",time:1002});
MATCH (u:USER)
WHERE u.id = 2
CREATE (u)-[:POSTS]->(:POST { content:"Get out AND have a look, it\u0027s fantastic",time:1003});
MATCH (u:USER)
WHERE u.id = 3
CREATE (u)-[:POSTS]->(:POST { content:"My baby just walked for the first time",time:1004 });
MATCH (u:USER)
WHERE u.id = 3
CREATE (u)-[:POSTS]->(:POST { content:"Feeling #great",time:1005 });
----
//graph
The feed query then becomes: (this displays the feed that Bob will see)
[source,cypher]
----
MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST)
WHERE n.name = 'Bob'
RETURN m as User, p as Post
ORDER BY p.time DESC
----
//table
All this looks well and good, but there will be some problems. The feed is generally returned in reverse date order. When browsing twitter for example we see a list of tweets with the most recent tweet at the top. To load the whole feed for a user would be ridiculous so what happens is the top 20-50 tweets are loaded and when you scroll down more are loaded as you need them. In order to achieve this with the data structure outlined above one would have to structure the queries as follows. I'm going to return 2 at a time as I don't have that many in the db)
[source,cypher]
----
MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST)
WHERE n.name = 'Bob'
RETURN m as User, p as Post
ORDER BY p.time DESC
LIMIT 2
----
//table
In order to get the next 2 after this we have to match where time is greater than the last post returned by the previous query
[source,cypher]
----
MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST)
WHERE n.name = 'Bob' AND p.time < 1004 //this time stamp is the last one returned
RETURN m as User, p as Post
ORDER BY p.time DESC
LIMIT 2
----
//table
So there we have a basic feed, it has a few issues however. The order by clause is never cheap. I don't know the internals and how well they are optimised but could imagine that in the worst case scenario the query would have to fetch every single post, order the lot of them and then return a subset. As the feed query will be the most important query we will probably need a more efficient way to get this information.
Enter the linked list.
Instead of each post only being associated with the user who made the post we will create a linked list holding the feed for every user. In order to do this we will create a circular linked list with the user initially structured like so:
[cypher]
----
(n:USER)-[:FEED]->(n)
----
Our initial graph structure will not look that different as the linked list FEED won't have any nodes in it (so it does not show on the graph visualisation)
//hide
[source,cypher]
----
//clear the database
MATCH (n)
OPTIONAL MATCH (n)-[r]-()
DELETE n,r;
//now recreate the graph with the new user structure
CREATE
(_0:USER { id:1,name:"Bob" }),
(_1:USER { id:2,name:"John" }),
(_2:USER { id:3,name:"Jill" }),
(_3:USER { id:4,name:"Jack" }),
_0-[:FEED]->_0,
_1-[:FEED]->_1,
_2-[:FEED]->_2,
_3-[:FEED]->_3,
_0-[:FOLLOWS]->_0,
_1-[:FOLLOWS]->_1,
_2-[:FOLLOWS]->_2,
_3-[:FOLLOWS]->_3,
_0-[:FOLLOWS]->_1,
_0-[:FOLLOWS]->_2,
_0-[:FOLLOWS]->_3
----
//graph
Inserting a new post gets a little more complicated
Whenever a user posts we find all of the users followers and add the post into each of their feeds. In other words we do the most work doing the insertion. The insertion is expensive, the feed request should be inexpensive. So when a post is added we add an item to each of the linked lists of the users that are following the poster. Confused? Heres the code.
[source,cypher]
----
MATCH (n:USER)
WHERE n.name = 'Jack'
CREATE (n)-[:POSTS]->(p:POST { content: 'Hello world', time: 1000 })
WITH n, p
MATCH followers-[:FOLLOWS]->n
WITH followers, n, p
MATCH followers-[oldFeed:FEED]->after
CREATE (feedItem:FEEDITEM {identifier: followers.id + '_' + 1000})-[:FEEDPOST]->p
CREATE followers-[:FEED]->feedItem-[:FEED]->after
DELETE oldFeed
----
//graph
Woah, that looks complicated, what just happened there? We now have a graph where the one post has been inserted into two users feeds. Every post that a user makes gets inserted into their own feed because all users follow themselves. This may seem counterintuative but it means that you do not have to join users own posts with those people that they follow when getting a users feed (as the users own posts appear in their own feed). Remember what we are aiming for is a fast query to be used when retrieving the feed.
The linked list insertion goes like this:
First we match the head node
[cypher]
----
MATCH followers-[oldFeed:FEED]->after
----
Then we insert a new node
[cypher]
----
CREATE followers-[:FEED]->(newNode:FEEDITEM)-[:FEED]->after
----
Then we delete the old link
[cypher]
----
DELETE oldFeed
----
The reason the structure does not look like this:
[cypher]
----
(n:USER)-[:FEED]->(p:POST)-[:FEED]->after
----
Is that the post will appear in multiple users feeds so we need a pseudo feed item that is specific to each user otherwise when we query the feed the traversal will end up getting mixed through a heap of different users feeds.
After we insert the same posts as used in the first example, the graph will look like this:
[source,cypher]
----
MATCH (n:USER)
WHERE n.name = 'Bob'
CREATE (n)-[:POSTS]->(p:POST { content: 'Had no coffee today', time: 1001 })
WITH n, p
MATCH followers-[:FOLLOWS]->n
WITH followers, n, p
MATCH followers-[oldFeed:FEED]->after
CREATE (feedItem:FEEDITEM {identifier: followers.id + '_' + 1001})-[:FEEDPOST]->p
CREATE followers-[:FEED]->feedItem-[:FEED]->after
DELETE oldFeed;
MATCH (n:USER)
WHERE n.name = 'John'
CREATE (n)-[:POSTS]->(p:POST { content: 'Have you seen the preview yet?', time: 1002 })
WITH n, p
MATCH followers-[:FOLLOWS]->n
WITH followers, n, p
MATCH followers-[oldFeed:FEED]->after
CREATE (feedItem:FEEDITEM {identifier: followers.id + '_' + 1002})-[:FEEDPOST]->p
CREATE followers-[:FEED]->feedItem-[:FEED]->after
DELETE oldFeed;
MATCH (n:USER)
WHERE n.name = 'John'
CREATE (n)-[:POSTS]->(p:POST { content: 'Get out AND have a look, it\u0027s fantastic', time: 1003 })
WITH n, p
MATCH followers-[:FOLLOWS]->n
WITH followers, n, p
MATCH followers-[oldFeed:FEED]->after
CREATE (feedItem:FEEDITEM {identifier: followers.id + '_' + 1003})-[:FEEDPOST]->p
CREATE followers-[:FEED]->feedItem-[:FEED]->after
DELETE oldFeed;
MATCH (n:USER)
WHERE n.name = 'Jill'
CREATE (n)-[:POSTS]->(p:POST { content: 'My baby just walked for the first time', time: 1004 })
WITH n, p
MATCH followers-[:FOLLOWS]->n
WITH followers, n, p
MATCH followers-[oldFeed:FEED]->after
CREATE (feedItem:FEEDITEM {identifier: followers.id + '_' + 1004})-[:FEEDPOST]->p
CREATE followers-[:FEED]->feedItem-[:FEED]->after
DELETE oldFeed;
MATCH (n:USER)
WHERE n.name = 'Jill'
CREATE (n)-[:POSTS]->(p:POST { content: 'Feeling #great', time: 1005 })
WITH n, p
MATCH followers-[:FOLLOWS]->n
WITH followers, n, p
MATCH followers-[oldFeed:FEED]->after
CREATE (feedItem:FEEDITEM {identifier: followers.id + '_' + 1005})-[:FEEDPOST]->p
CREATE followers-[:FEED]->feedItem-[:FEED]->after
DELETE oldFeed;
----
//graph
So now we come to the all important feed query, it ends up looking like this:
[source,cypher]
----
MATCH (n:USER)
WHERE n.name = 'Bob'
MATCH n-[:FEED*]->(feedItem)
WHERE feedItem <> n
WITH feedItem
MATCH feedItem-[:FEEDPOST]->post<-[:POSTS]-user
RETURN user as User, post as Post
----
And returns us the same results as the simple graph, except now we have lost the order-by clause.
//table
We also have another improvement. As the posts in each users feed are now ordered we can limit the number of relationships the query will follow like so:
[source,cypher]
----
MATCH (n:USER)
WHERE n.name = 'Bob'
MATCH n-[:FEED*0..2]->(feedItem)
WHERE feedItem <> n
WITH feedItem
MATCH feedItem-[:FEEDPOST]->post<-[:POSTS]-user
RETURN user as User, post as Post
----
Which will return the top 2 posts in the feed without loading every item in the feed in order to limit it.
//table
In order to return the next two items I have found it better to give each feed item a unique identifer so I can do an indexed lookup and start from that node on the next retrieval.
[source,cypher]
----
MATCH (n:USER)
WHERE n.name = 'Bob'
WITH n
MATCH (startFeedItem:FEEDITEM)
WHERE startFeedItem.identifier = '1_1004'
WITH startFeedItem, n
MATCH startFeedItem-[:FEED*0..2]->(feedItem)
WHERE feedItem <> startFeedItem
MATCH feedItem-[:FEEDPOST]->post<-[:POSTS]-user
RETURN user as User, post as Post
----
//table
Enter the demon in the corner. In order to have Neo4j (or any other storage mechanism) add entries to a linked list you need to take a lock on the node you are adding the next linked node to. If you do not and you get two inserts happening at the same time your linked list will branch with two nodes coming off the one node. I have done pretty extensive testing on this, the posts (as well as the findings) relating to this were asked on the Neo4j mailing list if you are interested.
Taking a lock is possible, the queries do get a lot more complicated, but I am worried about scalability. If a user has 200 friends I will have to take locks on their feeds. That means that if you have two friends that are posting items that should end up in your feed at the same time that a deadlock is going to occur. In certain networks (like twitter) some users have 100,000 followers. In a scenario such as this I would have to take a lock on 100,000 nodes. This means that while the user that has 100,000 friends is posting no one else that is friends with any of those 100,000 people will be able to post without getting a deadlock. This does not seem to be scalable.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment