aranm/SocialNetworkFeed

## SocialNetworkFeed
= The Feed is King (or Queen)

When creating a social network type application the feed is king, the most common query that will be run against your service will be a request to get the feed. There are of course many ways to model the feed and a couple of components that are required.

To do a very simple feed one might structure the data as follows:

Users follow other users

[cypher]
----
(n:USER)-[:FOLLOWS]->(m:USER)
----

//hide
[source,cypher]
----
CREATE
(_0:USER { id:1,name:"Bob" }),
(_1:USER { id:2,name:"John" }),
(_2:USER { id:3,name:"Jill" }),
(_3:USER { id:4,name:"Jack" }),
_0-[:FOLLOWS { time:200 }]->_1,
_0-[:FOLLOWS { time:201 }]->_2,
_0-[:FOLLOWS { time:202 }]->_3,
_0-[:FOLLOWS { time:203 }]->_0
----
//graph

Users can then post stuff
[source,cypher]
----
MATCH (n:USER)
WHERE n.id = 4
CREATE n-[:POSTS { time: TIMESTAMP()}]->(p:POST { content: "Hello world", time:1000})
----
//graph


Which after a few posts can end up with a graph like so:
//hide
[source,cypher]
----
MATCH (u:USER)
WHERE u.id = 1
CREATE (u)-[:POSTS]->(:POST { content:"Had no coffee today",time:1001 });

MATCH (u:USER)
WHERE u.id = 2
CREATE (u)-[:POSTS]->(:POST { content:"Have you seen the preview yet?",time:1002});

MATCH (u:USER)
WHERE u.id = 2
CREATE (u)-[:POSTS]->(:POST { content:"Get out AND have a look, it\u0027s fantastic",time:1003});

MATCH (u:USER)
WHERE u.id = 3
CREATE (u)-[:POSTS]->(:POST { content:"My baby just walked for the first time",time:1004 });

MATCH (u:USER)
WHERE u.id = 3
CREATE (u)-[:POSTS]->(:POST { content:"Feeling #great",time:1005 });
----
//graph

The feed query then becomes: (this displays the feed that Bob will see)

[source,cypher]
----
MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST)
WHERE n.name = 'Bob'
RETURN m as User, p as Post
ORDER BY p.time DESC
----
//table

All this looks well and good, but there will be some problems. The feed is generally returned in reverse date order. When browsing twitter for example we see a list of tweets with the most recent tweet at the top. To load the whole feed for a user would be ridiculous so what happens is the top 20-50 tweets are loaded and when you scroll down more are loaded as you need them. In order to achieve this with the data structure outlined above one would have to structure the queries as follows. I'm going to return 2 at a time as I don't have that many in the db)

[source,cypher]
----
MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST)
WHERE n.name = 'Bob'
RETURN m as User, p as Post
ORDER BY p.time DESC
LIMIT 2
----
//table

In order to get the next 2 after this we have to match where time is greater than the last post returned by the previous query

[source,cypher]
----
MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST)
WHERE n.name = 'Bob' AND p.time < 1004 //this time stamp is the last one returned
RETURN m as User, p as Post
ORDER BY p.time DESC
LIMIT 2
----
//table

So there we have a basic feed, it has a few issues however. The order by clause is never cheap. I don't know the internals and how well they are optimised but could imagine that in the worst case scenario the query would have to fetch every single post, order the lot of them and then return a subset. As the feed query will be the most important query we will probably need a more efficient way to get this information.

Enter the linked list.

Instead of each post only being associated with the user who made the post we will create a linked list holding the feed for every user. In order to do this we will create a circular linked list with the user initially structured like so:

[cypher]
----
(n:USER)-[:FEED]->(n)
----

Our initial graph structure will not look that different as the linked list FEED won't have any nodes in it (so it does not show on the graph visualisation)

//hide
[source,cypher]
----
//clear the database
MATCH (n)
OPTIONAL MATCH (n)-[r]-()
DELETE n,r;

//now recreate the graph with the new user structure
CREATE
(_0:USER { id:1,name:"Bob" }),
(_1:USER { id:2,name:"John" }),
(_2:USER { id:3,name:"Jill" }),
(_3:USER { id:4,name:"Jack" }),
_0-[:FEED]->_0,
_1-[:FEED]->_1,
_2-[:FEED]->_2,
_3-[:FEED]->_3,
_0-[:FOLLOWS]->_0,
_1-[:FOLLOWS]->_1,
_2-[:FOLLOWS]->_2,
_3-[:FOLLOWS]->_3,
_0-[:FOLLOWS]->_1,
_0-[:FOLLOWS]->_2,
_0-[:FOLLOWS]->_3
----
//graph

Inserting a new post gets a little more complicated

Whenever a user posts we find all of the users followers and add the post into each of their feeds. In other words we do the most work doing the insertion. The insertion is expensive, the feed request should be inexpensive. So when a post is added we add an item to each of the linked lists of the users that are following the poster. Confused? Heres the code.

[source,cypher]
----
MATCH (n:USER)
WHERE n.name = 'Jack'
CREATE (n)-[:POSTS]->(p:POST { content: 'Hello world', time: 1000 })
WITH n, p
MATCH followers-[:FOLLOWS]->n
WITH followers, n, p
MATCH followers-[oldFeed:FEED]->after
CREATE (feedItem:FEEDITEM {identifier: followers.id + '_' + 1000})-[:FEEDPOST]->p
CREATE followers-[:FEED]->feedItem-[:FEED]->after
DELETE oldFeed
----
//graph

Woah, that looks complicated, what just happened there? We now have a graph where the one post has been inserted into two users feeds. Every post that a user makes gets inserted into their own feed because all users follow themselves. This may seem counterintuative but it means that you do not have to join users own posts with those people that they follow when getting a users feed (as the users own posts appear in their own feed). Remember what we are aiming for is a fast query to be used when retrieving the feed.

The linked list insertion goes like this:

First we match the head node

[cypher]
----
MATCH followers-[oldFeed:FEED]->after
----

Then we insert a new node

[cypher]
----
CREATE followers-[:FEED]->(newNode:FEEDITEM)-[:FEED]->after
----

Then we delete the old link

[cypher]
----
DELETE oldFeed
----

The reason the structure does not look like this:

[cypher]
----
(n:USER)-[:FEED]->(p:POST)-[:FEED]->after
----

Is that the post will appear in multiple users feeds so we need a pseudo feed item that is specific to each user otherwise when we query the feed the traversal will end up getting mixed through a heap of different users feeds.

After we insert the same posts as used in the first example, the graph will look like this:

[source,cypher]
----
MATCH (n:USER)
WHERE n.name = 'Bob'
CREATE (n)-[:POSTS]->(p:POST { content: 'Had no coffee today', time: 1001 })
WITH n, p
MATCH followers-[:FOLLOWS]->n
WITH followers, n, p
MATCH followers-[oldFeed:FEED]->after
CREATE (feedItem:FEEDITEM {identifier: followers.id + '_' + 1001})-[:FEEDPOST]->p
CREATE followers-[:FEED]->feedItem-[:FEED]->after
DELETE oldFeed;

MATCH (n:USER)
WHERE n.name = 'John'
CREATE (n)-[:POSTS]->(p:POST { content: 'Have you seen the preview yet?', time: 1002 })
WITH n, p
MATCH followers-[:FOLLOWS]->n
WITH followers, n, p
MATCH followers-[oldFeed:FEED]->after
CREATE (feedItem:FEEDITEM {identifier: followers.id + '_' + 1002})-[:FEEDPOST]->p
CREATE followers-[:FEED]->feedItem-[:FEED]->after
DELETE oldFeed;

MATCH (n:USER)
WHERE n.name = 'John'
CREATE (n)-[:POSTS]->(p:POST { content: 'Get out AND have a look, it\u0027s fantastic', time: 1003 })
WITH n, p
MATCH followers-[:FOLLOWS]->n
WITH followers, n, p
MATCH followers-[oldFeed:FEED]->after
CREATE (feedItem:FEEDITEM {identifier: followers.id + '_' + 1003})-[:FEEDPOST]->p
CREATE followers-[:FEED]->feedItem-[:FEED]->after
DELETE oldFeed;

MATCH (n:USER)
WHERE n.name = 'Jill'
CREATE (n)-[:POSTS]->(p:POST { content: 'My baby just walked for the first time', time: 1004 })
WITH n, p
MATCH followers-[:FOLLOWS]->n
WITH followers, n, p
MATCH followers-[oldFeed:FEED]->after
CREATE (feedItem:FEEDITEM {identifier: followers.id + '_' + 1004})-[:FEEDPOST]->p
CREATE followers-[:FEED]->feedItem-[:FEED]->after
DELETE oldFeed;

MATCH (n:USER)
WHERE n.name = 'Jill'
CREATE (n)-[:POSTS]->(p:POST { content: 'Feeling #great', time: 1005 })
WITH n, p
MATCH followers-[:FOLLOWS]->n
WITH followers, n, p
MATCH followers-[oldFeed:FEED]->after
CREATE (feedItem:FEEDITEM {identifier: followers.id + '_' + 1005})-[:FEEDPOST]->p
CREATE followers-[:FEED]->feedItem-[:FEED]->after
DELETE oldFeed;
----
//graph

So now we come to the all important feed query, it ends up looking like this:

[source,cypher]
----
MATCH (n:USER)
WHERE n.name = 'Bob'
MATCH n-[:FEED*]->(feedItem)
WHERE feedItem <> n
WITH feedItem
MATCH feedItem-[:FEEDPOST]->post<-[:POSTS]-user
RETURN user as User, post as Post
----
And returns us the same results as the simple graph, except now we have lost the order-by clause.
//table

We also have another improvement. As the posts in each users feed are now ordered we can limit the number of relationships the query will follow like so:

[source,cypher]
----
MATCH (n:USER)
WHERE n.name = 'Bob'
MATCH n-[:FEED*0..2]->(feedItem)
WHERE feedItem <> n
WITH feedItem
MATCH feedItem-[:FEEDPOST]->post<-[:POSTS]-user
RETURN user as User, post as Post
----
Which will return the top 2 posts in the feed without loading every item in the feed in order to limit it.
//table

In order to return the next two items I have found it better to give each feed item a unique identifer so I can do an indexed lookup and start from that node on the next retrieval.

[source,cypher]
----
MATCH (n:USER)
WHERE n.name = 'Bob'
WITH n
MATCH (startFeedItem:FEEDITEM)
WHERE startFeedItem.identifier = '1_1004'
WITH startFeedItem, n
MATCH startFeedItem-[:FEED*0..2]->(feedItem)
WHERE feedItem <> startFeedItem
MATCH feedItem-[:FEEDPOST]->post<-[:POSTS]-user
RETURN user as User, post as Post
----
//table

Enter the demon in the corner. In order to have Neo4j (or any other storage mechanism) add entries to a linked list you need to take a lock on the node you are adding the next linked node to. If you do not and you get two inserts happening at the same time your linked list will branch with two nodes coming off the one node. I have done pretty extensive testing on this, the posts (as well as the findings) relating to this were asked on the Neo4j mailing list if you are interested.

Taking a lock is possible, the queries do get a lot more complicated, but I am worried about scalability. If a user has 200 friends I will have to take locks on their feeds. That means that if you have two friends that are posting items that should end up in your feed at the same time that a deadlock is going to occur. In certain networks (like twitter) some users have 100,000 followers. In a scenario such as this I would have to take a lock on 100,000 nodes. This means that while the user that has 100,000 friends is posting no one else that is friends with any of those 100,000 people will be able to post without getting a deadlock. This does not seem to be scalable.
	= The Feed is King (or Queen)

	When creating a social network type application the feed is king, the most common query that will be run against your service will be a request to get the feed. There are of course many ways to model the feed and a couple of components that are required.

	To do a very simple feed one might structure the data as follows:

	Users follow other users

	[cypher]
	----
	(n:USER)-[:FOLLOWS]->(m:USER)
	----

	//hide
	[source,cypher]
	----
	CREATE
	(_0:USER { id:1,name:"Bob" }),
	(_1:USER { id:2,name:"John" }),
	(_2:USER { id:3,name:"Jill" }),
	(_3:USER { id:4,name:"Jack" }),
	_0-[:FOLLOWS { time:200 }]->_1,
	_0-[:FOLLOWS { time:201 }]->_2,
	_0-[:FOLLOWS { time:202 }]->_3,
	_0-[:FOLLOWS { time:203 }]->_0
	----
	//graph

	Users can then post stuff
	[source,cypher]
	----
	MATCH (n:USER)
	WHERE n.id = 4
	CREATE n-[:POSTS { time: TIMESTAMP()}]->(p:POST { content: "Hello world", time:1000})
	----
	//graph


	Which after a few posts can end up with a graph like so:
	//hide
	[source,cypher]
	----
	MATCH (u:USER)
	WHERE u.id = 1
	CREATE (u)-[:POSTS]->(:POST { content:"Had no coffee today",time:1001 });

	MATCH (u:USER)
	WHERE u.id = 2
	CREATE (u)-[:POSTS]->(:POST { content:"Have you seen the preview yet?",time:1002});

	MATCH (u:USER)
	WHERE u.id = 2
	CREATE (u)-[:POSTS]->(:POST { content:"Get out AND have a look, it\u0027s fantastic",time:1003});

	MATCH (u:USER)
	WHERE u.id = 3
	CREATE (u)-[:POSTS]->(:POST { content:"My baby just walked for the first time",time:1004 });

	MATCH (u:USER)
	WHERE u.id = 3
	CREATE (u)-[:POSTS]->(:POST { content:"Feeling #great",time:1005 });
	----
	//graph

	The feed query then becomes: (this displays the feed that Bob will see)

	[source,cypher]
	----
	MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST)
	WHERE n.name = 'Bob'
	RETURN m as User, p as Post
	ORDER BY p.time DESC
	----
	//table

	All this looks well and good, but there will be some problems. The feed is generally returned in reverse date order. When browsing twitter for example we see a list of tweets with the most recent tweet at the top. To load the whole feed for a user would be ridiculous so what happens is the top 20-50 tweets are loaded and when you scroll down more are loaded as you need them. In order to achieve this with the data structure outlined above one would have to structure the queries as follows. I'm going to return 2 at a time as I don't have that many in the db)

	[source,cypher]
	----
	MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST)
	WHERE n.name = 'Bob'
	RETURN m as User, p as Post
	ORDER BY p.time DESC
	LIMIT 2
	----
	//table

	In order to get the next 2 after this we have to match where time is greater than the last post returned by the previous query

	[source,cypher]
	----
	MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST)
	WHERE n.name = 'Bob' AND p.time < 1004 //this time stamp is the last one returned
	RETURN m as User, p as Post
	ORDER BY p.time DESC
	LIMIT 2
	----
	//table

	So there we have a basic feed, it has a few issues however. The order by clause is never cheap. I don't know the internals and how well they are optimised but could imagine that in the worst case scenario the query would have to fetch every single post, order the lot of them and then return a subset. As the feed query will be the most important query we will probably need a more efficient way to get this information.

	Enter the linked list.

	Instead of each post only being associated with the user who made the post we will create a linked list holding the feed for every user. In order to do this we will create a circular linked list with the user initially structured like so:

	[cypher]
	----
	(n:USER)-[:FEED]->(n)
	----

	Our initial graph structure will not look that different as the linked list FEED won't have any nodes in it (so it does not show on the graph visualisation)

	//hide
	[source,cypher]
	----
	//clear the database
	MATCH (n)
	OPTIONAL MATCH (n)-[r]-()
	DELETE n,r;

	//now recreate the graph with the new user structure
	CREATE
	(_0:USER { id:1,name:"Bob" }),
	(_1:USER { id:2,name:"John" }),
	(_2:USER { id:3,name:"Jill" }),
	(_3:USER { id:4,name:"Jack" }),
	_0-[:FEED]->_0,
	_1-[:FEED]->_1,
	_2-[:FEED]->_2,
	_3-[:FEED]->_3,
	_0-[:FOLLOWS]->_0,
	_1-[:FOLLOWS]->_1,
	_2-[:FOLLOWS]->_2,
	_3-[:FOLLOWS]->_3,
	_0-[:FOLLOWS]->_1,
	_0-[:FOLLOWS]->_2,
	_0-[:FOLLOWS]->_3
	----
	//graph

	Inserting a new post gets a little more complicated

	Whenever a user posts we find all of the users followers and add the post into each of their feeds. In other words we do the most work doing the insertion. The insertion is expensive, the feed request should be inexpensive. So when a post is added we add an item to each of the linked lists of the users that are following the poster. Confused? Heres the code.

	[source,cypher]
	----
	MATCH (n:USER)
	WHERE n.name = 'Jack'
	CREATE (n)-[:POSTS]->(p:POST { content: 'Hello world', time: 1000 })
	WITH n, p
	MATCH followers-[:FOLLOWS]->n
	WITH followers, n, p
	MATCH followers-[oldFeed:FEED]->after
	CREATE (feedItem:FEEDITEM {identifier: followers.id + '_' + 1000})-[:FEEDPOST]->p
	CREATE followers-[:FEED]->feedItem-[:FEED]->after
	DELETE oldFeed
	----
	//graph

	Woah, that looks complicated, what just happened there? We now have a graph where the one post has been inserted into two users feeds. Every post that a user makes gets inserted into their own feed because all users follow themselves. This may seem counterintuative but it means that you do not have to join users own posts with those people that they follow when getting a users feed (as the users own posts appear in their own feed). Remember what we are aiming for is a fast query to be used when retrieving the feed.

	The linked list insertion goes like this:

	First we match the head node

	[cypher]
	----
	MATCH followers-[oldFeed:FEED]->after
	----

	Then we insert a new node

	[cypher]
	----
	CREATE followers-[:FEED]->(newNode:FEEDITEM)-[:FEED]->after
	----

	Then we delete the old link

	[cypher]
	----
	DELETE oldFeed
	----

	The reason the structure does not look like this:

	[cypher]
	----
	(n:USER)-[:FEED]->(p:POST)-[:FEED]->after
	----

	Is that the post will appear in multiple users feeds so we need a pseudo feed item that is specific to each user otherwise when we query the feed the traversal will end up getting mixed through a heap of different users feeds.

	After we insert the same posts as used in the first example, the graph will look like this:

	[source,cypher]
	----
	MATCH (n:USER)
	WHERE n.name = 'Bob'
	CREATE (n)-[:POSTS]->(p:POST { content: 'Had no coffee today', time: 1001 })
	WITH n, p
	MATCH followers-[:FOLLOWS]->n
	WITH followers, n, p
	MATCH followers-[oldFeed:FEED]->after
	CREATE (feedItem:FEEDITEM {identifier: followers.id + '_' + 1001})-[:FEEDPOST]->p
	CREATE followers-[:FEED]->feedItem-[:FEED]->after
	DELETE oldFeed;

	MATCH (n:USER)
	WHERE n.name = 'John'
	CREATE (n)-[:POSTS]->(p:POST { content: 'Have you seen the preview yet?', time: 1002 })
	WITH n, p
	MATCH followers-[:FOLLOWS]->n
	WITH followers, n, p
	MATCH followers-[oldFeed:FEED]->after
	CREATE (feedItem:FEEDITEM {identifier: followers.id + '_' + 1002})-[:FEEDPOST]->p
	CREATE followers-[:FEED]->feedItem-[:FEED]->after
	DELETE oldFeed;

	MATCH (n:USER)
	WHERE n.name = 'John'
	CREATE (n)-[:POSTS]->(p:POST { content: 'Get out AND have a look, it\u0027s fantastic', time: 1003 })
	WITH n, p
	MATCH followers-[:FOLLOWS]->n
	WITH followers, n, p
	MATCH followers-[oldFeed:FEED]->after
	CREATE (feedItem:FEEDITEM {identifier: followers.id + '_' + 1003})-[:FEEDPOST]->p
	CREATE followers-[:FEED]->feedItem-[:FEED]->after
	DELETE oldFeed;

	MATCH (n:USER)
	WHERE n.name = 'Jill'
	CREATE (n)-[:POSTS]->(p:POST { content: 'My baby just walked for the first time', time: 1004 })
	WITH n, p
	MATCH followers-[:FOLLOWS]->n
	WITH followers, n, p
	MATCH followers-[oldFeed:FEED]->after
	CREATE (feedItem:FEEDITEM {identifier: followers.id + '_' + 1004})-[:FEEDPOST]->p
	CREATE followers-[:FEED]->feedItem-[:FEED]->after
	DELETE oldFeed;

	MATCH (n:USER)
	WHERE n.name = 'Jill'
	CREATE (n)-[:POSTS]->(p:POST { content: 'Feeling #great', time: 1005 })
	WITH n, p
	MATCH followers-[:FOLLOWS]->n
	WITH followers, n, p
	MATCH followers-[oldFeed:FEED]->after
	CREATE (feedItem:FEEDITEM {identifier: followers.id + '_' + 1005})-[:FEEDPOST]->p
	CREATE followers-[:FEED]->feedItem-[:FEED]->after
	DELETE oldFeed;
	----
	//graph

	So now we come to the all important feed query, it ends up looking like this:

	[source,cypher]
	----
	MATCH (n:USER)
	WHERE n.name = 'Bob'
	MATCH n-[:FEED*]->(feedItem)
	WHERE feedItem <> n
	WITH feedItem
	MATCH feedItem-[:FEEDPOST]->post<-[:POSTS]-user
	RETURN user as User, post as Post
	----
	And returns us the same results as the simple graph, except now we have lost the order-by clause.
	//table

	We also have another improvement. As the posts in each users feed are now ordered we can limit the number of relationships the query will follow like so:

	[source,cypher]
	----
	MATCH (n:USER)
	WHERE n.name = 'Bob'
	MATCH n-[:FEED*0..2]->(feedItem)
	WHERE feedItem <> n
	WITH feedItem
	MATCH feedItem-[:FEEDPOST]->post<-[:POSTS]-user
	RETURN user as User, post as Post
	----
	Which will return the top 2 posts in the feed without loading every item in the feed in order to limit it.
	//table

	In order to return the next two items I have found it better to give each feed item a unique identifer so I can do an indexed lookup and start from that node on the next retrieval.

	[source,cypher]
	----
	MATCH (n:USER)
	WHERE n.name = 'Bob'
	WITH n
	MATCH (startFeedItem:FEEDITEM)
	WHERE startFeedItem.identifier = '1_1004'
	WITH startFeedItem, n
	MATCH startFeedItem-[:FEED*0..2]->(feedItem)
	WHERE feedItem <> startFeedItem
	MATCH feedItem-[:FEEDPOST]->post<-[:POSTS]-user
	RETURN user as User, post as Post
	----
	//table

	Enter the demon in the corner. In order to have Neo4j (or any other storage mechanism) add entries to a linked list you need to take a lock on the node you are adding the next linked node to. If you do not and you get two inserts happening at the same time your linked list will branch with two nodes coming off the one node. I have done pretty extensive testing on this, the posts (as well as the findings) relating to this were asked on the Neo4j mailing list if you are interested.

	Taking a lock is possible, the queries do get a lot more complicated, but I am worried about scalability. If a user has 200 friends I will have to take locks on their feeds. That means that if you have two friends that are posting items that should end up in your feed at the same time that a deadlock is going to occur. In certain networks (like twitter) some users have 100,000 followers. In a scenario such as this I would have to take a lock on 100,000 nodes. This means that while the user that has 100,000 friends is posting no one else that is friends with any of those 100,000 people will be able to post without getting a deadlock. This does not seem to be scalable.