szeitlin/infection_graphgist.adoc

## infection_graphgist.adoc

      
    Raw
  

              infection_graphgist.adoc
            
          
    Mapping Infected Users On Khan Academy


Introduction


Using a graph model to represent Users on the site, we can test what happens when we roll out changes so that associated Users will all see the same version of the site. In other words, we can make changes such that one User gets 'infected' with a new version of the site, and some or all of their connected Users will receive that same version.


This could be interpreted as a version of what Graph Theory calls Transitive Closure, which is a way of propagating modified attributes to all reachable nodes from a given modified node.


Graphgists are great, because instead of just reading a static page, you can actually view the graph and play with cypher!


To view the interactive version of this page, including an interactive cypher console, paste the url of this github gist into the box at the top right-hand corner of the page you’ll land on after clicking here:http://gist.neo4j.org


This graphgist shows how to create nodes and relationships using cypher for neo4j, and how to assign a property to Users to indicate if they have been Infected with a new version of the site.


Thanks to neo4j, it’s only a minor adjustment to limit how many connected users are Infected, after specifying the node of interest and the path or depth of spread.


Initial Data Model


At the beginning, the database only consists of User nodes with [:coaches] relationships,
such that (User)-[:coaches ]->(User)


These will be examples for testing.


Different types of user-user relationships are represented, i.e:


-unconnected nodes (students studying alone)
-unidirectional single relationship (user coaches another user)
-bidirectional single relationship (users working together)
-linear but longer (one user helps another user who helps another user)
-branching (1 coach, 2 students)
-multilevel branched (1 coach, 4 students)
-multibranching (3 coaches, each with 1 or more students)


Initial Data Setup


First, create the nodes and relationships for each different type of cluster.


//a few lonely nodes who prefer to study on their own
CREATE
(A0:User {name:'Oscar'}),
(A1:User {name:'Snuffleupagus'});

//unidirectional single relationship
CREATE
(A:User {name:'A'}),
(B:User {name:'B'}),
(A) -[:coaches]-> (B);

//bidirectional single relationship
CREATE
(C:User {name:'C'}),
(D:User {name:'D'}),
(C) -[:coaches ]-> (D),
(D) -[:coaches ]-> (C);

//linear but longer
CREATE
(E:User {name:'E'}),
(F:User {name:'F'}),
(G:User {name:'G'}),
(H:User {name:'H'}),
(E) -[:coaches ]-> (F),
(F) -[:coaches ]-> (G),
(G) -[:coaches ]-> (H);

//branching
CREATE
(J:User {name:'J'}),
(K:User {name:'K'}),
(L:User {name:'L'}),
(J) -[:coaches ]-> (K),
(J) -[:coaches ]-> (L);

//multilevel
CREATE
(M:User {name:'M'}),
(N:User {name:'N'}),
(P:User {name:'P'}),
(Q:User {name:'Q'}),
(R:User {name:'R'}),
(M) -[:coaches {Infected:0}]-> (N),
(N) -[:coaches {Infected:0}]-> (P),
(M) -[:coaches {Infected:0}]-> (Q),
(Q) -[:coaches {Infected:0}]-> (R);

//multibranching
CREATE
(S:User {name:'S'}),
(T:User {name:'T'}),
(V:User {name:'V'}),
(W:User {name:'W'}),
(X:User {name:'X'}),
(Y:User {name:'Y'}),
(Z:User {name:'Z'}),
(SZ:User {name:'SZ'}),
(S) -[:coaches {Infected:0}]-> (T),
(T) -[:coaches {Infected:0}]-> (V),
(S) -[:coaches {Infected:0}]-> (W),
(W) -[:coaches {Infected:0}]-> (X),
(Y) -[:coaches {Infected:0}]-> (S),
(Y) -[:coaches {Infected:0}]-> (W),
(W) -[:coaches {Infected:0}]-> (Z),
(W) -[:coaches {Infected:0}]-> (SZ);

MATCH n RETURN n;


Finding specific users (nodes) or clusters


After creating the nodes and relationships above, I used the dummy variable n to mean 'all nodes'.
So in cypher that’s MATCH n RETURN n; to view all the nodes.


To get all nodes connected to any other nodes, I can use two dummy variables and a linker, which tells cypher that I don’t care about the direction of the relationship. In this case, it should return all the nodes except for our two students who are working alone.


MATCH (n) -- (m)
RETURN n;


To find the entirety of one cluster, we can specify the path using the name of the User to Infect first. In this case, since we only have one type of relationship, [:coaches], we can use the asterisk to mean "find all relationships from W to any other connected User".


MATCH path = (a {name:'W'}) - [*] - (n)
RETURN path;


To find only one part of the cluster, we can set a limit for the number of Users:


MATCH path = (a {name:'W'}) - [*] - (n)
RETURN a,n
LIMIT 3;


Infecting Users with Change!


Now, we want to change a property about all the Users linked to another User of our choice, or "Infect" them with our great ideas!


First, we find the path that indicates all Users connected to our User of interest, and then set a new property that indicates whether that User has been Infected or not (we’ll use 1 to mean True). Note that in this case we’re using two dummy variables again, where (a) is for the first Infected person, and (n) is all the connected Users who will be Infected by being associated with them.


//infect all Users connected to a particular User

MATCH path = (a {name:'W'}) - [*] - (n)
SET a.Infected = 1
SET n.Infected = 1
RETURN a,n;


Infect only a subset of Users


Finally, to limit the number of Users that are Infected, there are several different ways to interpret how to do that.


One simple way is to set the depth of the path, such that the distance of relationship 'hops' away from a target User is limited by a specific number or range of numbers, as in the example below, where we targeted Users that are 2-3 hops away. This way is very easy, but it doesn’t care if any students are left out.


// limit path depth

MATCH path = (a {name:'W'}) - [*2..3] - (n)
SET n.Infected=1
RETURN a,n;


Alternatively, we can specify a limit on the number of Users, rather than the depth of the path. But again, doing it that way doesn’t guarantee that nobody gets left out.


//limit number of nodes changed

MATCH path = (a {name:'G'}) - [*] - (n)
SET n.Infected=1
RETURN a,n
LIMIT 2;


Ideally, if we have a much bigger set of Users than the small test groups shown here, we’d want to try to keep everyone in a group working together.


Then we might want to find natural cluster sizes first, and then split up the clusters, using something like this more sophisticated approach.


Live Cypher Console


About me


GitHub


Blog


LinkedIn