Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save wolfram77/b4316609265b5b9f88027bbc491f80b6 to your computer and use it in GitHub Desktop.
Save wolfram77/b4316609265b5b9f88027bbc491f80b6 to your computer and use it in GitHub Desktop.
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES

Community detection methods can be global or local. Global community detection methods divide the entire graph into groups. Existing global algorithms include:

  • Random walk methods
  • Spectral partitioning
  • Label propagation
  • Greedy agglomerative and divisive algorithms
  • Clique percolation

There is a growing body of work in detecting overlapping communities. Seed set expansion is a local community detection method where a relevant seed vertices of interest are picked and expanded to form communities surrounding them. The quality of each community is measured using a fitness function.

Modularity is a fitness function which compares the number of intra-community edges to the expected number in a random-null model. Conductance is another popular fitness score that measures the community cut or inter-community edges. Many overlapping community detection methods use a modified ratio of intra-community edges to all edges with atleast one endpoint in the community.

Andersen et al. use a Spectral PageRank-Nibble method which minimizes conductance and is formed by adding vertices in order of decreasing PageRank values. Andersen and Lang develop a random walk approach in which some vertices in the seed set may not be placed in the final community. Clauset gives a greedy method that starts from a single vertex and then iteratively adds neighboring vertices maximizing the local modularity score. Riedy et al. expand multiple vertices via maximizing modularity.

Several algorithms for detecting global, overlapping communities use a greedy, agglomerative approach and run multiple separate seed set expansions. Lancichinetti et al. run greedy seed set expansions, each with a single seed vertex. Overlapping communities are produced by a sequentially running expansions from a node not yet in a community. Lee et al. use maximal cliques as seed sets. Havemann et al. greedily expand cliques.

The authors of this paper discuss a dynamic approach for community detection using seed set expansion. Simply marking the neighbours of changed vertices is a naive approach, and has severe shortcomings. This is because communities can split apart. The simple updating method may fail even when it outputs a valid community in the graph.

The local community is chosen for a particular seed set. The task is not simply to find any good community in the graph, but rather the appropriate community for the seed. Changes to the graph may shift the community C to one not centered around the original seed. While C may still have good fitness score, it may not be produced by a complete re-computation using static seed set expansion.

When the algorithm begins, the community initially contains only the seed, and new vertices are then iteratively added. In each iteration, the neighboring vertices of the current community are potential new members and the vertex producing the greatest increase in the fitness score is chosen. The initial computation thus results in an ordered sequence of vertices added to the community and a corresponding sequence of nested sets, each with an increasingly greater fitness score. As the goal is to maintain a community centered around the seed, it is necessary to keep track of the order in which vertices were added.

We require the updated community to contain vertices that, if added one by one as in the static algorithm, result in an increasing sequence. This guarentees that the resulting community remains relevant to the source seed. For each update, we modify the sequence of community members to ensure that the corresponding fitness scores are increasing. After a batch of updates, the algorithm detects any decreases in the sequence of fitness scores and removes vertices from the community to eliminate any such decrease. Next, it checks if any new vertices should be added and updates the community evolution sequence if needed. It is ensured that there are no additional vertices in the graph G whose inclusion in the community would increase the fitness score further.

The steps of the algorithm are as follows.

  1. Fitness scores are obtained for all the new internal and border edges.
  2. Vertices that are endpoints of an updated edge are checked for removal. If a vertex v is removed, the community members are further pruned and the community evolution sequence is updated to reflect the pruning.
  3. Community evolution sequence is scanned to check that the sequence of fitness scores is still monotonically increasing. If a dip exists at a position, then the community evolution sequence is truncated at that position.
  4. The static seed set expansion algorithm is restarted to check whether neighboring vertices should be added to the community.

The authors only consider a single edge update, even though the algorithm can handle batch updates. In order to compare the communities output by the dynamic alogrithm to those from static re-computation, the static algorithm is re-run repeatedly as the graphs are updated. Quality comparison is done through the average ratio of the fitness scores in the dynamic algorithm vs. those obtained by re-computation. In addition, precision and recall are used to compare the members of communities output by the dynamic and static algorithms.

Despite occasional dips, there is no downward trend in either precision or recall, showing that the community quality does not deteriorate over time. In addition to the full dynamic algorithm, the authors also evaluate a modified version, in which selective pruning is not performed. It is observed that the dynamic algorithm is faster than re-computation for batches of around 1000 updates. For larger graphs, this number is likely to be higher.



References

ORG

Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment