Skip to content

Instantly share code, notes, and snippets.

@mjamesruggiero
Created November 19, 2013 23:40
Show Gist options
  • Save mjamesruggiero/7554595 to your computer and use it in GitHub Desktop.
Save mjamesruggiero/7554595 to your computer and use it in GitHub Desktop.
MLConf 2013 notes (3)

Joseph Gonzalez, Co-Founder at GraphLab

Graphs encode the relationship between entities and are essential to data mining and machine learning.

Example: predicting user behavior

  • as a base technique that does a classification re: political bias
  • you can estimate the political bias and take that idea and apply it to a large user base
  • the vast majority of the users may not post frequently, but you have the follower structure, the graph, so you can use a conditional random field

"triangle counting"

  • count triangles passing through each vertex
  • measures "cohesiveness" of a local community
  • real estate agents are the "highest" triangles

How should we program these? Everyone reaches for hadoop

  • it has a very good parallelizing primitive, but it's row-based, not graph-based
  • how do you do a dependency graph?
  • how do you iterate on a graph with partitioning?

Pregel - Google's graph parallel abstraction

  • a vertex program runs on each vertex
  • that program is constrained to only interact with nodes that share its vertices

Example: noise reduction in image files

  • Most people use adaptive belief propagation
  • the problem with noisy images is that the boundaries are hard
  • "This is easy to reason about with grid-like graphs, but in the real world graphs are governed by power laws and high-degree vertices"
  • hard to partition high-degree vertices

Graphlab (aka "the pitch")

  • Graphlab splits vertices rather than edges
  • they have some (presumably proprietary) heuristics for splitting vertices
  • unlike Pregel, which uses uses message-passing, Graphlab uses a shared state model

Page rank example

  • Alex Smola's technique can process 150 million tokens per second;
  • the Yahoo altavista web graph: 1B links processed per second 30 lines of user code
  • Graphlab ~100 tokens per second.... but the code is 30 lines written in 4 human-hours

Python libraries

  • they've built a series of python notebooks; you can apply them to real data
  • Their python tools work as a packaged set of binaries so that you don't need to compile the C++
  • you can do local prototyping but also deploy the same setup to your EC2 cluster
  • toolkits customized for different use cases

Further

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment