@shaldengeki
Created December 25, 2013 18:29
Sabretooth and LlamaGuy's descriptions of how tagd works on ETI

Sabretooth

Anyway, we do use a custom service written in C++ that holds all of the tag data in memory. Each tag holds an ordered set of topic IDs. The order is by last post time, with topic ID as a tie-breaker; this ordering is consistent between different tags, which becomes important later. If you iterate through one of these sets, you'll get all the topics tagged with that tag, from most recently updated to least recently updated.
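
A minimal sketch of what that shared ordering might look like in C++, assuming a composite key of (last post time, topic ID); the names here are illustrative, not tagd's actual types:

    #include <cstdint>
    #include <set>

    // Hypothetical key: topics sort by last post time (newest first), then by
    // topic ID to break ties. Every tag's set shares this comparator, so any
    // two sets interleave consistently.
    struct TopicKey {
        int64_t last_post_time;
        int64_t topic_id;
    };

    struct ByRecency {
        bool operator()(const TopicKey& a, const TopicKey& b) const {
            if (a.last_post_time != b.last_post_time)
                return a.last_post_time > b.last_post_time;  // more recent first
            return a.topic_id < b.topic_id;                  // tie-break on topic ID
        }
    };

    // One of these per tag; iterating walks topics from most to least
    // recently updated.
    using TagSet = std::set<TopicKey, ByRecency>;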

Tags also define and implement a very simple cursor interface. This interface is basically three functions: "advance the cursor by one topic", "advance the cursor until it's pointing to or past this topic" (passing in a reference topic), and "give me the currently pointed-to topic". When you're talking to the service, you can say "skip 50 topics and then give me 50 topics", and it will advance the cursor 50 times and throw away the results, then advance the cursor 50 more times while returning each result. Pretty straightforward.
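
As a rough illustration (not tagd's actual API), that interface might look something like the following, collapsing the (last post time, topic ID) key above into a single sortable value for brevity:

    #include <set>

    using Topic = long long;  // stand-in for the ordering key; lower = more recently updated

    // The three-operation cursor interface described above (hypothetical names).
    struct Cursor {
        virtual ~Cursor() = default;
        virtual bool AtEnd() const = 0;
        virtual Topic Current() const = 0;      // "give me the currently pointed-to topic"
        virtual void Advance() = 0;             // "advance the cursor by one topic"
        virtual void AdvanceTo(Topic ref) = 0;  // "advance until pointing to or past ref"
    };

    // A cursor over a single tag's set of topics.
    class TagCursor : public Cursor {
    public:
        explicit TagCursor(const std::set<Topic>& topics)
            : topics_(topics), it_(topics.begin()) {}
        bool AtEnd() const override { return it_ == topics_.end(); }
        Topic Current() const override { return *it_; }
        void Advance() override { ++it_; }
        void AdvanceTo(Topic ref) override {
            if (AtEnd() || Current() >= ref) return;  // forward movement only
            it_ = topics_.lower_bound(ref);           // first topic at or past ref
        }
    private:
        const std::set<Topic>& topics_;
        std::set<Topic>::const_iterator it_;
    };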

The cursor interface is also implemented by a couple of meta-cursors that implement different kinds of set operations. Each of these meta-cursors is constructed around a list of cursors and performs a basic set operation on the results they return: union, difference, or intersection.

It turns out that implementing these set-operation cursors is very cheap because of the restrictive interface (which allows only forward movement and no accurate counting). You don't need to gather all of the data before performing the set operation; you can do it completely on the fly. A couple of examples:

Union cursor (a code sketch follows this list):
    Current topic: find the "first" (lowest-ordered) topic pointed to by any of the sub-cursors and return it.
    Advance: advance every sub-cursor that is pointing at the current topic.
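
A sketch of the union case, continuing the hypothetical Cursor/Topic definitions from the earlier snippet:

    #include <vector>

    // Union meta-cursor: the current topic is the lowest-ordered topic any
    // child points to, and advancing moves every child that points at it.
    class UnionCursor : public Cursor {
    public:
        explicit UnionCursor(std::vector<Cursor*> children)
            : children_(std::move(children)) {}
        bool AtEnd() const override {
            for (const Cursor* c : children_)
                if (!c->AtEnd()) return false;
            return true;
        }
        Topic Current() const override {  // callers check AtEnd() first
            Topic best = Topic();
            bool found = false;
            for (const Cursor* c : children_)
                if (!c->AtEnd() && (!found || c->Current() < best)) {
                    best = c->Current();
                    found = true;
                }
            return best;
        }
        void Advance() override {
            const Topic cur = Current();
            for (Cursor* c : children_)
                if (!c->AtEnd() && c->Current() == cur) c->Advance();
        }
        void AdvanceTo(Topic ref) override {
            for (Cursor* c : children_)
                if (!c->AtEnd()) c->AdvanceTo(ref);
        }
    private:
        std::vector<Cursor*> children_;
    };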

Intersection cursor (a code sketch follows this list):
    Find the highest-ordered topic pointed to by any sub-cursor; call this the reference topic.
    Advance all other cursors until they point to (or after) the reference topic.
    If any cursor ends up pointing after the reference topic, use that topic as the new reference topic and start over.
    Once all cursors point at the reference topic, that's the current topic.
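
And a matching sketch of the intersection case, again reusing the hypothetical Cursor/Topic definitions:

    #include <vector>

    // Intersection meta-cursor: repeatedly take the highest-ordered child
    // topic as the reference and drag the other children forward to it,
    // until every child agrees. Assumes a non-empty child list.
    class IntersectionCursor : public Cursor {
    public:
        explicit IntersectionCursor(std::vector<Cursor*> children)
            : children_(std::move(children)) { Align(); }
        bool AtEnd() const override {
            for (const Cursor* c : children_)
                if (c->AtEnd()) return true;  // any exhausted child ends the intersection
            return false;
        }
        Topic Current() const override { return children_.front()->Current(); }
        void Advance() override { children_.front()->Advance(); Align(); }
        void AdvanceTo(Topic ref) override {
            for (Cursor* c : children_)
                if (!c->AtEnd()) c->AdvanceTo(ref);
            Align();
        }
    private:
        void Align() {
            while (!AtEnd()) {
                Topic ref = children_.front()->Current();
                for (const Cursor* c : children_)  // reference topic = highest-ordered current
                    if (c->Current() > ref) ref = c->Current();
                bool moved = false;
                for (Cursor* c : children_)
                    if (c->Current() < ref) { c->AdvanceTo(ref); moved = true; }
                if (!moved) return;  // all children point at ref: that's the current topic
            }
        }
        std::vector<Cursor*> children_;
    };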

When you request a topic list, your tag expression is compiled into a graph of tag and set-operation cursors, which is then consumed to produce the appropriate data.
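
Putting the sketches together, a hypothetical compilation of the expression "Starcraft & (LUE | TV)" plus the skip-50/take-50 paging mentioned earlier might look like this, reusing the TagCursor, UnionCursor, and IntersectionCursor types sketched above:

    #include <set>
    #include <vector>

    std::vector<Topic> TopicList(const std::set<Topic>& starcraft,
                                 const std::set<Topic>& lue,
                                 const std::set<Topic>& tv) {
        TagCursor sc(starcraft), lc(lue), tc(tv);
        UnionCursor lue_or_tv({&lc, &tc});           // LUE | TV
        IntersectionCursor expr({&sc, &lue_or_tv});  // Starcraft & (LUE | TV)

        // "Skip 50 topics and then give me 50 topics."
        for (int i = 0; i < 50 && !expr.AtEnd(); ++i) expr.Advance();
        std::vector<Topic> page;
        for (int i = 0; i < 50 && !expr.AtEnd(); ++i) {
            page.push_back(expr.Current());
            expr.Advance();
        }
        return page;
    }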

LlamaGuy

The whole thing is just built on top of std::set<>; there's nothing too fancy happening in the tagd code. Tags, full-text, topics of the moment, expressions, etc. are all under 1k lines of code, with another 400 lines of network code. Create a new set<> for each tag, and the rest is just iterating through sets in different ways. It turns out computers are very good at merging sorted lists.

The key is std::set's lower_bound() function. In the case of "Starcraft & LUE", what basically ends up happening is that you iterate through [Starcraft], calling lower_bound() on [LUE] with each result you get from [Starcraft]. If lower_bound() returns the same topic, you have a match; otherwise, go to the next [Starcraft] topic and continue.
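
A standalone toy version of that loop, with plain integers standing in for the real ordering key:

    #include <iostream>
    #include <set>

    int main() {
        // Toy data: each number stands in for a topic's position in the ordering.
        std::set<int> starcraft = {3, 8, 15, 21, 42};
        std::set<int> lue       = {1, 2, 3, 5, 8, 13, 21, 34};

        for (int topic : starcraft) {
            auto it = lue.lower_bound(topic);  // first [LUE] topic at or past this one
            if (it != lue.end() && *it == topic)
                std::cout << "in both: " << topic << '\n';
            // otherwise, just move on to the next [Starcraft] topic
        }
    }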

There's also an optimization: if a tag contains more than half of all topics ([LUE], [Archived]), it also keeps a set<> of the topics that are not tagged with it. Then if you do -LUE it's fast, instead of computing [Everything] - [LUE]. There are some other cases where using the inverse list is faster as well.
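
A sketch of that bookkeeping, with hypothetical names:

    #include <cstddef>
    #include <set>

    // Very common tags ([LUE], [Archived]) also maintain the set of topics
    // they do NOT appear on, so "-LUE" can be iterated directly instead of
    // being computed as [Everything] - [LUE] on the fly.
    struct TagData {
        std::set<long long> topics;   // topics carrying the tag
        std::set<long long> inverse;  // topics not carrying it; kept only when worthwhile
        bool keeps_inverse = false;
    };

    // The threshold described above: keep the inverse once the tag covers
    // more than half of all topics.
    bool ShouldKeepInverse(std::size_t tag_size, std::size_t total_topics) {
        return 2 * tag_size > total_topics;
    }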

Oh yeah, and to estimate the total topic count for an expression, it manually iterates through the first 2500 topics, takes the time delta between the first topic found (the most recent) and the 2500th topic, and then keeps exponentially fast-forwarding (using lower_bound()), which gives a reasonable estimate. This assumes that the distribution of topics over time is fairly constant, which isn't always true.
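
The extrapolation step boils down to a little arithmetic; a hedged sketch of just that piece (the exponential fast-forwarding that finds the overall time span is omitted):

    // Hypothetical linear extrapolation: if the 2,500 most recent matching
    // topics span `sample_seconds` of posting time and the expression's
    // matches span `total_seconds` overall, assume a roughly constant rate.
    double EstimateTopicCount(double sample_seconds, double total_seconds) {
        const double sampled = 2500.0;
        const double topics_per_second = sampled / sample_seconds;
        return topics_per_second * total_seconds;  // off when the posting rate isn't constant
    }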
