@mschoch
Created August 5, 2016 19:32

Object Allocation and Copying during Search

An analysis of the objects needed to perform a search, when they come into existence and when they become garbage.

Example Case

We will consider a search for documents containing cat or dog in a field body. This search is complex enough to illustrate some issues not visible with a simpler search, but still small enough to approach.

It is assumed that the field body has field id of 0.

Tiers of Functionality

The hierarchy of objects doing the actual work is organized as shown below. Note that there are some intermediary structures, like the IndexReader and KV Reader, which are used to build the Searchers and Iterators respectively, but they are not central to this discussion.

  • TopScoreCollector (top 10 hits)
    • DisjunctionSearcher (cat or dog in field body)
      • TermSearcher (cat in field body)
        • TermFieldReader (cat in field body)
          • KV PrefixIterator (t00cat)
      • TermSearcher (dog in field body)
        • TermFieldReader (dog in field body)
          • KV PrefixIterator (t00dog)

High Level Information Flow

If we ignore the implementation details and just discuss how data flows through these structures we see the following.

  • KV Iterators give access to the key and value currently being pointed to as []byte.

  • The TermFieldReader parses the underlying key/value bytes and returns a structured representation called a TermFieldDoc. This structure says that in field F term T occurs in document D.

  • The TermSearcher examines the TermFieldDoc in the context of the entire index, and creates a new structure known as a DocumentMatch. At this stage it may appear that the DocumentMatch simply reproduces the information already in the TermFieldDoc, but one new aspect appears: a field known as the Score. This is a measure of how well this document matches the search. The score can only be calculated at this point because it requires additional information about the whole index that wasn't available to the TermFieldReader. (Rough sketches of both structures follow this list.)

  • The DisjunctionSearcher is responsible for operating two TermSearcher instances simultaneously. This makes the real purpose of the DocumentMatch structure more apparent. It is not just adding a score to a TermFieldDoc; it allows the aggregation of multiple reasons that a particular document matched the search. In this case, if a document used cat and dog, a single DocumentMatch will be returned which includes the locations of both cat and dog.

  • The TopScoreCollector is responsible for keeping track of the top N (10) hits that it has seen thus far. So, in this case, at any one time it has references to at most 10 DocumentMatches. But, exactly when a particular DocumentMatch will become garbage is unknown.
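To make the rest of the discussion concrete, here is a rough sketch of what the two structures might carry. These are simplified, hypothetical definitions for illustration only; the real types in the index and search packages carry more fields (frequencies, norms, term vectors, location details).

```go
// Hypothetical, simplified shapes of the two structures discussed above.

// TermFieldDoc: term T occurs in field F of document D, decoded from a KV pair.
type TermFieldDoc struct {
    Term  string
    Field string
    ID    []byte // internal document identifier
}

// DocumentMatch: a document that satisfied (some part of) the search, plus a score.
type DocumentMatch struct {
    ID    []byte // internal document identifier
    Score float64
    // aggregated per-term match details (e.g. locations) would also live here
}
```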

Naïve implementation

For the simplest case, assume everything returned is a brand new allocation. Next, let's assume there are 100 documents using the term cat, 200 documents using the term dog, and 50 documents using both.

Thus, the allocations might look like:

  • TopScoreCollector returns 10 DocumentMatches
    • DisjunctionSearcher returns 250 DocumentMatches
      • TermSearcher (cat) returns 100 DocumentMatches
        • TermFieldReader (cat) returns 100 TermFieldDocs
          • KV PrefixIterator (t00cat) returns 100 []byte keys
      • TermSearcher (dog) returns 200 DocumentMatches
        • TermFieldReader (dog) returns 200 TermFieldDocs
          • KV PrefixIterator (t00dog) returns 200 []byte keys

So, in total we allocated 560 DocumentMatches, 300 TermFieldDocs, and 300 []byte keys.

So, we can see how this can be a problem. We are allocating a lot of objects, and the count is bounded not by the structure of the query, but by the volume of items matching the query. This means that the larger the dataset gets, or the more documents the search matches, the slower the search will perform.

Slightly More Real World

That oversimplified model is not quite how things actually work but was presented for clarity. Two details are worth clarifying.

  1. The TopScoreCollector doesn't actually allocate new DocumentMatches; it knows it is the end of the line, and just keeps references to the items returned to it by the searcher.

  2. The KV Iterator API does not copy bytes for us. The existing API is very clear that you cannot use those bytes beyond the current state of the iterator (calls to Next()/Seek()/Close() will reuse or destroy the bytes).

So, a slightly updated model looks like:

  • TopScoreCollector holds references to the top 10 DocumentMatches
    • DisjunctionSearcher returns 250 DocumentMatches
      • TermSearcher (cat) returns 100 DocumentMatches
        • TermFieldReader (cat) returns 100 TermFieldDocs and 100 []bytes
          • KV PrefixIterator (t00cat) 1 []byte reused constantly
      • TermSearcher (dog) returns 200 DocumentMatches
        • TermFieldReader (dog) returns 200 TermFieldDocs and 200 []bytes
          • KV PrefixIterator (t00dog) 1 []byte reused constantly

Not a huge change, but there are some important differences to understand before moving on. Most notably, the TermFieldReader is now allocating byte slices AND copying data from the iterator byte slice into the newly allocated one.
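In code, the per-hit work in the TermFieldReader at this stage looks roughly like the following. This is a hedged sketch with hypothetical type and field names (KVIterator, iter, prefixLen), not the actual reader implementation, and it ignores decoding frequency and term vectors from the value bytes.

```go
// Hypothetical iterator contract: Current returns the key/value currently
// pointed at; the bytes are only valid until the next Next/Seek/Close call.
type KVIterator interface {
    Current() (key, val []byte, valid bool)
    Next()
}

// Hypothetical reader state: its iterator, the length of the key prefix
// (e.g. len("t00cat")), the term this reader is for, and a TermFieldDoc that
// the reuse version further below will recycle.
type TermFieldReader struct {
    iter      KVIterator
    prefixLen int
    term      string
    doc       TermFieldDoc
}

// Sketch of the current behavior: the document ID portion of the key is copied
// into a freshly allocated slice, and a fresh TermFieldDoc is allocated, on
// every single hit.
func (r *TermFieldReader) Next() (*TermFieldDoc, error) {
    key, _, valid := r.iter.Current()
    if !valid {
        return nil, nil // no more entries under this term's prefix
    }
    id := make([]byte, len(key)-r.prefixLen) // allocation per hit
    copy(id, key[r.prefixLen:])
    r.iter.Next()
    return &TermFieldDoc{Term: r.term, ID: id}, nil // another allocation per hit
}
```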

Improvement: TermFieldReader can reuse TermFieldDoc

The observation is that a TermFieldReader really only ever needs to manage one TermFieldDoc at a time. This works because the TermSearcher above will be converting it into a DocumentMatch anyway.

But, there is another catch: the []byte in the TermFieldDoc representing the document's internal identifier must be copied again if we are to allow full reuse of the already allocated TermFieldDoc.

So, the upside of doing this is that the TermFieldReader will be able to do just a single allocation for a reusable TermFieldDoc, and a (hopefully) single allocation for the []byte representing the document's internal identifier. We say hopefully because, if larger document identifiers are encountered, we may need to grow the slice. (Or we could always allocate some default size we hope to be sufficient...)
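A sketch of that change, continuing with the hypothetical names from the previous snippet: the reader keeps a single TermFieldDoc (the doc field) and grows its ID buffer only when a longer identifier shows up.

```go
// Sketch: the reader recycles one TermFieldDoc and one ID buffer on every call.
func (r *TermFieldReader) Next() (*TermFieldDoc, error) {
    key, _, valid := r.iter.Current()
    if !valid {
        return nil, nil
    }
    idLen := len(key) - r.prefixLen
    if cap(r.doc.ID) < idLen {
        r.doc.ID = make([]byte, idLen) // only allocates when a larger ID is seen
    }
    r.doc.ID = r.doc.ID[:idLen]
    copy(r.doc.ID, key[r.prefixLen:])
    r.doc.Term = r.term
    r.iter.Next()
    return &r.doc, nil // the same TermFieldDoc every time; callers must not retain it
}
```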

So, with this change the model looks like:

  • TopScoreCollector holds references to the top 10 DocumentMatches
    • DisjunctionSearcher returns 250 DocumentMatches
      • TermSearcher (cat) returns 100 DocumentMatches
        • TermFieldReader (cat) 1 TermFieldDoc and 1 []byte reused constantly
          • KV PrefixIterator (t00cat) 1 []byte reused constantly
      • TermSearcher (dog) returns 200 DocumentMatches
        • TermFieldReader (dog) 1 TermFieldDoc and 1 []byte reused constantly
          • KV PrefixIterator (t00dog) 1 []byte reused constantly

So, now the PrefixIterator and TermFieldReader have 0 allocations in the hot path. But, we are copying the same data twice now: once from the iterator into the TermFieldDoc, and again from the TermFieldDoc into the DocumentMatch.

Another Improvement: Collectors/Searchers Only Require a Fixed Number of DocumentMatches

In the simplest case, Searchers only ever return 1 DocumentMatch at a time (via a call to Next or Advance). In the more complex cases, Searchers may need to track multiple DocumentMatches at the same time, but this is always a fixed number, having to do with the structure of the query.

For example, a TermSearcher has no references to other Searchers, so it only has to worry about the DocumentMatches that it is returning. If every call to Next/Advance provided an already allocated DocumentMatch, it could simply fill it in, as opposed to allocating something new.
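As a hypothetical sketch of what that API change might look like (this is not the current Searcher interface; prealloc and the placeholder scoring are assumptions for illustration):

```go
// Hypothetical: the caller supplies the DocumentMatch to be filled in.
type Searcher interface {
    Next(prealloc *DocumentMatch) (*DocumentMatch, error)
    Advance(id []byte, prealloc *DocumentMatch) (*DocumentMatch, error)
}

// Hypothetical searcher state: just its reader, for this sketch.
type TermSearcher struct {
    reader *TermFieldReader
}

// The TermSearcher populates what it was handed instead of allocating.
func (s *TermSearcher) Next(prealloc *DocumentMatch) (*DocumentMatch, error) {
    tfd, err := s.reader.Next()
    if err != nil || tfd == nil {
        return nil, err
    }
    rv := prealloc
    if rv == nil {
        rv = &DocumentMatch{} // fall back to allocating if nothing was provided
    }
    rv.ID = append(rv.ID[:0], tfd.ID...) // copy the ID bytes into the reused buffer
    rv.Score = 1.0                       // placeholder; real scoring uses index-wide stats
    return rv, nil
}
```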

In a more complex case like the DisjunctionSearcher, its internal state requires it to track N DocumentMatches simultaneously, where N is the number of disjunction clauses. In this example case it is 2. So, it could preallocate 2 DocumentMatch instances. When it calls Next/Advance on the child TermSearchers, it simply provides references to these already allocated structures.

At the top level, the TopScoreCollector knows it only ever needs to remember 11 DocumentMatches. Why 11? Well, it has to remember the top 10 seen so far, plus one more for the next incoming result. When a DocumentMatch falls out of the top 10, that instance can be reused.

All of this sounds great, but there is another catch. Because each of these pre-allocated DocumentMatches has a []byte corresponding to the internal document identifier, it too would have to be copied so that these buffers can be reused without requiring additional allocation each time.

Let's assume for now that we do that; this is what it looks like:

  • TopScoreCollector allocates 11 DocumentMatches
    • DisjunctionSearcher allocates 2 DocumentMatches
      • TermSearcher (cat) allocates 1 DocumentMatch
        • TermFieldReader (cat) 1 TermFieldDoc and 1 []byte reused constantly
          • KV PrefixIterator (t00cat) 1 []byte reused constantly
      • TermSearcher (dog) allocates 1 DocumentMatch
        • TermFieldReader (dog) 1 TermFieldDoc and 1 []byte reused constantly
          • KV PrefixIterator (t00dog) 1 []byte reused constantly

As you can see, we now allocate far fewer DocumentMatches, TermFieldDocs, and []bytes.

But... there is a new cost: at each step of the way we have to copy the []byte from one DocumentMatch to another. This is because the DocumentMatches are no longer flowing through the system; they are placeholders through which we copy data.

My current thinking is that this is still a win. The memory required to perform a search is much lower, and it is proportional to the complexity of the query, not the amount of data matched. And while there is an additional CPU/memory cost to copying these []bytes, that cost too is proportional to the complexity of the query. Simple term searches have 2 copies (from the Iterator to the TermFieldDoc and from the TermFieldDoc to the DocumentMatch). The query we're looking at would have 2 for each of the underlying term searchers, plus one additional copy in the DisjunctionSearcher.

Can we still do better?

The final observation is that the number of []bytes actually required is still fixed. What if we were able to pass these []bytes through the system (instead of copying them), but still reuse them?

Technical Aside: How would that work? Instead of copying the []byte from the upstream DocumentMatch into the downstream DocumentMatch, we would simply copy the reference. The upstream reference would then be set to nil, so the same bytes cannot end up owned in two places; the upstream placeholder would need a fresh buffer before it could be reused. No reallocation happens along the chain itself, because the whole chain is copying references, not data. Only at the very start of the chain would we either need to allocate (bad, want to avoid) or get a buffer from a pool or freelist somehow.
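In terms of the hypothetical DocumentMatch above, that hand-off is just a couple of assignments; a minimal sketch:

```go
// Sketch: move the ID bytes by reference instead of copying them.
// After this call, the source must obtain a fresh (or pooled) buffer before it
// can be reused; the destination now owns the bytes.
func transferID(dst, src *DocumentMatch) {
    dst.ID = src.ID // copy the reference, not the data
    src.ID = nil    // the source no longer owns these bytes
}
```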

The challenge here is that we would need brand new []bytes in the TermFieldReader, but these slices end their lifetimes in a variety of places. For example, in the DisjunctionSearcher, sometimes both TermSearchers return a DocumentMatch pointing to the same internal index identifier. Only one needs to flow up into the DocumentMatch being returned; the other is then eligible for reuse.

Further, some of the index internal identifiers that flow all the way up to the TopScoreCollector eventually get bumped out of the top 10, and at that time those []bytes would become eligible for reuse.

One candidate solution is the use of sync.Pool to track index internal []bytes that can be reused. sync.Pool offers a thread-safe free list that is hooked into the garbage collector, allowing unused items to be reclaimed automatically. However, the thread-safety, though optimized, still has synchronization overhead, and so it is not recommended for pooling objects with a "short life". Of course, no definition of "short life" is provided. That being said, previous attempts to use sync.Pool have not panned out, so we suspect our use in this case might also be "too short lived".
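For comparison, a sync.Pool of ID buffers would look roughly like the sketch below (the buffer size and helper names are assumptions). The concern is that Get/Put would run once or twice per hit, which may be exactly the kind of short-lived use where the pool's overhead outweighs the saved allocations.

```go
import "sync"

// A pool of reusable ID buffers; New runs only when the pool has nothing to hand out.
var idPool = sync.Pool{
    New: func() interface{} {
        return make([]byte, 0, 16) // assumed typical maximum ID size
    },
}

func getID() []byte  { return idPool.Get().([]byte)[:0] }
func putID(b []byte) { idPool.Put(b) }
```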

So, the next idea would be for us to have our own free list. In theory it could avoid any locking, since currently an entire search takes place in a single goroutine (the multiple iterators are iterated in lock-step, not concurrently).


mschoch commented Aug 5, 2016

So, thinking about this a bit more while I ate lunch...

What if the index package had a new structure? (It goes in the index package because search already imports index, not the other way around, and both will need it.)

Something like:

```go
type IndexInternalIDPool struct{}

func (i *IndexInternalIDPool) Get() IndexInternalID   { return nil /* sketch */ }
func (i *IndexInternalIDPool) Put(id IndexInternalID) { /* sketch */ }
```

Then, we'd update the Searcher API, to have a method like:

EstimateIDCount() int

A terrible name, but the idea is that each searcher figures out how many IDs it might need, adds that to what all of its child searchers might need, and you get a total (hopefully sufficient) count of how many IDs you'll be managing concurrently.

Then it will call index.NewIndexInternalIDPool(count).

This will preallocate. There are a couple of ways to do this that will require further discussion. The point is that we hopefully do one large up-front allocation, assuming some typical "max ID size". If we underestimate that, performance should degrade, not break.

Now, we have to pass a reference to this pool throughout (unfortunate), but at the very bottom, where all IDs are first copied out of iterators, we call Get(). Then any time we stop using an ID, we pass it to Put().

I don't think we require locks inside this structure because the entire Search operation takes place in one goroutine.
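Here is a minimal sketch of what that might look like, assuming IndexInternalID is just a named []byte and assuming some typical maximum ID size; none of these names or sizes are settled, it's just to make the Get()/Put() flow concrete.

```go
// Assumes something like: type IndexInternalID []byte

// typicalIDSize is a placeholder for the assumed "max ID size" mentioned above.
const typicalIDSize = 16

// IndexInternalIDPool is a sketch of a single-goroutine free list of ID buffers.
// No locking, on the assumption that an entire search runs in one goroutine.
type IndexInternalIDPool struct {
    free []IndexInternalID
}

// NewIndexInternalIDPool does one large up-front allocation and carves it into
// count buffers of typicalIDSize bytes each.
func NewIndexInternalIDPool(count int) *IndexInternalIDPool {
    p := &IndexInternalIDPool{free: make([]IndexInternalID, 0, count)}
    backing := make([]byte, count*typicalIDSize)
    for i := 0; i < count; i++ {
        p.free = append(p.free, backing[i*typicalIDSize:i*typicalIDSize:(i+1)*typicalIDSize])
    }
    return p
}

// Get returns an empty buffer to copy an ID into. If the estimate was too low,
// it falls back to allocating: performance degrades, but nothing breaks.
func (p *IndexInternalIDPool) Get() IndexInternalID {
    if len(p.free) == 0 {
        return make(IndexInternalID, 0, typicalIDSize)
    }
    id := p.free[len(p.free)-1]
    p.free = p.free[:len(p.free)-1]
    return id[:0]
}

// Put hands a buffer that is no longer in use back to the free list.
func (p *IndexInternalIDPool) Put(id IndexInternalID) {
    p.free = append(p.free, id)
}
```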

Thoughts?
