Lucene Fundamentals

A set of Lucene fundamentals that are useful for grokking Elasticsearch.

Jargon Glossary

  • document: a record; the unit of search; the thing returned as search results
  • field: a typed slot in a document for storing and indexing values
  • index: a collection of documents, typically with the same field mappings or schema
  • corpus: the entire set of documents in an index
  • term: a value extracted from the source documents used for building the inverted index
  • inverted index: Lucene's internal data structure that maps terms to documents by ID
  • uninverted index: aka field data, a data structure that contains a list of all field values per field
  • doc values: an alternative way of storing the uninverted index on disk rather than in-memory

OK, that gets some jargon out of the way.

Let's start simple with three documents: doc1 and doc2 both contain the text "big data", and doc3 contains "small data".

Inverted Index

Instead of storing:

doc1={"tag": "big data"}
doc2={"tag": "big data"}
doc3={"tag": "small data"}

We can store the "inverted index" instead:

//inverted index for tag field//
"big"=doc1,doc2
"data"=doc1,doc2,doc3
"small"=doc3

Now, when I search for "big", I get back doc1 and doc2. When I search for "small", I get back doc3. When I search for "data", I get back all three documents. This is basically the core data structure in Lucene, and in search in general.
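
Here's a minimal sketch of that inverted index in Python (plain dictionaries for illustration, not Lucene's actual data structures):

docs = {
    "doc1": {"tag": "big data"},
    "doc2": {"tag": "big data"},
    "doc3": {"tag": "small data"},
}

inverted = {}
for doc_id, doc in docs.items():
    for term in doc["tag"].split():          # basic whitespace tokenization
        inverted.setdefault(term, []).append(doc_id)

print(inverted["big"])    # ['doc1', 'doc2']
print(inverted["small"])  # ['doc3']
print(inverted["data"])   # ['doc1', 'doc2', 'doc3']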

Terms

In the above documents, if I do basic tokenization, I have three "terms": "big", "small", and "data".

If I were not to analyze those fields, I'd have two terms: "big data" and "small data".

If I use a multi-field, I can store this data both ways.

Notice that if I decide not to analyze a field, but I do decide to store it (in a Lucene "stored" field, or in Elasticsearch's "_source" field), then I am essentially storing the data twice: once in the inverted index, and once in "storage" as well.
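
Here's a rough sketch of the analyzed vs. not-analyzed distinction (the helper functions are hypothetical, not Lucene or Elasticsearch API); a multi-field simply indexes both forms of the same value:

docs = {"doc1": "big data", "doc2": "big data", "doc3": "small data"}

def analyzed_terms(text):
    return text.lower().split()   # basic tokenization: one term per word

def keyword_terms(text):
    return [text]                 # no analysis: the whole value is one term

for doc_id, tag in docs.items():
    print(doc_id, analyzed_terms(tag), keyword_terms(tag))
# doc1 ['big', 'data'] ['big data']
# doc2 ['big', 'data'] ['big data']
# doc3 ['small', 'data'] ['small data']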

Terms are interesting when you have data that repeats frequently among your documents. For example, imagine the same corpus as above, but where you have 1,000 documents with exactly the same values. In this case you might have:

"data"=1,2,3,...,1000
"big"=1,3,5,7,9,...,999

In this case, the inverted index would only store the terms "big" and "data" once each, but it would store the doc ids repeatedly (which should be cheaper than storing the whole document repeatedly). This listing of terms => docids is also sometimes called the "postings" list.
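
The same toy model shows why this is cheap: each term is stored once, and only the (small) doc ids repeat in its postings list:

postings = {
    "data": list(range(1, 1001)),     # docs 1..1000
    "big":  list(range(1, 1000, 2)),  # odd docs 1, 3, 5, ..., 999
}
print(len(postings), "terms,", sum(len(ids) for ids in postings.values()), "postings")
# 2 terms, 1500 postings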

Numeric Terms and Term Ranges

The reason fields need to have types is because the way we index field values into terms can dramatically affect how we can query those fields.

Text is not the only thing that gets broken into terms -- numeric and date field values can be, as well. This is a bit mind-bending, as terms feel like a "texty" concept.

Here's a snippet from Lucene in Action on the topic: "If you indexed your field with NumericField, you can efficiently search a particular range for that field using NumericRangeQuery. Under the hood, Lucene translates the requested range into the equivalent set of brackets in the previously indexed trie structure."

What Trie structure? Well, Lucene creates one on-the-fly for numerics. From Lucene's Javadocs: "Numerical values in a special string-encoded format with variable precision -- all numerical values like doubles, longs, floats, and ints are converted to lexicographic sortable string representations and stored with different precisions."

A range is then divided recursively into multiple intervals for searching. The center of the range is searched only with the lowest possible precision in the trie, while the boundaries are matched more exactly. As the Javadocs explain, the idea is to reduce the discrete numeric data set to a number of "term ranges". As they say, "This reduces the number of terms dramatically. For the variant that stores long values in 8 different precisions (each reduced by 8 bits) that uses a lowest precision of 1 byte, the index contains only a maximum of 256 distinct values in the lowest precision. Overall, a range could consist of a theoretical maximum of 7*255*2 + 255 = 3825 distinct terms - when there is a term for every distinct value of an 8-byte-number in the index and the range covers almost all of them. A maximum of 255 distinct values is used because it would always be possible to reduce the full 256 values to one term with degraded precision. In practice, we have seen up to 300 terms in most cases -- e.g. index with 500,000 metadata records and a uniform value distribution."

Pretty magical, huh? The clever thing here is that because numeric values can be converted to "term ranges", this same magic works on dates, as well. The dates are converted to numbers, the numbers are then converted into term ranges. Thus, even though you might be searching through 1 million "minutes" of data, you would only be searching through a few hundred "minute ranges" in the inverted index. If you need a visual explanation of numeric range queries, check out this slideshow.

Since each bracket is a distinct term, all Lucene has to do is to "OR" them together to query a larger and larger range. The number of brackets required will be relatively small vs the total number of field values -- this is what gives it such good performance.
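
Here's a very simplified sketch of the idea in Python. It illustrates multi-precision prefix terms only, not Lucene's actual NumericUtils range-splitting algorithm; the precision step of 8 bits follows the Javadoc example above:

PRECISION_STEP = 8   # bits dropped per precision level, as in the Javadoc example

def numeric_terms(value, bits=32):
    # One term per precision level: each level drops the lowest `shift` bits,
    # so a single term at shift=8 "covers" a bracket of 256 consecutive values.
    return [(shift, value >> shift) for shift in range(0, bits, PRECISION_STEP)]

print(numeric_terms(70000))
# [(0, 70000), (8, 273), (16, 1), (24, 0)]

# A query over the range [0, 1023] does not need 1024 exact-value terms;
# at shift=8 it is just four bracket terms OR'ed together:
brackets = [(8, prefix) for prefix in range(1024 >> 8)]
print(brackets)
# [(8, 0), (8, 1), (8, 2), (8, 3)]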

Uninverted Index

So, let's dig into a hypothetical "pageviews" field. We might have values ranging from 1 to 10 million in there. On a segment-by-segment basis, we have an array that looks something like this:

pageviews=[1,1000,5000,1000000,123,...]

If you need to execute a sum() on this array, Elasticsearch will basically do an in-memory array sum. If you need to execute a sum() on some subset of this array, it's going to use bitsets to filter the array down first.

Let's say that you want to only sum pageviews from documents that match a specific author. In this case, the full array of pageviews on that segment will be compared against a bitset that might look as follows:

pageviews=[1,1000,5000,1000000,123,...]
specific_author=[0,1,0,1,0,...]
filtered_views=[0,1000,0,1000000,0,...]

As you can see, the pageviews array was gated through the specific_author bitset (a saved filter), and the result was an array of filtered views. This might even be a sparse array, where most of the values are 0 and the only actual values come from matching documents. But in any case, this can be done very efficiently in-memory, and the result is a filtered sum that is exactly what we need.
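
Here's a toy version of that filtered aggregation, assuming the field data for one segment has already been uninverted into a plain array:

pageviews       = [1, 1000, 5000, 1000000, 123]  # one value per doc in the segment
specific_author = [0, 1,    0,    1,       0]    # cached filter bitset: 1 = doc matches

# Gate the values through the bitset, then sum whatever survives.
filtered_views = [views if bit else 0 for views, bit in zip(pageviews, specific_author)]
print(filtered_views)       # [0, 1000, 0, 1000000, 0]
print(sum(filtered_views))  # 1001000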

This makes it clear why it's valuable to have the uninverted index (field data, field cache) in-memory.

But that means you need to store every field value in memory, which leads us to doc values.

Doc Values

Storing every single field value in memory is fast, but prohibitively expensive. Doc values are basically a Lucene hack that borrows the idea of Cassandra-style "columnar" data storage, but keeps it all on disk.

Let's look back at our pageviews array. Rather than storing:

pageviews=[1,1000,5000,1000000,123,...]

in-memory, it will create a file that basically just has the values splatted out in column-stride format:

1
1000
5000
1000000
123
...

When the segment needs to perform a sum() on this data, it will simply do a straightforward sequential scan of the file. Though it won't put all the data in memory, this scan will signal to the Linux kernel that the disk data is hot, and Linux will automatically use page cache to store the disk blocks. To quote a kernel developer, "For example, when you read a 100-megabyte file twice, once after the other, the second access will be quicker, because the file blocks come directly from the Page Cache in memory and do not have to be read from the hard disk again."
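
Here's a rough sketch of the column-stride idea (a plain text file for illustration; real doc values use a compact binary encoding):

pageviews = [1, 1000, 5000, 1000000, 123]

# Write one value per document, in doc-id order (the "column").
with open("pageviews.dv", "w") as f:
    for value in pageviews:
        f.write(f"{value}\n")

# Aggregate by streaming the file instead of holding the whole array in memory;
# the kernel's page cache keeps the hot blocks around for repeated scans.
total = 0
with open("pageviews.dv") as f:
    for line in f:
        total += int(line)

print(total)  # 1006124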
