@rhamlin
Created March 3, 2021 18:35

Implementing Solr DocValues:

In Solr, you can optimize index load times and memory utilization by implementing DocValues.

Without DocValues, a field's values are kept in two separate structures: the stored fields and the term index. DocValues are much more compact than either of these (this can also be seen by examining the sizes of the index and stored-fields files). Because the DocValues for a single field are stored contiguously, very efficient packing algorithms can be applied, so the data compresses better than values that are sparsely laid out, as they are when only stored fields are used.

DocValues are column-oriented: the values of a DocValues field are densely packed into columns instead of being sparsely stored the way they are with stored fields.

Simple example of stored fields vs docValues:

row-oriented (stored fields):

{
  'doc1': {'A':1, 'B':2, 'C':3},
  'doc2': {'A':2, 'B':3, 'C':4},
  'doc3': {'A':4, 'B':3, 'C':2}
}

column-oriented (docValues):

{
  'A': {'doc1':1, 'doc2':2, 'doc3':4},
  'B': {'doc1':2, 'doc2':3, 'doc3':3},
  'C': {'doc1':3, 'doc2':4, 'doc3':2}
}
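
To make the difference concrete, here is a minimal Python sketch (an illustration only, not how Lucene lays out data on disk) that pivots the row-oriented example above into the column-oriented form and then counts facet values by scanning a single column. The helper names `to_columns` and `facet_counts` are made up for this example.

```python
from collections import Counter

# Row-oriented view: one record per document (like stored fields).
rows = {
    "doc1": {"A": 1, "B": 2, "C": 3},
    "doc2": {"A": 2, "B": 3, "C": 4},
    "doc3": {"A": 4, "B": 3, "C": 2},
}


def to_columns(rows):
    """Pivot a doc -> {field: value} mapping into field -> {doc: value}."""
    columns = {}
    for doc_id, fields in rows.items():
        for name, value in fields.items():
            columns.setdefault(name, {})[doc_id] = value
    return columns


def facet_counts(columns, field):
    """Count the values of one field by scanning only that field's column."""
    return Counter(columns[field].values())


columns = to_columns(rows)
print(columns["B"])
print(facet_counts(columns, "B"))
```

Faceting on `B` only touches the `B` column. With the row-oriented layout, the same count would have to open every document record and skip over `A` and `C`.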

Details:

When Solr/Lucene returns a set of document ids from a query, it uses the row-oriented (i.e. stored fields) view of the documents to retrieve the actual field values. This requires very few seeks, since all of a document's field data is stored close together in the fields data file. However, when faceting, grouping, filtering, or sorting functions are performed, Lucene needs to iterate over every document to collect the field values. Without DocValues, this is achieved by uninverting the term index. Once built, the uninverted structure performs very well, since the field values are already grouped (by nature of the index), but it is slow to load and is held in memory (i.e. it is memory-expensive).

By comparison, DocValues load much faster and consume significantly less memory, both during loading and after garbage collection. Leveraging DocValues in your Solr schema is more efficient, and it can also increase the number of fields you can facet, group, filter, or sort on without increasing your memory requirements (i.e. better scalability).
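
As a rough illustration of what "uninverting" means, the Python sketch below starts from a toy inverted index (term to document ids) for a single-valued field and rebuilds the per-document value array needed for sorting. This is a deliberate simplification: the terms `OPEN` and `CLOSED`, the document ids, and the function names are hypothetical, and Lucene's real structures are considerably more involved.

```python
# Toy inverted index for one single-valued field: term -> list of document ids.
# The terms and ids are made-up sample data.
inverted_index = {
    "CLOSED": [0, 3],
    "OPEN": [1, 2, 4],
}


def uninvert(inverted_index, num_docs):
    """Build a doc-id -> term array by walking every postings list."""
    values = [None] * num_docs
    for term, doc_ids in inverted_index.items():
        for doc_id in doc_ids:
            values[doc_id] = term
    return values


def sort_hits(hits, values):
    """Sort matching doc ids by their field value using the uninverted array."""
    return sorted(hits, key=lambda doc_id: values[doc_id])


# The whole array has to be built, and kept in memory, before the first sort.
values = uninvert(inverted_index, num_docs=5)
print(sort_hits([4, 0, 2, 3], values))
```

The cost described above shows up in `uninvert`: every postings list is walked up front, and the resulting array lives on the heap for as long as it is needed. DocValues avoid this step because the per-document values are written in that shape at index time.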


Heap usage with DocValues (versus without them) from our internal benchmark testing:

[Image: doc-Values-memory]


A caveat to using DocValues is that the size of your indexes on disk will increase, because the DocValues are written in addition to the existing index data. If you can provide your current index sizes, that will help establish a baseline for what we're looking at. You will also need to make sure you have enough available storage space for the increase in index size.
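
One way to get that baseline is to total the files in an index directory by type. The sketch below is a hedged example: it groups files by a handful of common Lucene extensions (`.fdt`/`.fdx` for stored fields, `.dvd`/`.dvm` for DocValues, `.tim`/`.tip` for the term dictionary). The list is partial and varies by Lucene version, segments written as compound files (`.cfs`) will simply show up under "other", and the function name is made up for this example.

```python
import os
from collections import defaultdict

# Partial mapping of Lucene file extensions to what they hold (an assumption:
# adjust for the Lucene version and codec you are actually running).
CATEGORIES = {
    ".fdt": "stored fields",
    ".fdx": "stored fields",
    ".dvd": "doc values",
    ".dvm": "doc values",
    ".tim": "term dictionary",
    ".tip": "term dictionary",
}


def index_size_by_category(index_dir):
    """Return total bytes per category for the files directly in index_dir."""
    totals = defaultdict(int)
    for name in os.listdir(index_dir):
        path = os.path.join(index_dir, name)
        if not os.path.isfile(path):
            continue  # ignore sub-directories
        ext = os.path.splitext(name)[1]
        totals[CATEGORIES.get(ext, "other")] += os.path.getsize(path)
    return dict(totals)
```

Running `index_size_by_category` against the same index directory before and after re-indexing gives a rough picture of how much of the growth comes from the DocValues files.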

How to include DocValues in your Solr Schemas

Below is an example (snippet) of some of the field declarations in a schema file:

<fields>
  <field indexed="true" multiValued="false" name="feeMargin" stored="true" type="DecimalStrField"/>
  <field indexed="true" multiValued="false" name="contractSubType" stored="true" type="StrField"/>
  <field indexed="true" multiValued="false" name="tradeDate" stored="true" type="TrieDateField"/>
  <field indexed="true" multiValued="false" name="status" stored="true" type="StrField"/>
  <field indexed="true" multiValued="false" name="economics" stored="true" type="BinaryField"/>
  <field indexed="true" multiValued="false" name="snapshotVersion" stored="true" type="TrieLongField"/>
</fields>

To implement DocValues you will need to modify the field declarations as follows: (note the docValues="true" addition to each field declaration)

<fields>
  <field indexed="true" multiValued="false" name="feeMargin" stored="true" docValues="true" type="DecimalStrField"/>
  <field indexed="true" multiValued="false" name="contractSubType" stored="true" docValues="true" type="StrField"/>
  <field indexed="true" multiValued="false" name="tradeDate" stored="true" docValues="true" type="TrieDateField"/>
  <field indexed="true" multiValued="false" name="status" stored="true" docValues="true" type="StrField"/>
  <field indexed="true" multiValued="false" name="economics" stored="true" docValues="true" type="BinaryField"/>
  <field indexed="true" multiValued="false" name="snapshotVersion" stored="true" docValues="true" type="TrieLongField"/>
</fields>

NOTE:

  1. docValues should NOT be used on fields that have the type definition: type=TextField

  2. Once your schemas have been modified to include docValues, you will need to re-index, which may take some time (depending on the size and number of your indexes).
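
If a schema has many fields, the edit can be scripted. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`. It assumes the `type` attribute carries the type name directly, as in the snippets above, and it skips `TextField` per note 1. The function name `enable_doc_values` and the `notes` field in the sample are hypothetical.

```python
import xml.etree.ElementTree as ET

# Field types that must not get docValues (see note 1 above).
SKIP_TYPES = {"TextField"}


def enable_doc_values(schema_xml, skip_types=SKIP_TYPES):
    """Add docValues="true" to every <field> whose type is not skipped.

    Returns the updated XML text and the names of the fields that changed.
    """
    root = ET.fromstring(schema_xml)
    changed = []
    for field in root.iter("field"):
        if field.get("type") in skip_types:
            continue
        if field.get("docValues") != "true":
            field.set("docValues", "true")
            changed.append(field.get("name"))
    return ET.tostring(root, encoding="unicode"), changed


# Small sample; "notes" is a made-up field to show the TextField exclusion.
sample = """<fields>
  <field indexed="true" multiValued="false" name="status" stored="true" type="StrField"/>
  <field indexed="true" multiValued="false" name="notes" stored="true" type="TextField"/>
</fields>"""

updated, changed = enable_doc_values(sample)
print(changed)
print(updated)
```

ElementTree drops XML comments by default when parsing, so compare the output with the original schema file before deploying it, and remember that the re-index in note 2 is still required afterwards.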
