@andrefsp
Created March 13, 2013 14:12

A few general rules of thumb we work with:

  • Use the Oracle JVM, not OpenJDK. OpenJDK has a lot of issues with large amounts of memory.
  • You'll want twice as much memory dedicated to the JVM as your index occupies on disk, i.e. if your committed/optimized index is 20GB in size, you'll want 40GB of RAM plus a little spare.
  • Use a good garbage collector, e.g.::

    JAVA_OPTS="${JAVA_OPTS} -XX:+UseConcMarkSweepGC -XX:+UseParNewGC"

  • Logging helps a LOT when shit is not working as expected::

    JAVA_OPTS="${JAVA_OPTS} -Xloggc:/var/log/tomcat6/log_GarbageCollection -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
    JAVA_OPTS="${JAVA_OPTS} -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/tomcat6/oom_log"

REMEMBER, CACHES NEED RAM. ALLOCATE MORE RAM TO THE JVM AS REQUIRED BASED ON YOUR CACHES.
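
A minimal sketch of pinning the heap, following the 20GB-index example above (the 40g figure is a placeholder, not a recommendation; size it from your own index and caches)::

    JAVA_OPTS="${JAVA_OPTS} -Xms40g -Xmx40g"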

CACHE, CACHE, CACHE. CACHE AS MUCH AS YOU CAN.

Result cache

The result cache is a cache of your document IDs, so if you have 20 million documents, this cache needs to be big enough to fit all of their IDs. My general rule of thumb is to work out how many bytes a document ID takes up, pad it a little, and multiply by the total number of documents.
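
In solrconfig.xml this is the queryResultCache. A minimal sketch with placeholder sizes (derive the real size from the rule of thumb above)::

    <queryResultCache class="solr.LRUCache"
                      size="20000"
                      initialSize="5000"
                      autowarmCount="1000"/>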

Filter cache

<useFilterForSortedQuery>true</useFilterForSortedQuery> needs to be set in solrconfig.xml.

The size should be the total of all filters and filter combinations you use.

Consider the following: you have some categories and sub-categories you can filter on; I'll use 2 mediums and 5 genres.

DVD / Bluray

Comedy / Sci-Fi / Rom Com / Action / XXX

That would be mediums * genres = 2 * 5 = 10. This means your filter cache size would need to be 10.
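
A sketch of the matching filterCache entry in solrconfig.xml; the size below is padded well above the 10 from this example, and the numbers are placeholders, not recommendations::

    <filterCache class="solr.FastLRUCache"
                 size="64"
                 initialSize="16"
                 autowarmCount="16"/>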

Document cache

Size it greater than max_results * max_concurrent_queries.
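
A sketch in solrconfig.xml, assuming for illustration max_results of 50 and ~100 concurrent queries (so anything over 5000 entries); the figures are placeholders::

    <documentCache class="solr.LRUCache"
                   size="8192"
                   initialSize="2048"
                   autowarmCount="0"/>
    <!-- the documentCache isn't autowarmed: it keys on internal Lucene doc IDs,
         which change when a new searcher opens -->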

Commit strategy and soft commits

Solr doesn't like it when you commit too often. If you have a look at your /var/lib/solr/data directory while a commit is running, you'll gain some insight into how it works.

You should see something like this:

You have a 20GB index and you do a commit. A second index appears on disk and grows to >20GB; Solr then merges that >20GB index into the existing 20GB one, and the folder begins to shrink in size as the segments merge. Once the commit has completed you'll have an index that is somewhat over 20GB; this is the now-merged index. Triggering an optimize will shrink it back down to ~20GB.

You can use soft commits to combat the cost of updating Solr; we use a combination of time-based automatic hard commits (e.g. every 24 hours) and soft commits. Soft commits are new in Solr 4: they commit to memory but not to disk until a hard <commit /> is triggered. The upside is instant index updates; the downside is that if your server crashes, the commits held in memory are lost.
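
A sketch of that combination in solrconfig.xml; the intervals are placeholders (the 24-hour hard commit mirrors the example above)::

    <autoCommit>
      <maxTime>86400000</maxTime>        <!-- hard commit to disk every 24 hours -->
      <openSearcher>false</openSearcher> <!-- don't open a new searcher on the hard commit -->
    </autoCommit>
    <autoSoftCommit>
      <maxTime>60000</maxTime>           <!-- soft commit (in-memory visibility) every minute -->
    </autoSoftCommit>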

Notes (you may not like this section)

DON'T USE HAYSTACK. Unless you REALLY know what you're doing and how to hack around it, and even then, don't. Haystack does nasty shit like querying the database on search, because it returns an ID which then gets turned back into an object from the DB. I like Daniel, I have a lot of respect for him, but I hate Haystack.

Work on your schema, it's important; DO NOT simply use the Haystack-generated one. Good schema design is as important to Solr as it is to your relational database, probably even more so.

Credits

These are mostly things I've experienced and had the joy of dealing with over the years.

Credit to ranoble, also from Tangent. He's my "oh shit Solr is down/broken/slow" buddy.
