@earissola
Last active December 31, 2017 09:17
Sphinx (Open Source Search Server) - Real-Time Indexing and Searching
#
# Sphinx configuration file sample
#
#############################################################################
## index definition
#############################################################################
# realtime index
#
# You can run INSERT, REPLACE, and DELETE on this index on the fly
# using MySQL protocol (see 'listen' directive below)
index rtidx
{
# 'rt' index type must be specified to use RT index
type = rt
# index files path and file name, without extension
# mandatory, path must be writable, extensions will be auto-appended
path = /home/earissola/Universidad/IR/Twitter/sphinx/install/var/data/rtidx
# RAM chunk size limit
# RT index will keep at most this much data in RAM, then flush to disk
# optional, default is 128M
#
rt_mem_limit = 128M
# full-text field declaration
# multi-value, mandatory
rt_field = tweet_str
# unsigned integer attribute declaration
# multi-value (an arbitrary number of attributes is allowed), optional
# declares an unsigned 32-bit attribute
# rt_attr_uint = tweet_id
# For timestamps, Sphinx expects a Unix timestamp expressed as an integer value
# such as 1290375607, not the corresponding “2010-11-22 00:40:07” date and time
# string.
# rt_attr_timestamp = created_at
# RT indexes currently support the following attribute types:
# uint, bigint, float, timestamp, string, mva, mva64, json
#
# rt_attr_bigint = guid
# rt_attr_float = gpa
# rt_attr_timestamp = ts_added
# rt_attr_string = author
# rt_attr_multi = tags
# rt_attr_multi_64 = tags64
# rt_attr_json = extra_data
}
#############################################################################
## searchd settings
#############################################################################
searchd
{
# [hostname:]port[:protocol], or /unix/socket/path to listen on
# known protocols are 'sphinx' (SphinxAPI) and 'mysql41' (SphinxQL)
# In this case Sphinx is listening to requests to its Native API Protocol
# on port 9312 and to MySQL wire protocol on port 9306
listen = localhost:9312:sphinx
listen = localhost:9306:mysql41
# log file, searchd run info is logged here
# optional, default is 'searchd.log'
log = /home/earissola/Universidad/IR/Twitter/sphinx/install/var/log/searchd.log
# query log file, all search queries are logged here
# optional, default is empty (do not log queries)
query_log = /home/earissola/Universidad/IR/Twitter/sphinx/install/var/log/query.log
# client read timeout, seconds
# optional, default is 5
read_timeout = 5
# maximum amount of children to fork (concurrent searches to run)
# optional, default is 0 (unlimited)
max_children = 30
# PID file, searchd process ID file name
# mandatory
pid_file = /home/earissola/Universidad/IR/Twitter/sphinx/install/var/log/searchd.pid
# seamless rotate, prevents rotate stalls if precaching huge datasets
# optional, default is 1
seamless_rotate = 1
# whether to forcibly preopen all indexes on startup
# optional, default is 1 (preopen everything)
preopen_indexes = 1
# whether to unlink .old index copies on successful rotation.
# optional, default is 1 (do unlink)
unlink_old = 1
# multi-processing mode (MPM)
# known values are none, fork, prefork, and threads
# threads is required for RT backend to work
# optional, default is threads
workers = threads # for RT to work
# binlog files path; use empty string to disable binlog
# optional, default is build-time configured data directory
#
binlog_path = /home/earissola/Universidad/IR/Twitter/sphinx/install/var/data # binlog.001 etc will be created there
# RT RAM chunks flush period
# optional, default is 0 (no periodic flush)
#
# rt_flush_period = 900
# maximum RT merge thread IO calls per second, and per-call IO size
# useful for throttling (the background) OPTIMIZE INDEX impact
# optional, default is 0 (unlimited)
#
# rt_merge_iops = 40
# rt_merge_maxiosize = 1M
}
# --eof--

A brief introduction

Sphinx is an open source full-text search server, written in C++, that lets you batch index and search data stored in an SQL database, NoSQL storage, or plain files. It can also index and search data on the fly (the real-time, or RT, feature). Searches can be run via SphinxAPI (implemented in several languages, including Python, C/C++, and Java) or via SphinxQL: searchd speaks the MySQL network protocol, so queries written in this SQL dialect can be sent from any MySQL client.

Real-Time Indexing (RT)

A real-time index is split into two parts: a RAM chunk that always stays in memory and receives new content, and one or more disk chunks, each very similar in structure to a plain index. All new data goes to the RAM chunk, whose size is capped by the rt_mem_limit configuration option. When this limit is reached, the RAM chunk is flushed to a new disk chunk. A disk chunk behaves just like a plain index: its dictionary and stored attributes are loaded into memory. After the flush, the RAM chunk is empty and can be filled again, so the cycle repeats and every flush creates another disk chunk.

As more data is inserted, more disk chunks accumulate. The Sphinx daemon then has to read more files than it would for a single plain index, which means more I/O, and it has to merge the results from all the chunks, which ultimately lowers search speed. This degradation is called 'RT index fragmentation'. The value of rt_mem_limit and the size of the data set together determine how many disk chunks are created, and performance suffers once the index is highly fragmented across many of them. Note that rt_mem_limit is only an upper bound: if the RT index holds 1 MB of data while the limit is set to 2 GB, it consumes only 1 MB.
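The relationship between rt_mem_limit, data size, and the number of disk chunks can be sketched with some rough arithmetic. The snippet below is only a back-of-the-envelope model: it assumes each flush produces a disk chunk of about rt_mem_limit bytes, which ignores how Sphinx actually lays data out on disk. The helper names parse_size and estimate_disk_chunks are hypothetical and are not part of Sphinx.

```python
# Multipliers for size suffixes such as the "128M" used in sphinx.conf.
# Only K and M are handled here; this parser is a simplification.
_SUFFIXES = {"K": 1024, "M": 1024 ** 2}


def parse_size(text):
    """Convert a size string such as '128M' or '512K' into bytes."""
    text = text.strip().upper()
    if text and text[-1] in _SUFFIXES:
        return int(text[:-1]) * _SUFFIXES[text[-1]]
    return int(text)


def estimate_disk_chunks(data_size, rt_mem_limit="128M"):
    """Rough estimate: one disk chunk per full RAM chunk flushed (assumption)."""
    return parse_size(data_size) // parse_size(rt_mem_limit)


if __name__ == "__main__":
    # Same hypothetical 10 GB data set, three different limits.
    for limit in ("128M", "512M", "2048M"):
        print(limit, estimate_disk_chunks("10240M", limit))
```

Under this model, raising rt_mem_limit from 128M to 2048M cuts the estimated chunk count for the same data set by a factor of 16, which is why a larger limit tends to reduce fragmentation.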

Eliminating the I/O problem isn't everything: searchd still has to go through several chunks and merge the results, so the CPU can become a bottleneck. Real-time indexes therefore lag behind the search speed of a plain index, which consists of a single piece. To bring real-time index performance close to that of a plain index, run OPTIMIZE INDEX on it. The optimization does nothing more than merge all the disk chunks into one. The operation is quite I/O intensive: it reads all the data from a disk chunk, creates a temporary chunk (which isn't searchable), and merges the next chunk into it. After that, the temporary chunk is swapped in and the chunks that were merged are deleted.
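The merge procedure described above can be illustrated with a toy model. This is not how Sphinx stores or merges its data; it is a minimal sketch in which each disk chunk is reduced to a sorted list of document ids, and the function names search and optimize are made up for the illustration.

```python
from heapq import merge


def search(chunks):
    """Collect matches from every chunk and merge them into one sorted result."""
    return list(merge(*chunks))


def optimize(chunks):
    """Merge chunks two at a time until a single chunk remains (toy model)."""
    chunks = [list(chunk) for chunk in chunks]
    while len(chunks) > 1:
        # Build a temporary chunk from the first two chunks; in this model
        # it only becomes visible once it replaces its two inputs.
        temp = list(merge(chunks[0], chunks[1]))
        chunks = [temp] + chunks[2:]
    return chunks


if __name__ == "__main__":
    fragmented = [[1, 4, 9], [2, 5], [3, 7]]
    optimized = optimize(fragmented)
    print(len(fragmented), "chunks ->", len(optimized), "chunk")
    print(search(fragmented) == search(optimized))
```

The search result is the same before and after, but the optimized index answers from one chunk instead of three. On a running server the real operation is started from SphinxQL with OPTIMIZE INDEX rtidx;.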

Setup and Installation

  • Download: Sphinx 2.2.6-release
  • Configure build: ./configure --prefix=$HOME/install --without-mysql
  • Build and install: make -j4 install (the '-j' option runs build jobs in parallel, which speeds up the build)

Basic Configurations and RT Setup

All Sphinx programs require a configuration file, called sphinx.conf by default, which contains general settings, data source declarations, and full-text index declarations. Two sample configuration files, sphinx.conf.dist and sphinx-min.conf.dist, are bundled with Sphinx and are installed in $HOME/install/etc. See the attached sphinx.conf example for a better understanding of the available options.

Searchd

Once all the configuration directives are set up in the configuration file, launch the search daemon by executing: $HOME/install/bin/searchd. You can then talk to it in SphinxQL using the standard MySQL client: mysql -h0 -P9306. To populate the index with some data:

  • mysql> INSERT INTO rtidx (id, tweet_str) VALUES (1, 'this is a tweet sample long live and prosper');
  • mysql> INSERT INTO rtidx (id, tweet_str) VALUES (2, 'this is another tweet sample may the force be with you always');
  • mysql> INSERT INTO rtidx (id, tweet_str) VALUES (3, 'this is yet another tweet sample simplicity is the ultimate sophistication');
  • mysql> INSERT INTO rtidx (id, tweet_str) VALUES (4, 'final tweet sample live prosper and prosper live');
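The sample configuration notes that timestamp attributes take an integer Unix timestamp, not a date string. If the commented-out rt_attr_timestamp = created_at line were enabled, the value passed in an INSERT would have to be converted first. The sketch below shows one way to do that in Python; the function name to_unix_timestamp and its utc_offset_hours parameter are hypothetical. The configuration's own example pair (1290375607 and "2010-11-22 00:40:07") only lines up if the string is read as UTC+3, so the time zone of the source data has to be known.

```python
from datetime import datetime, timedelta, timezone


def to_unix_timestamp(text, utc_offset_hours=0):
    """Convert 'YYYY-MM-DD HH:MM:SS' to an integer Unix timestamp.

    utc_offset_hours states which time zone the string is written in;
    it is an assumption the caller has to supply.
    """
    tz = timezone(timedelta(hours=utc_offset_hours))
    moment = datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=tz)
    return int(moment.timestamp())


if __name__ == "__main__":
    print(to_unix_timestamp("2010-11-22 00:40:07", utc_offset_hours=3))
    print(to_unix_timestamp("2010-11-22 00:40:07"))  # same string read as UTC
```

The resulting integer would then go into the statement directly, for example as the created_at value in an INSERT that lists id, tweet_str, and created_at.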

To perform any lookup over the index simply execute:

  • mysql> SELECT * FROM rtidx WHERE MATCH ('prosper');
  • mysql> SELECT * FROM rtidx WHERE MATCH ('force always'); (AND is the implicit default operator)
  • mysql> SELECT * FROM rtidx WHERE MATCH ('live | always');
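To make the operator semantics concrete, here is a toy matcher run over the four sample tweets. It is a sketch of plain boolean matching only, not Sphinx's query parser: it assumes whitespace tokenization, ignores ranking and every other operator, and the names parse and match are invented for the example. It does follow the Sphinx rule that OR binds tighter than the implicit AND, so 'a b | c' is read as a AND (b OR c).

```python
TWEETS = {
    1: "this is a tweet sample long live and prosper",
    2: "this is another tweet sample may the force be with you always",
    3: "this is yet another tweet sample simplicity is the ultimate sophistication",
    4: "final tweet sample live prosper and prosper live",
}


def parse(query):
    """Split a query into AND-ed groups; each group is a set of OR-ed words."""
    groups = []
    join_next = False
    for token in query.lower().replace("|", " | ").split():
        if token == "|":
            join_next = True          # next word joins the previous group
        elif join_next and groups:
            groups[-1].add(token)
            join_next = False
        else:
            groups.append({token})
            join_next = False
    return groups


def match(query, docs=TWEETS):
    """Return the sorted ids of documents that satisfy every group."""
    groups = parse(query)
    hits = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        if groups and all(group & words for group in groups):
            hits.append(doc_id)
    return sorted(hits)


if __name__ == "__main__":
    for query in ("prosper", "force always", "live | always"):
        print(repr(query), match(query))
```

With the sample data, 'prosper' selects tweets 1 and 4, 'force always' selects only tweet 2, and 'live | always' selects tweets 1, 2, and 4.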

To see information about the current query:

  • mysql> SHOW META;

Get index information:

  • mysql> SHOW INDEX rtidx STATUS;