sandeepone/open-source-search-compare.md

Open Source Search Comparison

Elasticsearch


Main Web: http://www.elasticsearch.org/
Development URL: https://github.com/elasticsearch/elasticsearch
License: Apache 2
Environment: Java

Elasticsearch was created in 2010 by Shay Banon after he set aside Compass, an earlier search solution from 2004 that was also built on Lucene.

Marketing Points


Real time data, analytics
Distributed, scaled horizontally. Add nodes for capacity.
High availability, reorganizing clusters of nodes.
Multi-tenancy. Multiple indices in a cluster, added on the fly.
Full text search via Lucene. Marketed as the most powerful full text search capabilities of any open source product.
Document oriented. Store structured JSON docs.
Conflict management
Schema free, with the ability to define field-specific mappings at a later time.
RESTful API
Document changes are recorded in transaction logs in multiple nodes.

Technical Info


Built on Lucene
Data is stored with PUT and POST requests and retrieved with GET requests. Can check for existence of a document with HEAD requests. JSON documents can be deleted with DELETE requests.
Queries can be written in a JSON query language (the Query DSL) rather than as a URL query string.
Indexed documents are versioned. (Unique feature?)
Full text docs are stored in memory. A new option in 1.0 allows for doc values, which are stored on disk.
Suggesters are built in to suggest corrections or completions.
Plugin system available for custom functionality.
Possible admin interface via Elastic-HQ
Elasticsearch in Production is a great article on some of the realities faced when running Elasticsearch.
Securing your Elasticsearch cluster
Plugins available for authentication.
Why We Built Elasticsearch - dotScale presentation from the creator, Shay Banon
GitHub's transition from Solr to Elasticsearch


we quickly exceeded the volume, just literally the storage space that one Solr cluster and Solr instance could handle.


Many great Elasticsearch articles by Greg Brown.
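
The HTTP verbs and JSON queries described in the Technical Info above can be sketched as plain request descriptions. This is a minimal illustration, not a client: nothing is sent over the network. The base URL, the index and type names, and the helper names `doc_url`, `build_requests`, and `build_search` are assumptions made for the example. The `/index/type/id` path layout follows the Elasticsearch 1.x era this comparison was written in; later versions changed it.

```python
import json
from urllib.parse import quote

# Assumption: a single local node on the default HTTP port.
BASE_URL = "http://localhost:9200"


def doc_url(index, doc_type, doc_id):
    """Build the URL of one document (1.x style: /index/type/id)."""
    parts = [quote(index), quote(doc_type), quote(str(doc_id))]
    return BASE_URL + "/" + "/".join(parts)


def build_requests(index, doc_type, doc_id, document):
    """Return (method, url, body) tuples covering a document's lifecycle."""
    url = doc_url(index, doc_type, doc_id)
    return [
        ("PUT", url, json.dumps(document)),  # store the JSON document
        ("GET", url, None),                  # retrieve it
        ("HEAD", url, None),                 # check that it exists
        ("DELETE", url, None),               # delete it
    ]


def build_search(index, field, text):
    """Return a search request whose query is a JSON body, not a query string."""
    body = {"query": {"match": {field: text}}}
    return ("POST", BASE_URL + "/" + quote(index) + "/_search", json.dumps(body))


if __name__ == "__main__":
    for method, url, body in build_requests("articles", "article", 1, {"title": "Hello"}):
        print(method, url, body or "")
    print(*build_search("articles", "title", "hello"))
```

Any HTTP client can send these tuples, which is the practical meaning of "one RESTful endpoint for all languages".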

Sphinx


Main Web: http://sphinxsearch.com/
Development URL: http://sphinxsearch.com/bugs/my_view_page.php
License: GPLv2
Environment: C++

Sphinx was created in 2001 by Andrew Aksyonoff to solve a personal need for a search solution and has remained a standalone project.

Marketing Points


Supports on-the-fly (real time) and offline batch index creation.
Arbitrary attributes can be stored in the index.
Can index SQL DBs.
Can batch index xmlpipe2 and tsvpipe documents.
Three different APIs; native libraries are provided for SphinxAPI.
DB-like querying features.

Technical Info


Real time indexes can only be populated using SphinxQL
Disk based indexes can be built from SQL DBs, TSV, or custom XML format.
Example PHP API file to be included in projects communicating with Sphinx.
Uses fsockopen in PHP to make a connection with the Sphinx service similar to how a MySQL connection would be made.
Various Sphinx articles
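
Because real time indexes are populated through SphinxQL, the statements involved look much like SQL. The sketch below only builds the statement strings; it does not connect to anything. The index name `rt_articles`, its fields, and the helper names are hypothetical, and the escaping is deliberately simplified for illustration. Real code should rely on the parameter binding of whichever MySQL-protocol client it uses.

```python
def sphinxql_quote(value):
    """Wrap a string in single quotes, escaping backslashes and quotes.

    Simplified for illustration; not a complete escaping routine.
    """
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def insert_statement(index, doc_id, fields):
    """Build an INSERT for a real time index from a dict of text fields."""
    columns = ", ".join(["id"] + list(fields))
    values = ", ".join(
        [str(int(doc_id))] + [sphinxql_quote(v) for v in fields.values()]
    )
    return f"INSERT INTO {index} ({columns}) VALUES ({values})"


def match_statement(index, query, limit=10):
    """Build a full text SELECT using the MATCH() clause."""
    return (
        f"SELECT id FROM {index} "
        f"WHERE MATCH({sphinxql_quote(query)}) LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    print(insert_statement("rt_articles", 1, {"title": "Hello", "content": "It's real time"}))
    print(match_statement("rt_articles", "hello"))
```

These strings would be sent to the Sphinx daemon over its MySQL-compatible listener, which is why an ordinary MySQL client library is usually enough to talk to it.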

Solr


Main Web: http://lucene.apache.org/solr/
Development URL: https://issues.apache.org/jira/browse/SOLR
License: Apache 2
Environment: Java

Solr was created in 2004 at CNET by Yonik Seeley and donated to the Apache Software Foundation in 2006, where it became part of the Lucene project.

Marketing Points


REST-like API
Documents added via XML, JSON, CSV, or binary over HTTP.
Query with GET and receive XML, JSON, CSV, or binary results.
XML configuration
Extensible plugin architecture
AJAX based admin interface
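
Adding documents over HTTP and querying with GET, as listed above, can be sketched as follows. The snippet only builds request descriptions and sends nothing. The base URL, the core name `articles`, and the helper names are assumptions for the example; the exact update path has varied between Solr versions, so check it against the release in use.

```python
import json
from urllib.parse import urlencode

# Assumption: a local Solr instance on its customary port.
SOLR_BASE = "http://localhost:8983/solr"


def build_update(core, docs, commit=True):
    """Describe a POST that adds JSON documents to a core."""
    url = f"{SOLR_BASE}/{core}/update"
    if commit:
        url += "?" + urlencode({"commit": "true"})
    headers = {"Content-Type": "application/json"}
    return ("POST", url, headers, json.dumps(docs))


def build_select(core, query, fmt="json", rows=10):
    """Describe a GET query; `wt` selects the response format."""
    params = urlencode({"q": query, "wt": fmt, "rows": rows})
    return ("GET", f"{SOLR_BASE}/{core}/select?{params}")


if __name__ == "__main__":
    print(build_update("articles", [{"id": "1", "title": "Hello"}]))
    print(build_select("articles", "title:hello"))
```

Changing `fmt` to `xml` or `csv` is how the same query would return the other response formats mentioned above.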

Technical Info

Misc Documents


Introduction to Information Retrieval
A detailed series on Solr vs Elasticsearch

(1) Overview, (2) Data Handling, (3) Searching, (4) Faceting, (5) Management API, and (6) User/Dev Communities


Comparison between Solr and Sphinx
Comparison of full text search engines
Choosing a stand-alone full-text search server: Sphinx or SOLR

Misc Thoughts and Opinions

These thoughts and opinions were mostly formed during the creation of this document while researching various search solutions.

Elasticsearch provides a RESTful API endpoint for all requests from all languages. Sphinx provides language specific wrappers for the API to communicate with the service.
It seems more straightforward to push arbitrary documents and schema to Elasticsearch via JSON than to define fields up front as Sphinx requires. I'm not entirely sure about this point yet.
Sphinx is definitely designed around a SQL type structure, though it has been modified over time to support other data stores. I think this could be an issue.
That Elasticsearch is developed on GitHub is a big positive for me. Sphinx's combination of MantisBT and Google's code repository is a little annoying.
Sphinx's decision to implement xmlpipe2 and tsvpipe as data sources is somewhat confusing. I think the standard formats offered with Solr and Elasticsearch make more sense.
Elasticsearch was built to be real time from the beginning. Solr is near real-time. Sphinx started as a batch indexer and moved (rightly) to real time over time. See Sphinx real time caveats.
I'm a fan of this: one can launch ElasticSearch and start sending documents to it in order to have them indexed without creating any sort of index schema and ElasticSearch will try to guess field types.
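
To make the "guess field types" idea concrete, here is a toy sketch of inferring a mapping from a JSON document. It is not Elasticsearch's actual dynamic mapping logic, whose rules are more detailed and have changed between versions; the type names, the single date format, and the function names `guess_type` and `guess_mapping` are assumptions chosen for illustration.

```python
from datetime import datetime


def guess_type(value):
    """Guess a field type from one JSON value (toy rules, not Elasticsearch's)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        try:
            # Assumption: only plain YYYY-MM-DD strings count as dates here.
            datetime.strptime(value, "%Y-%m-%d")
            return "date"
        except ValueError:
            return "string"
    return "unknown"


def guess_mapping(document):
    """Guess a type for every top-level field of a document."""
    return {field: guess_type(value) for field, value in document.items()}


if __name__ == "__main__":
    doc = {"title": "Hello", "views": 3, "score": 1.5,
           "published": True, "created": "2014-01-15"}
    print(guess_mapping(doc))
```

The convenience comes with a trade-off: a guess made from the first document can be wrong for later ones, which is why the ability to assign explicit mappings later matters.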