@jeremyfelt
Last active July 19, 2023 01:28
Comparing open source search solutions

Open Source Search Comparison

Elasticsearch

Elasticsearch was created in 2010 by Shay Banon after forgoing work on another search solution, Compass, also built on Lucene and created in 2004.

Marketing Points

  • Real time data, analytics
  • Distributed, scaled horizontally. Add nodes for capacity.
  • High availability, reorganizing clusters of nodes.
  • Multi-tenancy. Multiple indices in a cluster, added on the fly.
  • Full text search via Lucene. Most powerful full text search capabilities in any open source product
  • Document oriented. Store structured JSON docs.
  • Conflict management
  • Schema free with the ability to assign specific knowledge at a later time
  • RESTful API
  • Document changes are recorded in transaction logs in multiple nodes.

Technical Info

  • Built on Lucene
  • Data is stored with PUT and POST requests and retrieved with GET requests. Can check for existence of a document with HEAD requests. JSON documents can be deleted with DELETE requests.
  • Requests can be made with JSON query language rather than a query string.
  • Indexed documents are versioned. (Unique feature?)
  • Full text docs are stored in memory. A new option in 1.0 allows for doc values, which are stored on disk.
  • Suggesters are built in to suggest corrections or completions.
  • Plugin system available for custom functionality.
  • Possible admin interface via Elastic-HQ
  • Elasticsearch in Production is a great article on some of the realities faced when running Elasticsearch.
  • Securing your Elasticsearch cluster
  • Plugins available for authentication.
  • Why We Built Elasticsearch - dotScale presentation from the creator, Shay Banon
  • GitHub's transition from Solr to Elasticsearch
    • we quickly exceeded the volume, just literally the storage space that one Solr cluster and Solr instance could handle.
  • Many great Elasticsearch articles by Greg Brown.
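
The HTTP verbs listed above map onto a single document URL, and a search can carry a JSON body instead of a query string. The following is a minimal sketch of that request lifecycle, not a definitive client: it only builds unsent requests with Python's standard library. The host and port (`localhost:9200`), the `blog` index, the `post` type, and the document ID are all assumptions chosen for illustration, and the `/{index}/{type}/{id}` URL layout follows the 1.x-era API these notes describe.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumption: a local node on the default HTTP port


def doc_url(index, doc_type, doc_id):
    """Build the URL of a single document (hypothetical index/type/id values)."""
    return f"{BASE_URL}/{index}/{doc_type}/{doc_id}"


def build_request(method, url, body=None):
    """Create an unsent HTTP request, JSON-encoding the body if one is given."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return Request(url, data=data, headers=headers, method=method)


url = doc_url("blog", "post", 1)

store = build_request("PUT", url, {"title": "Comparing search engines"})  # store a document
fetch = build_request("GET", url)      # retrieve it
exists = build_request("HEAD", url)    # check for existence without a response body
remove = build_request("DELETE", url)  # delete it

# A query expressed as a JSON body rather than a query string.
search = build_request(
    "POST",
    f"{BASE_URL}/blog/_search",
    {"query": {"match": {"title": "search"}}},
)

for req in (store, fetch, exists, remove, search):
    print(req.get_method(), req.full_url)
```

Passing any of these objects to `urllib.request.urlopen` would send it to a running node; nothing here contacts a server on its own.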

Sphinx

Sphinx was created in 2001 by Andrew Aksyonoff to solve a personal need for a search solution and has remained a standalone project.

Marketing Points

  • Supports on the fly (real time) and offline batch index creation.
  • Arbitrary attributes can be stored in the index.
  • Can index SQL DBs
  • Can batch index XMLpipe2 and (?) tsvpipe documents
  • 3 different APIs, native libraries provided for SphinxAPI
  • DB like querying features.

Technical Info

  • Real time indexes can only be populated using SphinxQL
  • Disk based indexes can be built from SQL DBs, TSV, or custom XML format.
  • Example PHP API file to be included in projects communicating with Sphinx.
  • Uses fsockopen in PHP to make a connection with the Sphinx service similar to how a MySQL connection would be made.
  • Various Sphinx articles
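
Because real time indexes are populated through SphinxQL, writing to one looks much like issuing SQL over a MySQL-style connection. The sketch below only composes the statements and their parameters; it does not open a connection. The index name `rt_posts`, its `title` and `content` fields, and the `%s` placeholder style (the convention of common Python MySQL drivers) are assumptions for illustration. Index and column names are interpolated directly into the statement, so they must come from trusted code rather than user input.

```python
def rt_insert(index, doc_id, fields):
    """Return a parameterized SphinxQL INSERT and its values for a real time index."""
    columns = ["id"] + list(fields)
    placeholders = ", ".join(["%s"] * len(columns))
    sql = f"INSERT INTO {index} ({', '.join(columns)}) VALUES ({placeholders})"
    return sql, [doc_id] + list(fields.values())


def rt_search(index, text, limit=10):
    """Return a parameterized full text SELECT using SphinxQL's MATCH() clause."""
    sql = f"SELECT id FROM {index} WHERE MATCH(%s) LIMIT {int(limit)}"
    return sql, [text]


insert_sql, insert_params = rt_insert(
    "rt_posts",  # hypothetical real time index
    1,
    {"title": "Comparing search engines", "content": "Notes on Sphinx"},
)
search_sql, search_params = rt_search("rt_posts", "sphinx", limit=5)

print(insert_sql, insert_params)
print(search_sql, search_params)
```

With a MySQL client library connected to the Sphinx daemon's SphinxQL listener, each pair could be handed to a cursor's `execute(sql, params)` call.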

Solr

Solr was created in 2004 at CNET by Yonik Seeley and donated to the Apache Software Foundation in 2006 to become part of the Lucene project.

Marketing Points

  • REST-like API
  • Documents added via XML, JSON, CSV, or binary over HTTP.
  • Query with GET and receive XML, JSON, CSV, or binary results.
  • XML configuration
  • Extensible plugin architecture
  • AJAX based admin interface
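
Adding documents over HTTP and querying with GET can be sketched as two request builders. This is a hedged illustration under stated assumptions, not a full client: the host and port (`localhost:8983`), the `articles` core name, and the document fields are hypothetical, and the requests are built but never sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

SOLR_URL = "http://localhost:8983/solr"  # assumption: a local Solr on its default port


def build_update(core, docs):
    """Unsent POST that adds a list of JSON documents to a core and commits them."""
    url = f"{SOLR_URL}/{core}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return Request(url, data=body, headers=headers, method="POST")


def build_select(core, query, fmt="json"):
    """Unsent GET query; the wt parameter selects the response format."""
    params = urlencode({"q": query, "wt": fmt})
    return Request(f"{SOLR_URL}/{core}/select?{params}", method="GET")


update = build_update("articles", [{"id": "1", "title": "Comparing search engines"}])
select = build_select("articles", "title:search")

print(update.get_method(), update.full_url)
print(select.get_method(), select.full_url)
```

Changing `fmt` to `xml` or `csv` requests one of the other result formats listed above.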

Technical Info

Misc Documents

Misc Thoughts and Opinions

These thoughts and opinions were mostly formed during the creation of this document while researching various search solutions.

  • Elasticsearch provides a single RESTful API endpoint that any language can call over HTTP. Sphinx instead provides language-specific wrappers for its API to communicate with the service.
  • It seems more straightforward to push arbitrary documents and schema via JSON at Elasticsearch than to create fields as Sphinx requires. I'm not entirely sure on this point yet.
  • Sphinx is definitely designed around a SQL type structure, though it has been modified over time to support other data stores. I think this could be an issue.
  • That Elasticsearch is developed on GitHub is a big positive for me. The combined interfaces of MantisBT and Google's code repository that Sphinx uses are a little annoying.
  • Decisions like implementing xmlpipe2 and tsvpipe by Sphinx as data sources are somewhat confusing. I think the standard formats offered with Solr and Elasticsearch make more sense.
  • Elasticsearch was built to be real time from the beginning. Solr is near real-time. Sphinx started as a batch indexer and moved (rightly) to real time over time. See Sphinx real time caveats.
  • I'm a fan of this:
    one can launch ElasticSearch and start sending documents to it in order to have them indexed without creating any sort of index schema and ElasticSearch will try to guess field types.
@GitKageHub

I would also like to see this research continued.
