Skip to content

Instantly share code, notes, and snippets.

@karussell
Last active October 30, 2023 16:14
Show Gist options
  • Star 17 You must be signed in to star a gist
  • Fork 6 You must be signed in to fork a gist
  • Save karussell/2878392 to your computer and use it in GitHub Desktop.
Save karussell/2878392 to your computer and use it in GitHub Desktop.
ElasticSearch from SQL DB
Why is there no such DataImportHandler thing in ElasticSearch? Uhm, well ... but because:
1. You should really consider your own scripts
(be it jvm based, perl, ruby, php, nodejs/javascript)
to feed ElasticSearch via bulk indexing:
http://www.elasticsearch.org/guide/reference/java-api/bulk.html
2. There are two projects doing it already:
* http://code.google.com/p/sql-to-nosql-importer/
* https://github.com/Aconex/scrutineer (keeps DB in synch with ES or solr!)
3. In theorie you could use DIH in elasticsearch as well ;)
http://www.mattweber.org/2011/12/14/elasticsearch-mock-solr-plugin/
4. In mysql you could add an http trigger:
http://code.google.com/p/mysql-udf-http
5. You could use hydra and start with the 'correct' & scalable
solution: https://github.com/Findwise/Hydra
@jprante
Copy link

jprante commented Jun 5, 2012

  • Elasticsearch has the river mechanism for gathering data from external sources. A JDBC river with configurable SQL statement and driver is something like a DataImportHandler. JDBC rivers would be able to sync with many external databases. I'm working on such a beast, many things are not there, for instance building nested JSON objects from mapped SQL columns, which is not straightforward like reading rows from a table for bulk indexing. Prototype: https://github.com/jprante/elasticsearch-river-jdbc

@mattweber
Copy link

I have looked into this a while back. Seems that if you wanted to use DIH, it would be pretty trivial to patch.

  1. Implement https://github.com/apache/lucene-solr/blob/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/DIHWriter.java interface using CommonsHttpSolrServer pointing at an ElasticSearch instance with my Mock Solr plugin installed.
  2. Patch https://github.com/apache/lucene-solr/blob/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/DataImportHandler.java#L281 to use custom implementation.

@dadoonet
Copy link

dadoonet commented Jun 6, 2012

@jprante : very interesting start. One thing I did not understand is how do you know that there is a change in the SQL database ?
Are you currently reindexing all the resultset each time ?
Do you think adding something like a "where clause" or another SQL that the user can set to tell the river that something has changed ?

@karussell
Copy link
Author

Thanks * 2!

@jprante
Copy link

jprante commented Jun 6, 2012

@dadoonet because a river is a "pull mechanism", the idea is to fetch all the data at least once, but then, it should be able to get incremental updates. A "push mechanism" would be much better of course (see mysql-udf-http)

One option is time-based incremental updating. I think the SQL statement can be specified as a template, using something like the current timestamp "$now" plus Elasticsearch TimeValues as variables that can be evaluated for JDBC DATE, TIME, or TIMESTAMP bind parameters.

The other option is like you said, an incremental set of rows. A second SQL statement parameter could be used to return a set indicating a change in the data (or is empty if there is no change). The result set consists of row IDs for the first SQL statement, which returns rows that can be indexed bulk-wise.

@karussell
Copy link
Author

a guy from twitter (@joelwes) suggested me for the sql part the manifoldCF project. Have a look:

http://incubator.apache.org/connectors/en_US/index.html

@thg303
Copy link

thg303 commented Aug 8, 2017

@karussell They are out of incubator now https://manifoldcf.apache.org/
and as the documentation says:

Apache ManifoldCF is an effort to provide an open source framework for connecting source content repositories like Microsoft Sharepoint and EMC Documentum, to target repositories or indexes, such as Apache Solr, Open Search Server, or ElasticSearch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment