Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
ElasticSearch from SQL DB
Why is there no such DataImportHandler thing in ElasticSearch? Uhm, well ... but because:
1. You should really consider your own scripts
(be it jvm based, perl, ruby, php, nodejs/javascript)
to feed ElasticSearch via bulk indexing:
http://www.elasticsearch.org/guide/reference/java-api/bulk.html
2. There are two projects doing it already:
* http://code.google.com/p/sql-to-nosql-importer/
* https://github.com/Aconex/scrutineer (keeps DB in synch with ES or solr!)
3. In theorie you could use DIH in elasticsearch as well ;)
http://www.mattweber.org/2011/12/14/elasticsearch-mock-solr-plugin/
4. In mysql you could add an http trigger:
http://code.google.com/p/mysql-udf-http
5. You could use hydra and start with the 'correct' & scalable
solution: https://github.com/Findwise/Hydra
@jprante

This comment has been minimized.

Copy link

jprante commented Jun 5, 2012

  • Elasticsearch has the river mechanism for gathering data from external sources. A JDBC river with configurable SQL statement and driver is something like a DataImportHandler. JDBC rivers would be able to sync with many external databases. I'm working on such a beast, many things are not there, for instance building nested JSON objects from mapped SQL columns, which is not straightforward like reading rows from a table for bulk indexing. Prototype: https://github.com/jprante/elasticsearch-river-jdbc
@mattweber

This comment has been minimized.

Copy link

mattweber commented Jun 5, 2012

I have looked into this a while back. Seems that if you wanted to use DIH, it would be pretty trivial to patch.

  1. Implement https://github.com/apache/lucene-solr/blob/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/DIHWriter.java interface using CommonsHttpSolrServer pointing at an ElasticSearch instance with my Mock Solr plugin installed.
  2. Patch https://github.com/apache/lucene-solr/blob/trunk/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/DataImportHandler.java#L281 to use custom implementation.
@dadoonet

This comment has been minimized.

Copy link

dadoonet commented Jun 6, 2012

@jprante : very interesting start. One thing I did not understand is how do you know that there is a change in the SQL database ?
Are you currently reindexing all the resultset each time ?
Do you think adding something like a "where clause" or another SQL that the user can set to tell the river that something has changed ?

@karussell

This comment has been minimized.

Copy link
Owner Author

karussell commented Jun 6, 2012

Thanks * 2!

@jprante

This comment has been minimized.

Copy link

jprante commented Jun 6, 2012

@dadoonet because a river is a "pull mechanism", the idea is to fetch all the data at least once, but then, it should be able to get incremental updates. A "push mechanism" would be much better of course (see mysql-udf-http)

One option is time-based incremental updating. I think the SQL statement can be specified as a template, using something like the current timestamp "$now" plus Elasticsearch TimeValues as variables that can be evaluated for JDBC DATE, TIME, or TIMESTAMP bind parameters.

The other option is like you said, an incremental set of rows. A second SQL statement parameter could be used to return a set indicating a change in the data (or is empty if there is no change). The result set consists of row IDs for the first SQL statement, which returns rows that can be indexed bulk-wise.

@karussell

This comment has been minimized.

Copy link
Owner Author

karussell commented Jun 12, 2012

a guy from twitter (@joelwes) suggested me for the sql part the manifoldCF project. Have a look:

http://incubator.apache.org/connectors/en_US/index.html

@thg303

This comment has been minimized.

Copy link

thg303 commented Aug 8, 2017

@karussell They are out of incubator now https://manifoldcf.apache.org/
and as the documentation says:

Apache ManifoldCF is an effort to provide an open source framework for connecting source content repositories like Microsoft Sharepoint and EMC Documentum, to target repositories or indexes, such as Apache Solr, Open Search Server, or ElasticSearch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.