Skip to content

Instantly share code, notes, and snippets.

@rnjailamba
Last active March 16, 2024 14:10
Show Gist options
  • Star 11 You must be signed in to star a gist
  • Fork 2 You must be signed in to fork a gist
  • Save rnjailamba/8c872768b136a88a10b1 to your computer and use it in GitHub Desktop.
Save rnjailamba/8c872768b136a88a10b1 to your computer and use it in GitHub Desktop.
Solr + Mysql + Data Import Handler + Delta Import Handler [ Version 5.5.0 of solr ]

Importing/Indexing database (MySQL) in Solr using Data/Delta Import Handler

Install Solr

Download and install Solr from http://lucene.apache.org/solr/. You can access Solr admin from your browser: http://localhost:8983/solr/

MySQL connector

Download JDBC driver for MySQL from http://dev.mysql.com/downloads/connector/j/.
Copy file from the downloaded archive 'mysql-connector-java-*.jar'
to the folder '/lib' in the folder where Solr was installed.
Create 'lib' folder if needed.

Setup a new collection

bin/solr start
bin/solr create -c myproducts
The collection will be located in 'solr-5.5.0/server/solr/myproducts' folder.
You will see folders conf and data in the collection folder:

  • /solr/myproducts/conf
  • /solr/myproducts/data

solrconfig.xml

edit solrconfig.xml by adding:

Make sure that 'dist' folder contains two files for data import handler:

  • solr-dataimporthandler-4.10.2.jar
  • solr-dataimporthandler-extras-4.10.2.jar

Add these lines to solrconfig.xml:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
</requestHandler>

data-config.xml for MySQL database

The file 'data-config.xml' will define data we want to import/index from our datasource. Assuming that our DB named mydb1 and we have table products with columns id, name and updated_at.
Column 'updated_at' of datetime type stores the date of last modification of the row.
This column will be used in incremental import to track rows modified since the last import into Solr.
Create table -
  CREATE TABLE products (

    id int(11) NOT NULL AUTO_INCREMENT,

    updated_at timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,

    name varchar(163) DEFAULT NULL,

    PRIMARY KEY (id)

  )

<dataConfig>
<dataSource type="JdbcDataSource" 
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/mydb1" 
            user="root" 
            password="root"/>
<document>
  <entity name="product"  
    pk="id"
    query="select id,name from products"
    deltaImportQuery="SELECT id,name from products WHERE id='${dih.delta.id}'"
    deltaQuery="SELECT id FROM products  WHERE updated_at > '${dih.last_index_time}'"
    >
     <field column="id" name="id"/>
     <field column="name" name="name"/>       
  </entity>
</document>
</dataConfig>

One issue that i had was with difference in timezone of Mysql and Solr due to which even my delta-imports were functioning as full-imports.
My workaround was - deltaQuery="SELECT id FROM products WHERE updated_at > DATE_ADD('${dih.last_index_time}',INTERVAL 330 MINUTE)"

  • The 'query' gives the data needed to populate fields of the Solr document in full-import
  • The 'deltaImportQuery' gives the data needed to populate fields when running a delta-import
  • The 'deltaQuery' gives the primary keys of the current entity which have changes since the last index time
  • dih.last_index_time is located in dataimport.properties in same folder as solrconfig.xml
  • data-config.xml is located in same folder as solrconfig.xml

Full-import command uses the "query" query, delta-import command uses the delta components.

schema.xml

[ schema.xml has been removed in favor of managed-schema these days. To get it back read the following - https://gist.github.com/rnjailamba/dc5068fbd883d963f7ec ]

edit file 'schema.xml' accordingly to fields defined in data-import.xml:

<schema name="example" version="1.5">
    <field name="_version_" type="long" indexed="true" stored="true"/>
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 
    <field name="name" type="string" indexed="true" stored="true" multiValued="false" />
    ...

Perform full or delta import

After successfully adding a collection to Solr you can select it and run dataimport commands from Solr Admin -

The full import loads all data every time, while incremental import means only adding the data that changed since the last indexing. By default, full import starts with removal the existing index (parameter clean=true). Note! Use clean=false while running delta-import command. debug=true - The debug mode limits the number of rows to 10 by default and it also forces indexing to be synchronous with the request.

References

@pallavtrivedi03
Copy link

I did everything as mentioned and I am getting an error "Data Config problem: Content is not allowed in prolog." while executing full import. Any idea, what went wrong?

@avibhamare
Copy link

Remove below first line from your data-config.xml
# define data source

@rnjailamba
Copy link
Author

updated

@stephenjesus
Copy link

its useful for developers
https://onlinedevtools.in

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment