@Amit88k
Last active January 5, 2023 21:17
Indexing MongoDB data in Apache Solr
Indexing data for fast and efficient retrieval is an important feature that most applications require. I have used MongoDB to store the data and Solr to index it. Although MongoDB provides built-in full-text search, it does not provide the more advanced indexing and search features that Solr offers.
I have used a Linux environment for this Gist. The following software is required:
- Java
- Python
- Apache Solr
- MongoDB
- Mongo Connector
Install JDK8:
# sudo yum install java-1.8.0-openjdk-devel
Set JAVA_HOME / PATH for a single user
Log in to your account and open the .bash_profile file
# vi ~/.bash_profile
Set JAVA_HOME using the syntax export JAVA_HOME=<path-to-java>. JAVA_HOME should point to the JDK installation directory, not to the java binary inside it. If your java binary is at /usr/java/java-1.8.0-openjdk/bin/java, set it as follows:
export JAVA_HOME=/usr/java/java-1.8.0-openjdk
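Solr's start script looks for the java executable under $JAVA_HOME/bin, so JAVA_HOME is expected to name the JDK directory rather than the binary itself. As a minimal sketch of that relationship, the helper below derives the directory from the path of a java binary. The function name java_home_from_binary is hypothetical, and the sample path is only an example; use the path on your own machine.

```python
from pathlib import PurePosixPath


def java_home_from_binary(java_binary: str) -> str:
    """Return the JDK directory for a path that ends in bin/java."""
    path = PurePosixPath(java_binary)
    # JAVA_HOME is the directory that contains bin/, so strip "bin/java".
    if path.name != "java" or path.parent.name != "bin":
        raise ValueError(f"expected a path ending in bin/java, got {java_binary}")
    return str(path.parent.parent)


if __name__ == "__main__":
    # Example location only; substitute the path reported on your machine.
    print(java_home_from_binary("/usr/java/java-1.8.0-openjdk/bin/java"))
```

The printed value is what would go on the right-hand side of the export line.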
Install Python on the Linux machine:
# yum install -y python36u
Get the Python path:
# which python
You need to set up the global configuration in /etc/profile or /etc/bash.bashrc so that it applies to all users:
# vi /etc/profile
Add Python to PATH in the same way as JAVA_HOME. PATH entries are directories, so append the directory that contains the binary reported by which (for example, /usr/bin if the binary is /usr/bin/python3.6):
export PATH=$PATH:/usr/bin
Save the changes and reload the profile:
# source /etc/profile
Create a directory /solr to install the software in, and open its permissions to all users (777), using the following commands:
# mkdir /solr (I generally prefer to do in /opt)
# chmod 777 /solr
# cd /solr
Download and extract the Apache Solr tgz file:
# wget http://apache.mirror.vexxhost.com/lucene/solr/5.3.1/solr-5.3.1.tgz
# tar -xvf solr-5.3.1.tgz
NOTE: You can download other versions of Solr from here: https://archive.apache.org/dist/lucene/solr/
Go to the solr-5.3.1 directory:
# cd solr-5.3.1
Start the Solr server:
# ./bin/solr start
Check the status of the Solr server:
# ./bin/solr status
Create a Solr core:
# ./bin/solr create -c wlslog
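Besides ./bin/solr status, you can check that the wlslog core was really created by asking Solr's CoreAdmin API for its status. The following is a minimal sketch, assuming Solr listens on the default localhost:8983; the helper names core_status_url, core_exists, and fetch_core_status are my own, and fetch_core_status only works against a running server.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR_URL = "http://localhost:8983/solr"  # assumed default host and port


def core_status_url(core: str, base_url: str = SOLR_URL) -> str:
    """Build the CoreAdmin STATUS URL for one core."""
    params = urlencode({"action": "STATUS", "core": core, "wt": "json"})
    return f"{base_url}/admin/cores?{params}"


def core_exists(status_response: dict, core: str) -> bool:
    """Treat a missing or empty status entry as 'core not found'."""
    return bool(status_response.get("status", {}).get(core))


def fetch_core_status(core: str) -> dict:
    """Query a running Solr server (needs network access to it)."""
    with urlopen(core_status_url(core)) as response:
        return json.load(response)
```

With Solr running, core_exists(fetch_core_status("wlslog"), "wlslog") should evaluate to True once the core has been created.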
Next, we need to configure Apache Solr. The fields in the MongoDB documents to be indexed are specified in the schema.xml configuration file. Open the schema.xml in a vi editor.
# vi /solr/solr-5.3.1/server/solr/wlslog/conf/schema.xml
Add the fields time_stamp, category, type, servername, code, and msg. Mongo Connector also stores metadata associated with each MongoDB document it indexes in the fields ns and _ts, so add the ns and _ts fields to schema.xml as well.
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.5">
<field name="time_stamp" type="string" indexed="true" stored="true" multiValued="false" />
<field name="category" type="string" indexed="true" stored="true" multiValued="false" />
<field name="type" type="string" indexed="true" stored="true" multiValued="false" />
<field name="servername" type="string" indexed="true" stored="true" multiValued="false" />
<field name="code" type="string" indexed="true" stored="true" multiValued="false" />
<field name="msg" type="string" indexed="true" stored="true" multiValued="false" />
<field name="_ts" type="long" indexed="true" stored="true" />
<field name="ns" type="string" indexed="true" stored="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<uniqueKey>id</uniqueKey>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
</schema>
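Every type referenced by a field must have a matching fieldType definition, otherwise Solr refuses to load the core. As a rough sanity check before restarting Solr, the sketch below parses a schema and reports any type that is used but not defined. The function name undefined_field_types is hypothetical, and SAMPLE_SCHEMA is a trimmed-down schema that deliberately leaves out one definition to show what the check reports.

```python
import xml.etree.ElementTree as ET

# Trimmed-down schema for illustration only; it omits the "long" fieldType.
SAMPLE_SCHEMA = """<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.5">
  <field name="id" type="string" indexed="true" stored="true" required="true" />
  <field name="msg" type="string" indexed="true" stored="true" />
  <field name="_ts" type="long" indexed="true" stored="true" />
  <uniqueKey>id</uniqueKey>
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
</schema>
"""


def undefined_field_types(schema_xml: str) -> set:
    """Return the field types that are referenced but never defined."""
    root = ET.fromstring(schema_xml.encode("utf-8"))
    defined = {field_type.get("name") for field_type in root.iter("fieldType")}
    used = {field.get("type") for field in root.iter("field")}
    return used - defined


if __name__ == "__main__":
    print(undefined_field_types(SAMPLE_SCHEMA))
```

To check the real file, read /solr/solr-5.3.1/server/solr/wlslog/conf/schema.xml into a string and pass it to the same function; an empty set means every referenced type is defined.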
Fields not defined in schema.xml are not indexed. We also need to configure the org.apache.solr.handler.admin.LukeRequestHandler request handler in the solrconfig.xml. Requests to Solr server are routed through the request handler. Open the solrconfig.xml in the vi editor.
# vi /solr/solr-5.3.1/server/solr/wlslog/conf/solrconfig.xml
Specify the request handler for the Mongo Connector.
<requestHandler name="/admin/luke" class="org.apache.solr.handler.admin.LukeRequestHandler" />
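Also configure autoCommit, inside the updateHandler element, so that Solr automatically commits the data received from MongoDB. With the settings below, Solr commits at most 15 seconds after a document is added (maxTime is in milliseconds), and setting openSearcher to true makes the committed documents visible to searches.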
<autoCommit>
<maxTime>15000</maxTime>
<openSearcher>true</openSearcher>
</autoCommit>
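Since maxTime is given in milliseconds, it is easy to misread the value. The small sketch below parses the fragment above and converts it into more readable settings. The function name auto_commit_settings and the returned key names are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET

AUTO_COMMIT = """<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>true</openSearcher>
</autoCommit>"""


def auto_commit_settings(fragment: str) -> dict:
    """Read maxTime (milliseconds) and openSearcher from an autoCommit element."""
    node = ET.fromstring(fragment)
    return {
        # Convert milliseconds to seconds for readability.
        "max_time_seconds": int(node.findtext("maxTime")) / 1000,
        "open_searcher": node.findtext("openSearcher").strip().lower() == "true",
    }


if __name__ == "__main__":
    print(auto_commit_settings(AUTO_COMMIT))
```

A lower maxTime makes new MongoDB documents searchable sooner, at the cost of more frequent commits.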
After modifying schema.xml and solrconfig.xml, the Solr server needs to be restarted.
# bin/solr restart
NOTE: I'll add the complete schema.xml and solrconfig.xml files in another section.
1. Configure the package management system (yum).
Create a /etc/yum.repos.d/mongodb-org-4.2.repo file so that you can install MongoDB directly using yum:
[mongodb-org-4.2]
name=MongoDB Repository
baseurl=https://repo.mongodb.org/yum/redhat/$releasever/mongodb-org/4.2/x86_64/
gpgcheck=1
enabled=1
gpgkey=https://www.mongodb.org/static/pgp/server-4.2.asc
2. Install the MongoDB packages:
# sudo yum install -y mongodb-org
3. Create a directory for dbpath (the -p option also creates /data if it does not exist):
# sudo mkdir -p /data/db
4. Start MongoDB
# mongod --port 27017 --dbpath /data/db --replSet rs0 --bind_ip localhost,BigData
Note: Here BigData is the host name.
5. Start mongo shell:
# mongo
6. The MongoDB shell starts. We need to initiate the replica set. Run the following command to initiate it:
# rs.initiate()
In this section we shall create a MongoDB collection to store documents.
1. Switch to the “solr” database. MongoDB creates the database implicitly the first time data is stored in it.
# use solr
2. Store the MongoDB documents in a collection called “wlslog”. First, check whether the collection already exists and contains documents with the following command.
# db.wlslog.find()
3. If any documents are listed, the wlslog collection exists and has documents in it. Drop the wlslog collection with the following command.
# db.wlslog.drop()
4. Create the wlslog collection again with the following command.
# db.createCollection("wlslog")
In this section, we shall add some documents to the wlslog collection. Create JSON for three documents doc1, doc2, and doc3.
1. doc1 = {"time_stamp":"Apr-8-2014-7:06:16-PM-PDT","category": "Notice","type":"WebLogicServer",
"servername": "AdminServer","code":"BEA-000365","msg": "Server state changed to STANDBY" }
2. doc2 ={"time_stamp":"Apr-8-2014-7:06:17-PM-PDT","category": "Notice","type":"WebLogicServer",
"servername": "AdminServer","code":"BEA-000365","msg": "Server state changed to STARTING" }
3. doc3 ={"time_stamp":"Apr-8-2014-7:06:18-PM-PDT","category": "Notice","type":"WebLogicServer",
"servername": "AdminServer","code":"BEA-000360","msg": "Server started in RUNNING mode" }
Three documents get created.
4. Add the three documents to MongoDB with the following command:
# db.wlslog.insert([doc1,doc2,doc3])
The three documents get added. The nInserted value of 3 indicates that 3 documents have been added.
5. Query the wlslog collection with the find() method.
# db.wlslog.find()
The three documents get listed.
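To make the link between these documents and the Solr fields defined in schema.xml more concrete, the sketch below shows roughly what one of them could look like once it is indexed: the MongoDB _id is stored under the Solr unique key (id), and the metadata fields ns and _ts are added. This is a simplified illustration and not Mongo Connector's actual code; the function name to_solr_doc and the sample _id and timestamp values are hypothetical.

```python
def to_solr_doc(mongo_doc, namespace, timestamp, unique_key="id"):
    """Approximate the shape of the Solr document for one MongoDB document."""
    # Copy every field except MongoDB's own primary key.
    solr_doc = {key: value for key, value in mongo_doc.items() if key != "_id"}
    # The primary key is stored under the Solr unique key instead.
    solr_doc[unique_key] = str(mongo_doc["_id"])
    # Metadata fields that must also be declared in schema.xml.
    solr_doc["ns"] = namespace
    solr_doc["_ts"] = timestamp
    return solr_doc


doc1 = {
    "_id": "hypothetical-id-1",  # MongoDB normally generates an ObjectId here
    "time_stamp": "Apr-8-2014-7:06:16-PM-PDT",
    "category": "Notice",
    "type": "WebLogicServer",
    "servername": "AdminServer",
    "code": "BEA-000365",
    "msg": "Server state changed to STANDBY",
}

if __name__ == "__main__":
    print(to_solr_doc(doc1, "solr.wlslog", 1))
```

The namespace solr.wlslog is simply the database name and the collection name joined by a dot.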
1. To install the Mongo Connector run the following command.
# pip install mongo-connector
2. New in mongo-connector 2.5.0: to install mongo-connector together with the solr-doc-manager, run:
# pip install 'mongo-connector[solr]'
3. Run mongo-connector:
# mongo-connector --unique-key=id -n solr.wlslog -m localhost:27017 -t http://localhost:8983/solr/wlslog -d solr_doc_manager
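Once mongo-connector is running, a quick way to confirm that the three documents reached Solr is to query the wlslog core and look at the number of matches. The following is a minimal sketch, assuming Solr is on the default localhost:8983 and that the autoCommit interval has passed; the helper names select_url, num_found, and fetch_results are my own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR_URL = "http://localhost:8983/solr"  # assumed default host and port


def select_url(core: str, query: str = "*:*", base_url: str = SOLR_URL) -> str:
    """Build a /select URL that asks for JSON output."""
    params = urlencode({"q": query, "wt": "json"})
    return f"{base_url}/{core}/select?{params}"


def num_found(select_response: dict) -> int:
    """Extract the number of matching documents from a /select response."""
    return select_response["response"]["numFound"]


def fetch_results(core: str, query: str = "*:*") -> dict:
    """Run the query against a running Solr server (needs network access)."""
    with urlopen(select_url(core, query)) as response:
        return json.load(response)
```

With the three documents from this Gist indexed, num_found(fetch_results("wlslog")) would be expected to return 3. If it returns 0, wait for the autoCommit interval and query again.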