arfprogrammer/Sphinx Search Engine.md

## Sphinx Search Engine.md

      
Installation

To install Sphinx on a Debian-based system such as Ubuntu, run:
sudo apt-get install sphinxsearch

Now you have successfully installed Sphinx on your server. Before starting the Sphinx daemon, let's configure it.
Creating the Database For File Index and Search

In this section, we will set up a database and create a table that contains the paths of the text files we want Sphinx to index and search.
Log in to the MySQL server shell.
mysql -u root -p
Enter the password for the MySQL root user when asked. Your prompt will change to mysql>.
Create a database named sphinx_index and switch to it:
CREATE DATABASE sphinx_index;
USE sphinx_index;
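Create the table that lists the file paths:
CREATE TABLE fileindex ( id INT(6) UNSIGNED AUTO_INCREMENT PRIMARY KEY, text VARCHAR(100) NOT NULL );
Add one row per file to the fileindex table. Each row holds the path of a single file. Use forward slashes, because MySQL treats a backslash inside a string literal as an escape character:
INSERT INTO fileindex ( text ) VALUES ( '/path/to/file.txt' );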

Then exit the MySQL shell.
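
With more than a handful of files, writing the INSERT statements by hand gets tedious. The following is a minimal sketch, not part of the original setup, that walks a directory and prints one INSERT per text file. The function names, the extension list and the quoting helper are assumptions made for illustration; a real script would more likely pass the paths to a MySQL driver as query parameters.

```python
import os


def collect_text_files(root, extensions=(".txt", ".md")):
    """Return sorted absolute paths of files under root with the given extensions."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(extensions):
                paths.append(os.path.abspath(os.path.join(dirpath, name)))
    return sorted(paths)


def sql_quote(value):
    """Quote a string as a MySQL literal, escaping backslashes and single quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "''") + "'"


def build_inserts(paths, max_length=100):
    """Build one INSERT per path, skipping paths that exceed the VARCHAR(100) column."""
    statements = []
    for path in paths:
        if len(path) > max_length:
            continue  # would not fit in the text column as declared above
        statements.append(
            "INSERT INTO fileindex ( text ) VALUES ( %s );" % sql_quote(path)
        )
    return statements
```

Calling `build_inserts(collect_text_files('/path/to/files'))` returns the statements as a list, which can be written to a file and fed to the mysql client. Paths longer than 100 characters are skipped rather than truncated, since a truncated path would point at a file that does not exist.
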
Configuring Sphinx

In this section, we will write the Sphinx configuration file.
Create sphinx.conf and open it in an editor:
sudo gedit /etc/sphinxsearch/sphinx.conf

A Sphinx configuration needs three main blocks to run: source, index, and searchd. Each block is described below, and together the three blocks make up the complete sphinx.conf.
The source block contains the type of source and the username and password for the MySQL server. The first column of the SQL query must be a unique id. The SQL query runs on every indexing pass and dumps the data to the Sphinx index file. Below are descriptions of each field, followed by the source block itself.

sql_host: Hostname of the MySQL host. In our example, this is localhost. This can be a domain or an IP address.
sql_user: Username for the MySQL login. In our example, this is root.
sql_pass: Password for the MySQL user. In our example, this is the root MySQL user's password.
sql_db: Name of the database that stores the data. In our example, this is sphinx_index.
sql_query: The query that dumps data to the index.
sql_query_pre: Pre-fetch query, or pre-query. Pre-queries are used to set up encoding, mark records that are going to be indexed, update internal counters, and set various per-connection SQL server options and variables. Perhaps the most frequent use is to specify the encoding that the server will use for the rows it returns. Note that Sphinx accepts only UTF-8 text.
sql_field_string: Combined string attribute and full-text field declaration.
sql_file_field: Reads document contents from the file system instead of the database. The column named here must contain a file path.
* Offloads the database
* Prevents cache thrashing on the database side
* Much faster in some cases

source src1
{
    type            = mysql
    sql_host        = localhost
    sql_user        = root
    sql_pass        = 3337033
    sql_db          = sphinx_index
    sql_port        = 3306  # optional, default is 3306
    sql_query_pre   = SET CHARACTER_SET_RESULTS=utf8
    sql_query_pre   = SET NAMES utf8
    sql_query       = SELECT id, text FROM fileindex
    sql_file_field  = text
    sql_field_string = text
}

The index component contains the source and the path to store the data.

source: Name of the source block. In our example, this is src1.
path: Path where the index files are saved.
docinfo: Document attribute values (docinfo) storage mode. Optional, default is 'extern'. Known values are 'none', 'extern' and 'inline'.

index filename
{
    source          = src1
    path            = /var/lib/sphinxsearch/data/files
    docinfo         = extern
}

The searchd component contains the port and other variables to run the Sphinx daemon.

listen: The port on which the Sphinx daemon listens. In our example, this is 9312.
query_log: Path where the query log is saved.
pid_file: Path to the PID file of the Sphinx daemon.
max_matches: Maximum number of matches to return per search. Note that the indexer output later in this guide reports this key as permanently removed in Sphinx 2.2.10, so on that version it only produces a warning.
preopen_indexes: Whether to forcibly preopen all indexes on startup.
unlink_old: Whether to unlink old index copies on successful rotation.
log: Log file name. Optional, default is 'searchd.log'.
read_timeout: Network client request read timeout, in seconds.
max_children: Maximum number of children to fork (in other words, the number of concurrent searches to run in parallel). Optional, default is 0 (unlimited).
seamless_rotate: Prevents searchd stalls while rotating indexes with huge amounts of data to precache. Optional, default is 1 (enable seamless rotation).
binlog_path: Path for the binary log (also known as transaction log) files. Optional, default is the build-time configured data directory.

searchd
{
    listen            = 9312
    log               = /var/log/sphinxsearch/searchd.log
    query_log         = /var/log/sphinxsearch/query.log
    read_timeout      = 5
    max_children      = 30
    pid_file          = /var/run/sphinxsearch/searchd.pid
    max_matches       = 1000
    seamless_rotate   = 1
    preopen_indexes   = 1
    unlink_old        = 1
    binlog_path       = /var/lib/sphinxsearch/data
}
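
To see how the three blocks fit together in one file, here is a minimal sketch that assembles them into a single sphinx.conf string. The helper names `render_block` and `render_config` and the `sql_pass` parameter are assumptions made for illustration, and the values are the ones used in this guide. The sketch leaves out max_matches, on the assumption that your Sphinx version reports it as removed, as the indexer output later in this guide does.

```python
def render_block(header, settings):
    """Render one sphinx.conf block from a list of (key, value) pairs."""
    width = max(len(key) for key, _ in settings)
    lines = [header, "{"]
    for key, value in settings:
        lines.append("    %s = %s" % (key.ljust(width), value))
    lines.append("}")
    return "\n".join(lines)


def render_config(sql_pass):
    """Assemble the source, index and searchd blocks into one config string."""
    # A list of pairs (not a dict) because sql_query_pre appears twice.
    source = [
        ("type", "mysql"),
        ("sql_host", "localhost"),
        ("sql_user", "root"),
        ("sql_pass", sql_pass),
        ("sql_db", "sphinx_index"),
        ("sql_port", "3306"),
        ("sql_query_pre", "SET CHARACTER_SET_RESULTS=utf8"),
        ("sql_query_pre", "SET NAMES utf8"),
        ("sql_query", "SELECT id, text FROM fileindex"),
        ("sql_file_field", "text"),
        ("sql_field_string", "text"),
    ]
    index = [
        ("source", "src1"),
        ("path", "/var/lib/sphinxsearch/data/files"),
        ("docinfo", "extern"),
    ]
    searchd = [
        ("listen", "9312"),
        ("log", "/var/log/sphinxsearch/searchd.log"),
        ("query_log", "/var/log/sphinxsearch/query.log"),
        ("read_timeout", "5"),
        ("max_children", "30"),
        ("pid_file", "/var/run/sphinxsearch/searchd.pid"),
        ("seamless_rotate", "1"),
        ("preopen_indexes", "1"),
        ("unlink_old", "1"),
        ("binlog_path", "/var/lib/sphinxsearch/data"),
    ]
    blocks = [
        render_block("source src1", source),
        render_block("index filename", index),
        render_block("searchd", searchd),
    ]
    return "\n\n".join(blocks) + "\n"
```

Printing `render_config('your_password')` and saving the output as /etc/sphinxsearch/sphinx.conf should give a file equivalent to the three blocks above. Passing the password as an argument keeps it out of the script itself.
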

Adding Data to the Index

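In this section, we will add data to the Sphinx index.
Build every index defined in the configuration we created earlier. The --rotate option tells indexer to signal a running searchd so that it picks up the new index files: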
sudo indexer --all --rotate

You should get something that looks like the following.
Sphinx 2.2.10-id64-release (2c212e0)
Copyright (c) 2001-2015, Andrew Aksyonoff
Copyright (c) 2008-2015, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '/etc/sphinxsearch/sphinx.conf'...
WARNING: key 'max_matches' was permanently removed from Sphinx configuration. Refer to documentation for details.
indexing index 'filename'...
collected 1 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 1 docs, 41896 bytes
total 0.073 sec, 566705 bytes/sec, 13.52 docs/sec
total 8 reads, 0.014 sec, 9.0 kb/call avg, 1.8 msec/call avg
total 12 writes, 0.000 sec, 4.6 kb/call avg, 0.0 msec/call avg
rotating indices: successfully sent SIGHUP to searchd (pid=1087).

Starting Sphinx

First, open /etc/default/sphinxsearch to check whether the Sphinx daemon is turned on or off.
sudo nano /etc/default/sphinxsearch

To enable Sphinx, find the line that begins with START and set it to yes.
START=yes

Then, save and close the file.
Finally, start the Sphinx daemon.
sudo service sphinxsearch start
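
If the daemon does not start, one thing worth checking is the START setting. The sketch below is an illustration, not part of the Sphinx package; the function name is an assumption. It takes the contents of /etc/default/sphinxsearch as a string and reports whether the last uncommented START assignment is yes.

```python
def sphinx_start_enabled(text):
    """Return True if the last uncommented START= line is set to yes."""
    enabled = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip comments and lines that are not assignments
        key, value = line.split("=", 1)
        if key.strip() == "START":
            # Later assignments override earlier ones, as in a shell file.
            enabled = value.strip().strip("\"'") == "yes"
    return enabled
```

For example, `sphinx_start_enabled(open('/etc/default/sphinxsearch').read())` reads the real file on a system where it exists.
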

Testing Search

To search the indexed contents, use one of the official native SphinxAPI implementations (PHP, Perl, Python, Ruby and Java) or a third-party API port or plugin (Perl, C#, Haskell, Ruby on Rails).
The official native SphinxAPI implementations are included in the distribution package.
Download the Sphinx source from GitHub and change into its api directory:
cd api
Create Search.py:
from sphinxapi import SphinxClient

client = SphinxClient()
client.SetServer('127.0.0.1', 9312)  # host and port from the searchd block
result = client.Query('text to search')
# Query returns None on failure; GetLastError() then explains why.
print(result if result is not None else client.GetLastError())
Run this code with Python 2.7. If the search finds a match, the result should look like this:
{'status': 0, 'matches': [{'id': 1, 'weight': 2500, 'attrs': {'text': '/home/arf/Downloads/ElasticSearch.md'}}], 'fields': ['text'], 'time': '0.000', 'total_found': 1, 'warning': '', 'attrs': [['text', 7]], 'words': [{'docs': 1, 'hits': 1, 'word': 'text'}, {'docs': 1, 'hits': 159, 'word': 'to'}, {'docs': 1, 'hits': 32, 'word': 'search'}], 'error': '', 'total': 1}

If the search finds no match, the result should look like this:
{'status': 0, 'matches': [], 'fields': ['text'], 'time': '0.000', 'total_found': 0, 'warning': '', 'attrs': [['text', 7]], 'words': [{'docs': 0, 'hits': 0, 'word': 'eqfc'}], 'error': '', 'total': 0}
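
The result is a plain dictionary, so the matched file paths can be pulled out with ordinary Python. The following is a minimal sketch; the helper names `matched_paths` and `word_stats` are assumptions made for illustration, and the sample is a shortened copy of the successful result shown above.

```python
def matched_paths(result, attr="text"):
    """Return (id, weight, path) tuples from a Query() result dictionary."""
    if not result:
        return []  # Query() returned None or an empty result
    return [
        (match["id"], match["weight"], match["attrs"].get(attr))
        for match in result.get("matches", [])
    ]


def word_stats(result):
    """Map each query word to the number of hits reported for it."""
    if not result:
        return {}
    return {entry["word"]: entry["hits"] for entry in result.get("words", [])}


# Shortened copy of the successful result shown above.
sample = {
    "status": 0,
    "matches": [
        {"id": 1, "weight": 2500,
         "attrs": {"text": "/home/arf/Downloads/ElasticSearch.md"}},
    ],
    "total_found": 1,
    "words": [
        {"docs": 1, "hits": 1, "word": "text"},
        {"docs": 1, "hits": 159, "word": "to"},
        {"docs": 1, "hits": 32, "word": "search"},
    ],
}

print(matched_paths(sample))
# [(1, 2500, '/home/arf/Downloads/ElasticSearch.md')]
```

For the unsuccessful result, `matched_paths` returns an empty list, because the matches key holds an empty list.
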

Other Descriptions

Sphinx works well with database systems for indexing and searching fields.
Unfortunately, Sphinx can't index .doc and .pdf file types directly. You'll need to either import the textual contents into a database, or into an XML format that Sphinx can understand.
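
The XML format in question is xmlpipe2. As a rough sketch of that route, the function below wraps already-extracted text in an xmlpipe2 document set. Extracting the text from .doc or .pdf files is assumed to happen elsewhere with a tool of your choice, and the function name and the field name content are assumptions made for illustration.

```python
from xml.sax.saxutils import escape


def build_xmlpipe2(documents, field="content"):
    """Return an xmlpipe2 docset string for (doc_id, text) pairs.

    doc_id must be a unique positive integer for each document.
    """
    parts = [
        '<?xml version="1.0" encoding="utf-8"?>',
        "<sphinx:docset>",
        "<sphinx:schema>",
        '<sphinx:field name="%s"/>' % field,
        "</sphinx:schema>",
    ]
    for doc_id, text in documents:
        parts.append('<sphinx:document id="%d">' % doc_id)
        # escape() replaces &, < and > so the text cannot break the XML.
        parts.append("<%s>%s</%s>" % (field, escape(text), field))
        parts.append("</sphinx:document>")
    parts.append("</sphinx:docset>")
    return "\n".join(parts)
```

A source block of type xmlpipe2 would then name a script that prints this output in its xmlpipe_command setting, in place of the MySQL settings used earlier.
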