## elasticsearch.text
Installing Elasticsearch
1. Visit https://www.elastic.co/downloads
2. Download the zip version and unzip it.
3. Run bin/elasticsearch
4. Visit http://localhost:9200/. If it returns status 200, Elasticsearch is successfully installed.

Elasticsearch uses JavaScript Object Notation, or JSON, as the serialization format for documents. JSON serialization is
supported by most programming languages, and has become the standard format used by the NoSQL movement. It is simple,
concise, and easy to read.

Suppose we have a user object. Converting an object into JSON is straightforward, and the JSON version keeps both
the structure and the meaning of the original:

{
    "email":      "sb@kreeti.com",
    "first_name": "Santanu",
    "last_name":  "Bhattacharya",
    "info": {
        "bio":         "Eco-warrior and defender of the weak",
        "age":         30,
        "interests": [ "dolphins", "whales" ]
    },
    "join_date": "2012/07/06"
}
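As a small illustration, this is roughly how such a document could be produced from a Ruby hash with the standard json library. The user hash is simply the example object above; nothing here talks to Elasticsearch.

```ruby
require "json"

# The user object from above, expressed as a plain Ruby hash.
user = {
  email:      "sb@kreeti.com",
  first_name: "Santanu",
  last_name:  "Bhattacharya",
  info: {
    bio:       "Eco-warrior and defender of the weak",
    age:       30,
    interests: ["dolphins", "whales"]
  },
  join_date: "2012/07/06"
}

# Serialize the hash to a JSON string.
json = JSON.pretty_generate(user)
puts json

# Parse it back to confirm that nested structure survives the round trip.
parsed = JSON.parse(json)
puts parsed["info"]["interests"].inspect
```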


Indexing

Before searching, we have to store the data. A single document represents a single user. The act of storing data
in Elasticsearch is called indexing. We need to decide where to store each document.

Relational DB  ⇒ Databases ⇒ Tables ⇒ Rows      ⇒ Columns
Elasticsearch  ⇒ Indices   ⇒ Types  ⇒ Documents ⇒ Fields

To store users, we can create an index named kreeti with a type named user. Each document of that type contains all
the details of a single user. To index a document, we PUT it to /index/type/id:

PUT /kreeti/user/1
data
{
    "first_name" : "Santanu",
    "last_name" :  "Bhattacharya",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}

GET /kreeti/user/1
response
{
  "_index" :   "kreeti",
  "_type" :    "user",
  "_id" :      "1",
  "_version" : 1,
  "found" :    true,
  "_source" :  {
      "first_name" :  "Santanu",
      "last_name" :   "Bhattacharya",
      "age" :         25,
      "about" :       "I love to go rock climbing",
      "interests":  [ "sports", "music" ]
  }
}
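The two requests above could be built from Ruby with the standard net/http library, roughly as follows. This is only a sketch: BASE_URL and the helper names are assumptions made for illustration, and the requests are built but not sent, so no running node is needed.

```ruby
require "json"
require "net/http"
require "uri"

# Assumed address of a local node; adjust for your setup.
BASE_URL = "http://localhost:9200"

# Build (but do not send) the PUT request that indexes a document
# under /index/type/id.
def build_index_request(index, type, id, document)
  uri = URI.join(BASE_URL, "/#{index}/#{type}/#{id}")
  request = Net::HTTP::Put.new(uri)
  request["Content-Type"] = "application/json"
  request.body = JSON.generate(document)
  request
end

# Build the GET request that fetches the same document back.
def build_get_request(index, type, id)
  Net::HTTP::Get.new(URI.join(BASE_URL, "/#{index}/#{type}/#{id}"))
end

document = {
  first_name: "Santanu",
  last_name:  "Bhattacharya",
  age:        25,
  about:      "I love to go rock climbing",
  interests:  ["sports", "music"]
}

put_request = build_index_request("kreeti", "user", 1, document)
get_request = build_get_request("kreeti", "user", 1)

puts "#{put_request.method} #{put_request.path}"
puts put_request.body
puts "#{get_request.method} #{get_request.path}"

# To send a request to a running node:
#   Net::HTTP.start("localhost", 9200) { |http| http.request(put_request) }
```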

GET /kreeti/user/_search

This returns all users. By default, only the first 10 hits are returned.

GET /kreeti/user/_search?q=first_name:Santanu

GET /kreeti/user/_search
{
    "query" : {
        "match" : {
            "first_name" : "Santanu"
        }
    }
}

This returns the same results as the previous request. The difference is that we are no longer using a query string;
instead we send a JSON request body written in the query DSL (domain-specific language), here using a match query.
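To make the two forms concrete, here is a sketch that builds both from Ruby. The helper names are made up for this example; it only constructs the path and the body and does not contact a server.

```ruby
require "json"
require "uri"

# Query-string form: the condition travels in the URL as ?q=field:value.
def query_string_search_path(index, type, field, value)
  query = URI.encode_www_form([["q", "#{field}:#{value}"]])
  "/#{index}/#{type}/_search?#{query}"
end

# Query DSL form: the same condition as a JSON request body.
def match_query_body(field, value)
  { query: { match: { field => value } } }
end

path = query_string_search_path("kreeti", "user", "first_name", "Santanu")
body = match_query_body("first_name", "Santanu")

puts path                       # the colon is percent-encoded as %3A
puts JSON.pretty_generate(body)
```

The request body form is more verbose, but it is the one that extends to filters, phrase queries and the other examples in these notes.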

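A filtered query combines a query with a filter. The following finds users whose last name matches Bhattacharya
and who are older than 30:

GET /kreeti/user/_search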
{
    "query" : {
        "filtered" : {
            "filter" : {
                "range" : {
                    "age" : { "gt" : 30 }
                }
            },
            "query" : {
                "match" : {
                    "last_name" : "Bhattacharya"
                }
            }
        }
    }
}

Full Text Search

We are going to search for all users who enjoy rock climbing:

GET /kreeti/user/_search
{
    "query" : {
        "match" : {
            "about" : "rock climbing"
        }
    }
}

By default, Elasticsearch sorts matching results by their relevance score, that is, by how well each document
matches the query.

{
   ...
   "hits": {
      "total":      2,
      "max_score":  0.16273327,
      "hits": [
         {
            ...
            "_score":         0.16273327,
            "_source": {
               "first_name":  "Santanu",
               "last_name":   "Bhattacharya",
               "age":         25,
               "about":       "I love to go rock climbing",
               "interests": [ "sports", "music" ]
            }
         },
         {
            ...
            "_score":         0.016878016,
            "_source": {
               "first_name":  "Santanu",
               "last_name":   "Karmakar",
               "age":         32,
               "about":       "I like to collect rock albums",
               "interests": [ "music" ]
            }
         }
      ]
   }
}
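Reading such a response from Ruby could look like the sketch below. The JSON is a trimmed copy of the response above (the elided "..." parts are simply left out), and the code only relies on the hits.hits, _score and _source keys shown there.

```ruby
require "json"

# Trimmed copy of the search response shown above.
response_json = <<~RESPONSE
  {
    "hits": {
      "total": 2,
      "max_score": 0.16273327,
      "hits": [
        { "_score": 0.16273327,
          "_source": { "first_name": "Santanu", "last_name": "Bhattacharya",
                       "about": "I love to go rock climbing" } },
        { "_score": 0.016878016,
          "_source": { "first_name": "Santanu", "last_name": "Karmakar",
                       "about": "I like to collect rock albums" } }
      ]
    }
  }
RESPONSE

response = JSON.parse(response_json)

# Hits arrive already ordered by descending _score, so we keep their order.
ranked = response["hits"]["hits"].map do |hit|
  [hit["_source"]["last_name"], hit["_score"]]
end

ranked.each { |name, score| puts format("%-14s %.6f", name, score) }
```

The second user matches only on the word "rock", which is why the score is so much lower.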

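Phrase Search

A match query finds documents that contain any of the words. To match only documents that contain the exact
sequence of words, use match_phrase: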

GET /kreeti/user/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    }
}
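The difference between match and match_phrase can be illustrated with a toy, in-memory model. This is not how Elasticsearch is implemented; the tokens helper is a crude stand-in for an analyzer and all names are invented for the example.

```ruby
# Toy analyzer: lowercase the text and split it into words.
def tokens(text)
  text.downcase.scan(/\w+/)
end

# match (default "or" behaviour): at least one query term is in the field.
def toy_match?(field_text, query)
  (tokens(field_text) & tokens(query)).any?
end

# match_phrase: all query terms appear next to each other, in order.
def toy_match_phrase?(field_text, query)
  wanted = tokens(query)
  tokens(field_text).each_cons(wanted.size).any? { |window| window == wanted }
end

abouts = {
  "Bhattacharya" => "I love to go rock climbing",
  "Karmakar"     => "I like to collect rock albums"
}

match_hits  = abouts.select { |_, about| toy_match?(about, "rock climbing") }.keys
phrase_hits = abouts.select { |_, about| toy_match_phrase?(about, "rock climbing") }.keys

puts "match:        #{match_hits.inspect}"   # both users
puts "match_phrase: #{phrase_hits.inspect}"  # only Bhattacharya
```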

Finding Exact values

POST /my_store/products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10, "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20, "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30, "productID" : "QQPX-R-3956-#aD8" }
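A bulk body like the one above is newline-delimited JSON: one action line followed by one source line per document. The sketch below builds that body in Ruby; bulk_index_body is a hypothetical helper name and nothing is sent to a server.

```ruby
require "json"

# Build a _bulk request body: for every document, an "index" action line
# followed by the document source, each terminated by a newline.
def bulk_index_body(documents_by_id)
  lines = documents_by_id.flat_map do |id, source|
    [JSON.generate({ index: { _id: id } }), JSON.generate(source)]
  end
  lines.join("\n") + "\n"
end

products = {
  1 => { price: 10, productID: "XHDK-A-1293-#fJ3" },
  2 => { price: 20, productID: "KDKE-B-9947-#kL5" },
  3 => { price: 30, productID: "JODL-X-1937-#pV7" },
  4 => { price: 30, productID: "QQPX-R-3956-#aD8" }
}

body = bulk_index_body(products)
puts body
```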


As discussed in the query DSL, the search API expects a query, not a filter, so we wrap the term filter in a
filtered query:

GET /my_store/products/_search
{
    "query" : {
        "filtered" : {
            "query" : {
                "match_all" : {}
            },
            "filter" : {
                "term" : {
                    "price" : 20
                }
            }
        }
    }
}

Filter with Text

GET /my_store/products/_search
{
    "query" : {
        "filtered" : {
            "filter" : {
                "term" : {
                    "productID" : "XHDK-A-1293-#fJ3"
                }
            }
        }
    }
}

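The query above is the equivalent of this SQL:

SELECT * FROM products WHERE productID = "XHDK-A-1293-#fJ3"

Note that a term filter looks for the exact value, so it only matches if productID is mapped as not_analyzed.
With the default analyzer, the ID is split into smaller tokens at index time.

Filters can be combined with a bool filter. For example, this SQL:

SELECT product FROM products WHERE (price = 20 OR productID = "XHDK-A-1293-#fJ3") AND (price != 30)

translates to: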

GET /my_store/products/_search
{
   "query" : {
      "filtered" : {
         "filter" : {
            "bool" : {
              "should" : [
                 { "term" : {"price" : 20}},
                 { "term" : {"productID" : "XHDK-A-1293-#fJ3"}}
              ],
              "must_not" : {
                 "term" : {"price" : 30}
              }
           }
         }
      }
   }
}

Bool Filter

The bool filter is composed of three sections:

{
   "bool" : {
      "must" :     [],
      "should" :   [],
      "must_not" : []
   }
}
must
  All of these clauses must match. The equivalent of AND.
must_not
  All of these clauses must not match. The equivalent of NOT.
should
  At least one of these clauses must match. The equivalent of OR.
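The three sections can be mimicked with a small in-memory evaluator. This is a toy model that follows the descriptions above, not Elasticsearch's implementation; term and bool_match? are names invented for the example, and the products are the four documents indexed earlier.

```ruby
PRODUCTS = [
  { id: 1, price: 10, productID: "XHDK-A-1293-#fJ3" },
  { id: 2, price: 20, productID: "KDKE-B-9947-#kL5" },
  { id: 3, price: 30, productID: "JODL-X-1937-#pV7" },
  { id: 4, price: 30, productID: "QQPX-R-3956-#aD8" }
].freeze

# A clause is a lambda that takes a document and returns true or false.
def term(field, value)
  ->(doc) { doc[field] == value }
end

# Toy bool filter:
#   must     - every clause matches (AND)
#   should   - at least one clause matches (OR), when any are given
#   must_not - no clause matches (NOT)
def bool_match?(doc, must: [], should: [], must_not: [])
  must.all? { |clause| clause.call(doc) } &&
    (should.empty? || should.any? { |clause| clause.call(doc) }) &&
    must_not.none? { |clause| clause.call(doc) }
end

# (price = 20 OR productID = "XHDK-A-1293-#fJ3") AND (price != 30)
matching_ids = PRODUCTS.select do |doc|
  bool_match?(doc,
              should:   [term(:price, 20), term(:productID, "XHDK-A-1293-#fJ3")],
              must_not: [term(:price, 30)])
end.map { |doc| doc[:id] }

puts matching_ids.inspect  # => [1, 2]
```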

Single Query String

Users of an advanced search often just want a single field in which to type all their search terms.
When the only user input is a single query string, you will frequently encounter three scenarios:

Best fields
    When searching for words that represent a concept, such as “brown fox,” the words mean more together than they do individually. Fields like the title and body, while related, can be considered to be in competition with each other.
Imagine that we have a website that allows users to search blog posts, such as these two documents:

PUT /my_index/my_type/1
{
    "title": "Quick brown rabbits",
    "body":  "Brown rabbits are commonly seen."
}

PUT /my_index/my_type/2
{
    "title": "Keeping pets healthy",
    "body":  "My quick brown fox eats rabbits on a regular basis."
}

The user types in the words “Brown fox” and clicks Search. We don’t know ahead of time if the user’s search terms will be found in the title or the body field of the post, but it is likely that the user is searching for related words. To our eyes, document 2 appears to be the better match, as it contains both words that we are looking for.

Now we run the following bool query:

{
    "query": {
        "bool": {
            "should": [
                { "match": { "title": "Brown fox" }},
                { "match": { "body":  "Brown fox" }}
            ]
        }
    }
}

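dis_max query

The bool query adds up the scores of every matching clause, so document 1, which contains "brown" in both fields, can outrank document 2, which contains both words in a single field. Instead of the bool query, we can use the dis_max (Disjunction Max) query. It returns documents that match any of these queries, but uses the score of the single best-matching query as the score of the document: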
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Brown fox" }},
                { "match": { "body":  "Brown fox" }}
            ]
        }
    }
}
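The effect on ranking can be sketched with toy numbers. The scoring below is an assumption made purely for illustration (a field's score is the number of query terms it contains, and the bool-style total is scaled by the fraction of clauses that matched); real relevance scores are computed differently, but the ordering comes out the same way for these two documents.

```ruby
DOCS = {
  1 => { title: "Quick brown rabbits",
         body:  "Brown rabbits are commonly seen." },
  2 => { title: "Keeping pets healthy",
         body:  "My quick brown fox eats rabbits on a regular basis." }
}.freeze

# Toy analyzer: lowercase the text and split it into words.
def tokens(text)
  text.downcase.scan(/\w+/)
end

# Toy per-field score: how many distinct query terms the field contains.
def field_score(text, query)
  (tokens(query).uniq & tokens(text)).size
end

# bool/should style: add up the clause scores, then scale by the
# fraction of clauses that matched at all.
def bool_score(doc, query)
  scores  = doc.values.map { |text| field_score(text, query) }
  matched = scores.count(&:positive?)
  scores.sum * matched.to_f / scores.size
end

# dis_max style: keep only the score of the best-matching clause.
def dis_max_score(doc, query)
  doc.values.map { |text| field_score(text, query) }.max
end

DOCS.each do |id, doc|
  puts "doc #{id}: bool=#{bool_score(doc, 'Brown fox')} " \
       "dis_max=#{dis_max_score(doc, 'Brown fox')}"
end
```

With the bool-style score, document 1 comes first; with the dis_max-style score, document 2 does.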

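Most fields
    A common technique for fine-tuning relevance is to index the same data into multiple fields, each with its own analysis chain. The main field may contain words in their stemmed form or expanded with synonyms, so that it matches as many documents as possible. Other fields can index the original form of the text for more precise matching.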

Cross fields
   For some entities, the identifying information is spread across multiple fields, each of which contains just a part of the whole:
        User: first_name and last_name
        Address: street, city, country, and postcode
    In this case, we want to find as many words as possible in any of the listed fields. We need to search across multiple fields as if they were one big field.

User indexed as
{
    "first_name": "santanu",
    "last_name":  "bhattacharya"
}

Address indexed as
{
    "street":   "sdfsdf",
    "postcode": "78894556",
    ...
}

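A most_fields query is field-centric: with the and operator, all terms must be present in the same field, so this query only matches when a single field contains both peter and smith:

{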
    "query": {
        "multi_match": {
            "query":       "peter smith",
            "type":        "most_fields",
            "operator":    "and",
            "fields":      [ "first_name", "last_name" ]
        }
    }
}


{
  "query": {
    "multi_match": {
      "query":       "Poland Street W1V",
      "type":        "cross_fields",
      "fields":      [ "street", "city", "country", "postcode", "first_name", "last_name" ]
    }
  }
}

The cross_fields type first analyzes the query string to produce a list of terms, and then looks for each term in any of the listed fields.
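The difference between the field-centric and the term-centric approach can be shown with a toy model. Both helpers below assume the and operator, and they only imitate the matching rule; they are not how Elasticsearch evaluates multi_match.

```ruby
# Toy analyzer: lowercase the text and split it into words.
def tokens(text)
  text.downcase.scan(/\w+/)
end

# Field-centric (most_fields with operator "and"): some single field
# must contain every query term.
def field_centric_and?(doc, fields, query)
  terms = tokens(query)
  fields.any? { |field| (terms - tokens(doc.fetch(field, ""))).empty? }
end

# Term-centric (cross_fields with operator "and"): every query term
# must appear in at least one of the fields.
def term_centric_and?(doc, fields, query)
  all_tokens = fields.flat_map { |field| tokens(doc.fetch(field, "")) }
  (tokens(query) - all_tokens).empty?
end

user   = { first_name: "peter", last_name: "smith" }
fields = [:first_name, :last_name]

puts field_centric_and?(user, fields, "peter smith")  # => false
puts term_centric_and?(user, fields, "peter smith")   # => true
```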

Pagination

A search returns only the first 10 documents in the hits array by default, even when more documents match the
query. To see the other documents, use the size and from parameters:

size
  Indicates the number of results that should be returned, defaults to 10
from
  Indicates the number of initial results that should be skipped, defaults to 0

If you wanted to show five results per page, then pages 1 to 3 could be requested as follows:

GET /_search?size=5
GET /_search?size=5&from=5
GET /_search?size=5&from=10
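The arithmetic behind those requests is from = (page - 1) * size. A minimal sketch, with page_params and search_path as made-up helper names:

```ruby
# Translate a 1-based page number into the size/from parameters.
def page_params(page, per_page = 10)
  raise ArgumentError, "page must be >= 1" if page < 1

  { size: per_page, from: (page - 1) * per_page }
end

# Build the search path for a given page.
def search_path(page, per_page = 10)
  params = page_params(page, per_page)
  "/_search?size=#{params[:size]}&from=#{params[:from]}"
end

(1..3).each { |page| puts "GET #{search_path(page, 5)}" }
```

For page 1 this emits from=0 explicitly, which is the same as leaving the parameter out.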

How to implement it in our Rails applications

Add these gems to the Gemfile:

gem 'elasticsearch-model'
gem 'elasticsearch-persistence'
gem 'elasticsearch-rails'

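SET UP INDEX

In the User model:

class User < ActiveRecord::Base
  include Elasticsearch::Model

  index_name "users_index"

  settings index: { number_of_shards: 1 } do
    # dynamic: 'false' means fields that are not listed here are not indexed.
    mapping dynamic: 'false' do
      # "word_start" is a custom analyzer; it has to be defined in the
      # index analysis settings, which are not shown here.
      indexes :first_name,     type: "string", index_analyzer: "word_start", search_analyzer: "standard"
      indexes :company,        type: "string"
      indexes :no_of_products, type: "long"

      # Nested object with its own fields.
      indexes :address do
        indexes :created_at,   type: "date"
      end
    end
  end
end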

A search definition is passed as a Ruby hash:

query_parameter = {
  query: {
    filtered: {
      filter: {
        bool: {
          must: [
            { term:  { first_name: "Santanu" } },
            { range: { age: { gt: 30 } } }
          ]
        }
      }
    }
  },
  sort: [{ created_at: "desc" }]
}

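INDEX DOCUMENT

Create the index (force: true deletes and recreates it if it already exists):

User.__elasticsearch__.create_index!
User.__elasticsearch__.create_index! force: true

Index a single record:

user = User.find(10)
user.__elasticsearch__.index_document

Delete a single document from the index:

user.__elasticsearch__.delete_document

Import all records, or only those in a scope:

User.import
User.import(scope: :name_of_scope)

Search:

User.search(query_parameter).records
User.search(query_parameter).per(10).page(2).records  # pagination; needs a pagination gem such as kaminari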