Last active
May 18, 2016 11:55
ElasticSearch
Installing Elasticsearch
1. Visit https://www.elastic.co/downloads
2. Download the zip version and unzip it.
3. Run bin/elasticsearch
4. Visit http://localhost:9200/. If it returns status 200, Elasticsearch is successfully installed.
Elasticsearch uses JavaScript Object Notation, or JSON, as the serialization format for documents. JSON serialization is
supported by most programming languages, and has become the standard format used by the NoSQL movement. It is simple,
concise, and easy to read.
Suppose we have a user object. We can express its structure and meaning as JSON. Converting an object into
meaningful JSON is straightforward:
{
  "email": "sb@kreeti.com",
  "first_name": "Santanu",
  "last_name": "Bhattacharya",
  "info": {
    "bio": "Eco-warrior and defender of the weak",
    "age": 30,
    "interests": [ "dolphins", "whales" ]
  },
  "join_date": "2012/07/06"
}
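As a minimal sketch of that conversion in Ruby, the snippet below turns a plain object into the JSON document shown above using only the standard library. The UserRecord and UserInfo structs and the to_document method are hypothetical names chosen for illustration; a real application would serialize its own model.

```ruby
require "json"

# Hypothetical plain Ruby objects standing in for an application's user model.
UserInfo = Struct.new(:bio, :age, :interests, keyword_init: true)
UserRecord = Struct.new(:email, :first_name, :last_name, :info, :join_date, keyword_init: true) do
  # Convert the object, including the nested info struct, into a plain Hash.
  def to_document
    to_h.merge(info: info.to_h)
  end
end

user = UserRecord.new(
  email: "sb@kreeti.com",
  first_name: "Santanu",
  last_name: "Bhattacharya",
  info: UserInfo.new(
    bio: "Eco-warrior and defender of the weak",
    age: 30,
    interests: ["dolphins", "whales"]
  ),
  join_date: "2012/07/06"
)

# Serialize the Hash to the JSON text that would be sent to Elasticsearch.
document_json = JSON.pretty_generate(user.to_document)
puts document_json
```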
Indexing
Before searching, we have to store the data. A single document represents a single user. The act of storing data
in Elasticsearch is called indexing. We need to decide where to store the documents.
Relational DB ⇒ Databases ⇒ Tables ⇒ Rows ⇒ Columns
Elasticsearch ⇒ Indices ⇒ Types ⇒ Documents ⇒ Fields
To store users, we can create an index named kreeti with a type named user. Each document contains all the details
of a single user. To index a user with ID 1, send a PUT request with the document as the request body:
PUT /kreeti/user/1
{
  "first_name" : "Santanu",
  "last_name" : "Bhattacharya",
  "age" : 25,
  "about" : "I love to go rock climbing",
  "interests": [ "sports", "music" ]
}
To retrieve the document, send a GET request to the same path:
GET /kreeti/user/1
Response:
{
  "_index" : "kreeti",
  "_type" : "user",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "first_name" : "Santanu",
    "last_name" : "Bhattacharya",
    "age" : 25,
    "about" : "I love to go rock climbing",
    "interests": [ "sports", "music" ]
  }
}
To search, use the _search endpoint. With no query, it returns all users (only the first 10 hits by default):
GET /kreeti/user/_search
To find users by first name, pass a query string:
GET /kreeti/user/_search?q=first_name:Santanu
The same search can be written with a request body:
GET /kreeti/user/_search
{
  "query" : {
    "match" : {
      "first_name" : "Santanu"
    }
  }
}
This returns the same results as the previous request. The request body uses the query DSL (domain-specific language). The difference
is that we no longer use a query string; instead we send a JSON request body that uses a match query.
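To make the two forms concrete, here is a small Ruby sketch that builds both the query-string path and the equivalent request body. The helper names query_string_path and match_query_body are hypothetical; the snippet only constructs strings and does not contact a cluster.

```ruby
require "json"
require "uri"

# Query-string form: the search is encoded in the q parameter of the URL.
def query_string_path(index, type, field, value)
  "/#{index}/#{type}/_search?" + URI.encode_www_form({ "q" => "#{field}:#{value}" })
end

# Request-body form: the same search expressed as a match query in the query DSL.
def match_query_body(field, value)
  JSON.generate({ query: { match: { field => value } } })
end

search_path = query_string_path("kreeti", "user", "first_name", "Santanu")
search_body = match_query_body("first_name", "Santanu")

puts search_path
puts search_body
```

Note that the colon is percent-encoded in the generated URL; the server decodes it back to first_name:Santanu.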
A filter can be combined with a query. The following search finds users whose last name is Bhattacharya and whose age is greater than 30:
GET /kreeti/user/_search
{
  "query" : {
    "filtered" : {
      "filter" : {
        "range" : {
          "age" : { "gt" : 30 }
        }
      },
      "query" : {
        "match" : {
          "last_name" : "Bhattacharya"
        }
      }
    }
  }
}
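The filtered query used above was deprecated in Elasticsearch 2.0 and removed in 5.0. On newer versions the same search is usually written as a bool query with a filter clause. The sketch below shows that rewrite as a plain Ruby hash transformation; filtered_to_bool is a hypothetical helper written for this note, not part of any Elasticsearch client.

```ruby
require "json"

# Rewrite the body of a "filtered" query as the equivalent "bool" query:
# the scoring query moves to "must" and the filter moves to "filter".
def filtered_to_bool(filtered)
  bool = {}
  bool[:must] = filtered[:query] if filtered[:query]
  bool[:filter] = filtered[:filter] if filtered[:filter]
  { bool: bool }
end

filtered = {
  filter: { range: { age: { gt: 30 } } },
  query: { match: { last_name: "Bhattacharya" } }
}

bool_query = filtered_to_bool(filtered)
puts JSON.pretty_generate({ query: bool_query })
```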
Full Text Search
We are going to search for all users who enjoy rock climbing:
GET /kreeti/user/_search
{
  "query" : {
    "match" : {
      "about" : "rock climbing"
    }
  }
}
By default, Elasticsearch sorts matching results by their relevance score, that is, by how well each document
matches the query. Both documents below mention "rock", but only the first also mentions "climbing", so it scores higher.
{
  ...
  "hits": {
    "total": 2,
    "max_score": 0.16273327,
    "hits": [
      {
        ...
        "_score": 0.16273327,
        "_source": {
          "first_name": "Santanu",
          "last_name": "Bhattacharya",
          "age": 25,
          "about": "I love to go rock climbing",
          "interests": [ "sports", "music" ]
        }
      },
      {
        ...
        "_score": 0.016878016,
        "_source": {
          "first_name": "Santanu",
          "last_name": "Karmakar",
          "age": 32,
          "about": "I like to collect rock albums",
          "interests": [ "music" ]
        }
      }
    ]
  }
}
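A client usually only needs the hits array from such a response. The following Ruby sketch parses a trimmed copy of the response above (the elided fields are simply left out) and lists each user with its relevance score; ranked_names is a hypothetical helper.

```ruby
require "json"

# Trimmed copy of the search response shown above.
response_json = <<~JSON
  {
    "hits": {
      "total": 2,
      "max_score": 0.16273327,
      "hits": [
        { "_score": 0.16273327,  "_source": { "first_name": "Santanu", "last_name": "Bhattacharya" } },
        { "_score": 0.016878016, "_source": { "first_name": "Santanu", "last_name": "Karmakar" } }
      ]
    }
  }
JSON

# Return [full name, score] pairs in the order Elasticsearch returned them.
def ranked_names(response)
  response.fetch("hits").fetch("hits").map do |hit|
    source = hit.fetch("_source")
    ["#{source['first_name']} #{source['last_name']}", hit.fetch("_score")]
  end
end

ranking = ranked_names(JSON.parse(response_json))
ranking.each { |name, score| puts format("%-22s %.6f", name, score) }
```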
Phrase Search
To match the exact phrase "rock climbing" instead of either word on its own, use the match_phrase query:
GET /kreeti/user/_search
{
  "query" : {
    "match_phrase" : {
      "about" : "rock climbing"
    }
  }
}
Finding Exact Values
First, index some products with the bulk API:
POST /my_store/products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10, "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20, "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30, "productID" : "QQPX-R-3956-#aD8" }
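The bulk request body is newline-delimited JSON: one action line followed by one document line per product, with a newline after the last line. A minimal Ruby sketch that produces the body above from an array of hashes could look like this; bulk_body is a hypothetical helper and nothing is sent over the network.

```ruby
require "json"

products = [
  { id: 1, price: 10, productID: "XHDK-A-1293-#fJ3" },
  { id: 2, price: 20, productID: "KDKE-B-9947-#kL5" },
  { id: 3, price: 30, productID: "JODL-X-1937-#pV7" },
  { id: 4, price: 30, productID: "QQPX-R-3956-#aD8" }
]

# Build the newline-delimited body for POST /my_store/products/_bulk.
def bulk_body(docs)
  lines = docs.flat_map do |doc|
    source = doc.reject { |key, _value| key == :id }
    [
      JSON.generate({ index: { _id: doc[:id] } }), # action line
      JSON.generate(source)                        # document line
    ]
  end
  lines.join("\n") + "\n" # the body must end with a newline
end

body = bulk_body(products)
puts body
```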
The search API expects a query, not a filter, so we wrap the term filter in a filtered query. The following search finds products with a price of 20:
GET /my_store/products/_search
{
  "query" : {
    "filtered" : {
      "query" : {
        "match_all" : {}
      },
      "filter" : {
        "term" : {
          "price" : 20
        }
      }
    }
  }
}
Filter with Text
The term filter works with text as well. The following search is the equivalent of this SQL:
SELECT * FROM products WHERE productID = "XHDK-A-1293-#fJ3"
GET /my_store/products/_search
{
  "query" : {
    "filtered" : {
      "filter" : {
        "term" : {
          "productID" : "XHDK-A-1293-#fJ3"
        }
      }
    }
  }
}
Note that the term filter looks for the exact value, so this returns no hits if productID is an analyzed string field. The field has to be mapped as not_analyzed for the exact match to work.
Filters can be combined with the bool filter. The SQL query
SELECT product FROM products WHERE (price = 20 OR productID = "XHDK-A-1293-#fJ3") AND (price != 30)
can be written as:
GET /my_store/products/_search
{
  "query" : {
    "filtered" : {
      "filter" : {
        "bool" : {
          "should" : [
            { "term" : {"price" : 20}},
            { "term" : {"productID" : "XHDK-A-1293-#fJ3"}}
          ],
          "must_not" : {
            "term" : {"price" : 30}
          }
        }
      }
    }
  }
}
Bool Filter
The bool filter is composed of three sections:
{
  "bool" : {
    "must" : [],
    "should" : [],
    "must_not" : []
  }
}
must: All of these clauses must match. The equivalent of AND.
must_not: All of these clauses must not match. The equivalent of NOT.
should: At least one of these clauses must match. The equivalent of OR.
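To see how the three sections combine, here is a small in-memory Ruby sketch that applies the same rules to the sample products. It is only an illustration of the boolean logic described above, under the assumption that every clause is an exact term match; it ignores analysis and scoring, and term_match? and bool_match? are hypothetical helpers.

```ruby
# In-memory copy of the sample products.
PRODUCTS = [
  { id: 1, price: 10, productID: "XHDK-A-1293-#fJ3" },
  { id: 2, price: 20, productID: "KDKE-B-9947-#kL5" },
  { id: 3, price: 30, productID: "JODL-X-1937-#pV7" },
  { id: 4, price: 30, productID: "QQPX-R-3956-#aD8" }
].freeze

# A term clause is a one-entry hash such as { price: 20 }.
def term_match?(doc, clause)
  field, value = clause.first
  doc[field] == value
end

# must: all clauses match; should: at least one matches (if any are given);
# must_not: no clause matches.
def bool_match?(doc, must: [], should: [], must_not: [])
  must.all? { |clause| term_match?(doc, clause) } &&
    (should.empty? || should.any? { |clause| term_match?(doc, clause) }) &&
    must_not.none? { |clause| term_match?(doc, clause) }
end

# (price = 20 OR productID = "XHDK-A-1293-#fJ3") AND (price != 30)
matching = PRODUCTS.select do |doc|
  bool_match?(
    doc,
    should: [{ price: 20 }, { productID: "XHDK-A-1293-#fJ3" }],
    must_not: [{ price: 30 }]
  )
end

matching_ids = matching.map { |doc| doc[:id] }
p matching_ids # => [1, 2]
```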
Single Query String
Today, users expect advanced search from a single field into which they type all their search terms.
When your only user input is a single query string, you will encounter three scenarios frequently: best fields, most fields, and cross fields.
Best fields
When searching for words that represent a concept, such as “brown fox,” the words mean more together than they do individually. Fields like the title and body, while related, can be considered to be in competition with each other.
Imagine that we have a website that allows users to search blog posts, such as these two documents:
PUT /my_index/my_type/1
{
  "title": "Quick brown rabbits",
  "body": "Brown rabbits are commonly seen."
}
PUT /my_index/my_type/2
{
  "title": "Keeping pets healthy",
  "body": "My quick brown fox eats rabbits on a regular basis."
}
The user types in the words “Brown fox” and clicks Search. We don’t know ahead of time if the user’s search terms will be found in the title or the body field of the post, but it is likely that the user is searching for related words. To our eyes, document 2 appears to be the better match, as it contains both words that we are looking for.
Now we run the following bool query:
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "Brown fox" }},
        { "match": { "body": "Brown fox" }}
      ]
    }
  }
}
dis_max query
The bool query adds up the scores of every matching clause, so document 1, which contains "brown" in both fields, can outrank document 2. Instead of the bool query, we can use the dis_max (Disjunction Max) query, which returns documents that match any of these queries and scores each document by its best-matching clause.
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "Brown fox" }},
        { "match": { "body": "Brown fox" }}
      ]
    }
  }
}
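The difference between the two queries can be sketched with a toy calculation in Ruby. The per-field scores below are made-up numbers chosen for illustration, not values computed by Elasticsearch, and the bool scoring is simplified to a plain sum. The tie_breaker argument mirrors the dis_max parameter of the same name.

```ruby
# Hypothetical per-field scores for the query "Brown fox".
CLAUSE_SCORES = {
  1 => { title: 0.5, body: 0.3 }, # "brown" appears in both fields
  2 => { title: 0.0, body: 0.7 }  # "brown" and "fox" both appear in the body only
}.freeze

# Simplified bool/should scoring: add up the scores of all clauses.
def bool_score(scores)
  scores.values.sum
end

# dis_max scoring: the best clause wins; the other clauses contribute
# only through the tie_breaker factor.
def dis_max_score(scores, tie_breaker: 0.0)
  best = scores.values.max
  best + tie_breaker * (scores.values.sum - best)
end

# Document ids ordered from highest to lowest score.
def ranking(&scorer)
  CLAUSE_SCORES.keys.sort_by { |id| -scorer.call(CLAUSE_SCORES[id]) }
end

bool_ranking = ranking { |scores| bool_score(scores) }
dis_max_ranking = ranking { |scores| dis_max_score(scores) }

puts "bool:    #{bool_ranking.inspect}"
puts "dis_max: #{dis_max_ranking.inspect}"
```

With these numbers the bool query ranks document 1 first, while dis_max ranks document 2 first, which matches the intuition that document 2 is the better match.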
Most fields
A common technique for fine-tuning relevance is to index the same data into multiple fields, each with its own analysis chain. The main field may contain terms in their stemmed form, synonyms, and so on.
Cross fields
For some entities, the identifying information is spread across multiple fields, each of which contains just a part of the whole:
User: first_name and last_name
Address: street, city, country, and postcode
In this case, we want to find as many words as possible in any of the listed fields. We need to search across multiple fields as if they were one big field.
A user is indexed as:
{
  "first_name": "santanu",
  "last_name": "bhattacharya"
}
An address is indexed as:
{
  "street": "sdfsdf",
  "postcode": "78894556",
  ...
}
Searching with most_fields:
{
  "query": {
    "multi_match": {
      "query": "peter smith",
      "type": "most_fields",
      "operator": "and",
      "fields": [ "first_name", "last_name" ]
    }
  }
}
Searching with cross_fields:
{
  "query": {
    "multi_match": {
      "query": "Poland Street W1V",
      "type": "cross_fields",
      "fields": [ "street", "city", "country", "postcode", "first_name", "last_name" ]
    }
  }
}
The cross_fields type first analyzes the query string to produce a list of terms, and then matches each term against any of the listed fields.
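The practical difference shows up when the "and" operator is used. With most_fields, all terms have to appear in the same field, while cross_fields only needs each term to appear in some field. The Ruby sketch below imitates both behaviours in memory. It is a simplification under stated assumptions: analysis is reduced to lowercasing and splitting on whitespace, scoring is ignored, and all helper names are hypothetical.

```ruby
# Very rough stand-in for an analyzer: lowercase and split on whitespace.
def analyze(text)
  text.to_s.downcase.split
end

# Field-centric matching (most_fields with operator "and"):
# some single field must contain every query term.
def field_centric_match?(doc, query, fields)
  fields.any? { |field| (analyze(query) - analyze(doc[field])).empty? }
end

# Term-centric matching (cross_fields with operator "and"):
# every query term must appear in at least one of the fields.
def term_centric_match?(doc, query, fields)
  analyze(query).all? do |term|
    fields.any? { |field| analyze(doc[field]).include?(term) }
  end
end

person = { first_name: "Peter", last_name: "Smith" }
fields = [:first_name, :last_name]

field_centric = field_centric_match?(person, "peter smith", fields)
term_centric = term_centric_match?(person, "peter smith", fields)

puts "field-centric match: #{field_centric}"
puts "term-centric match:  #{term_centric}"
```

In this sketch the field-centric check fails because neither field contains both words, while the term-centric check succeeds.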
Pagination
Suppose a search reports that 14 documents match, but the hits array contains only 10 of them. How can we see the other documents? The search API accepts two parameters:
size: Indicates the number of results that should be returned. Defaults to 10.
from: Indicates the number of initial results that should be skipped. Defaults to 0.
If you wanted to show five results per page, then pages 1 to 3 could be requested as follows:
GET /_search?size=5
GET /_search?size=5&from=5
GET /_search?size=5&from=10
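The requests above follow a simple rule: from is the page number minus one, multiplied by the page size. A minimal Ruby sketch of that arithmetic, with pagination_params as a hypothetical helper name:

```ruby
# Translate a 1-based page number into the size/from search parameters.
def pagination_params(page:, per_page: 10)
  raise ArgumentError, "page must be 1 or greater" if page < 1

  { size: per_page, from: (page - 1) * per_page }
end

(1..3).each do |page|
  params = pagination_params(page: page, per_page: 5)
  puts "GET /_search?size=#{params[:size]}&from=#{params[:from]}"
end
```

For page 1 this prints from=0, which is the default and is therefore omitted in the requests shown above.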
How to implement it in our Rails applications
Add the following gems to the Gemfile:
gem 'elasticsearch-model'
gem 'elasticsearch-persistence'
gem 'elasticsearch-rails'
SET UP INDEX
In the User model:
class User < ActiveRecord::Base
  include Elasticsearch::Model
  index_name "users_index"
  settings index: {
    number_of_shards: 1
  } do
    # dynamic: 'false' keeps fields that are not listed here out of the mapping
    mapping dynamic: 'false' do
      # "word_start" is a custom analyzer and must be defined in the index analysis settings
      indexes :first_name, type: "string", index_analyzer: "word_start", search_analyzer: "standard"
      indexes :company, type: "string"
      indexes :no_of_products, type: "long"
      indexes :address do
        indexes :created_at, type: "date"
      end
    end
  end
end
Build the search parameters as a Ruby hash:
query_parameter = {
  query: {
    filtered: {
      filter: {
        bool: {
          must: [
            { term: { first_name: "Santanu" } },
            { range: { age: { gt: 30 } } }
          ]
        }
      }
    }
  },
  sort: [{ created_at: "desc" }]
}
INDEX DOCUMENT
# Create the index (with force: true, an existing index is deleted and recreated)
User.__elasticsearch__.create_index!
User.__elasticsearch__.create_index! force: true
# Index a single record
user = User.find(10)
user.__elasticsearch__.index_document
# Delete a document from the index
user.__elasticsearch__.delete_document
# Import all records, or only the records in a scope
User.import
User.import(scope: :name_of_scope)
# Search
User.search(query_parameter).records
User.search(query_parameter).page(2).per(10).records # for pagination; requires a pagination gem such as kaminari