@santanub
Last active May 18, 2016 11:55
ElasticSearch
Installing elasticsearch
1. Visit https://www.elastic.co/downloads
2. Download the zip version and unzip it.
3. Run bin/elasticsearch
4. Visit http://localhost:9200/. If it returns HTTP status 200, Elasticsearch is installed successfully.
Elasticsearch uses JavaScript Object Notation, or JSON, as the serialization format for documents. JSON serialization is
supported by most programming languages, and has become the standard format used by the NoSQL movement. It is simple,
concise, and easy to read.
Suppose we have a user object. We can convert its structure and meaning into JSON. Converting an object into
meaningful JSON is straightforward:
{
"email": "sb@kreeti.com",
"first_name": "Santanu",
"last_name": "Bhattacharya",
"info": {
"bio": "Eco-warrior and defender of the weak",
"age": 30,
"interests": [ "dolphins", "whales" ]
},
"join_date": "2012/07/06"
}
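As a minimal sketch of that conversion, assuming a hypothetical plain Ruby struct named UserRecord (it is not part of any Elasticsearch library), the standard json library can serialize an object into a document like the one above:

```ruby
require 'json'

# Hypothetical user object; the struct name and its fields are assumptions
# chosen to mirror the JSON document above.
UserRecord = Struct.new(:email, :first_name, :last_name, :info, :join_date) do
  # Convert the struct into a Hash with string keys, ready for JSON.
  def to_document
    to_h.transform_keys(&:to_s)
  end
end

user = UserRecord.new(
  "sb@kreeti.com", "Santanu", "Bhattacharya",
  { "bio" => "Eco-warrior and defender of the weak",
    "age" => 30,
    "interests" => ["dolphins", "whales"] },
  "2012/07/06"
)

# Print the document as formatted JSON.
puts JSON.pretty_generate(user.to_document)
```

Nested values such as info stay nested in the output, which is what lets a single document describe a whole user.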
Indexing
Before searching, we have to store the data. A single document represents a single user. The act of storing data
in Elasticsearch is called indexing. Before indexing a document, we need to decide where to store it.
Relational DB ⇒ Databases ⇒ Tables ⇒ Rows ⇒ Columns
Elasticsearch ⇒ Indices ⇒ Types ⇒ Documents ⇒ Fields
To store users, we can create an index named kreeti with a type named user. Each document contains all the details
of a single user. To index a user, we send a PUT request to /index/type/id:
PUT /kreeti/user/1
Request body
{
"first_name" : "Santanu",
"last_name" : "Bhattacharya",
"age" : 25,
"about" : "I love to go rock climbing",
"interests": [ "sports", "music" ]
}
GET /kreeti/user/1
Response
{
"_index" : "kreeti",
"_type" : "user",
"_id" : "1",
"_version" : 1,
"found" : true,
"_source" : {
"first_name" : "Santanu",
"last_name" : "Bhattacharya",
"age" : 25,
"about" : "I love to go rock climbing",
"interests": [ "sports", "music" ]
}
}
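The two requests above could be issued from Ruby roughly as follows. This is a sketch using only the standard library; the helper names (index_request, get_request, perform) are made up for illustration, and the node address is assumed to be the local one from the installation step.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Assumed address of the local node started in the installation step.
ES_URL = "http://localhost:9200"

# Build (but do not send) the PUT request that indexes a document.
def index_request(index, type, id, document)
  request = Net::HTTP::Put.new(URI("#{ES_URL}/#{index}/#{type}/#{id}"))
  request["Content-Type"] = "application/json"
  request.body = JSON.generate(document)
  request
end

# Build the GET request that fetches the same document back.
def get_request(index, type, id)
  Net::HTTP::Get.new(URI("#{ES_URL}/#{index}/#{type}/#{id}"))
end

# Send a request to the node and parse the JSON response.
def perform(request)
  uri = request.uri
  response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
  JSON.parse(response.body)
end
```

With a node running, perform(index_request("kreeti", "user", 1, { "first_name" => "Santanu" })) followed by perform(get_request("kreeti", "user", 1)) should return a hash containing the _source shown above.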
GET /kreeti/user/_search
Returns all users. By default, only the first 10 results are returned.
GET /kreeti/user/_search?q=first_name:Santanu
GET /kreeti/user/_search
{
"query" : {
"match" : {
"first_name" : "Santanu"
}
}
}
This returns the same results as the previous request. It uses the query DSL (domain-specific language). The difference
is that we no longer use a query string; instead we send a JSON request body that contains a match query.
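To make the two forms concrete, here is a small sketch that builds each of them. The helper names are hypothetical; only the standard library is used.

```ruby
require 'json'
require 'uri'

# Query-string form: the search is encoded in the URL's q parameter.
def query_string_path(index, type, field, value)
  "/#{index}/#{type}/_search?" + URI.encode_www_form(q: "#{field}:#{value}")
end

# Query DSL form: the same search expressed as a JSON request body.
def match_body(field, value)
  { "query" => { "match" => { field => value } } }
end

puts query_string_path("kreeti", "user", "first_name", "Santanu")
puts JSON.generate(match_body("first_name", "Santanu"))
```

The request-body form is easier to extend, because further clauses are just more nested hashes rather than more query-string syntax.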
GET /kreeti/user/_search
{
"query" : {
"filtered" : {
"filter" : {
"range" : {
"age" : { "gt" : 30 }
}
},
"query" : {
"match" : {
"last_name" : "Bhattacharya"
}
}
}
}
}
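The request above finds users whose last name matches Bhattacharya and who are older than 30. A sketch of building that body in Ruby follows; the helper names are hypothetical. The second helper shows the bool query with a filter clause, which is how later Elasticsearch versions (where filtered is no longer available) express the same search.

```ruby
require 'json'

# Wrap a query and a filter in the filtered query used in this document.
def filtered_query(query:, filter:)
  { "query" => { "filtered" => { "filter" => filter, "query" => query } } }
end

# The same search as a bool query with a filter clause, the form that
# replaced "filtered" in later Elasticsearch versions.
def bool_query(query:, filter:)
  { "query" => { "bool" => { "must" => query, "filter" => filter } } }
end

older_than_30 = { "range" => { "age" => { "gt" => 30 } } }
bhattacharya  = { "match" => { "last_name" => "Bhattacharya" } }

puts JSON.pretty_generate(filtered_query(query: bhattacharya, filter: older_than_30))
```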
Full Text Search
We are going to search for all users who enjoy rock climbing:
GET /kreeti/user/_search
{
"query" : {
"match" : {
"about" : "rock climbing"
}
}
}
By default, Elasticsearch sorts matching results by their relevance score, that is, by how well each document
matches the query.
{
...
"hits": {
"total": 2,
"max_score": 0.16273327,
"hits": [
{
...
"_score": 0.16273327,
"_source": {
"first_name": "Santanu",
"last_name": "Bhattacharya",
"age": 25,
"about": "I love to go rock climbing",
"interests": [ "sports", "music" ]
}
},
{
...
"_score": 0.016878016,
"_source": {
"first_name": "Santanu",
"last_name": "Karmakar",
"age": 32,
"about": "I like to collect rock albums",
"interests": [ "music" ]
}
}
]
}
}
Phrase Search
GET /kreeti/user/_search
{
"query" : {
"match_phrase" : {
"about" : "rock climbing"
}
}
}
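The difference between match and match_phrase can be illustrated with a simplified in-memory model. This is only a sketch: the tokenizer below is a rough stand-in for the standard analyzer, and the method names are invented for illustration.

```ruby
# Simplified tokenizer: lowercase and split on non-word characters
# (an approximation of the standard analyzer, assumed for illustration).
def tokens(text)
  text.downcase.scan(/\w+/)
end

# match: true if the field contains at least one of the query terms.
def match?(field_text, query)
  (tokens(field_text) & tokens(query)).any?
end

# match_phrase: true only if all query terms appear next to each other,
# in the same order.
def match_phrase?(field_text, query)
  terms = tokens(query)
  return false if terms.empty?
  tokens(field_text).each_cons(terms.size).include?(terms)
end

p match?("I like to collect rock albums", "rock climbing")
p match_phrase?("I like to collect rock albums", "rock climbing")
p match_phrase?("I love to go rock climbing", "rock climbing")
```

In this model the second user matches the match query through the single word rock, but only the first user matches the phrase.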
Finding Exact values
POST /my_store/products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10, "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20, "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30, "productID" : "QQPX-R-3956-#aD8" }
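The body of a _bulk request is newline-delimited JSON: one action line followed by one source line per document. A sketch of generating the body above from Ruby data (the helper and constant names are hypothetical):

```ruby
require 'json'

# The four products from the bulk request above, keyed by document id.
BULK_PRODUCTS = {
  1 => { "price" => 10, "productID" => "XHDK-A-1293-#fJ3" },
  2 => { "price" => 20, "productID" => "KDKE-B-9947-#kL5" },
  3 => { "price" => 30, "productID" => "JODL-X-1937-#pV7" },
  4 => { "price" => 30, "productID" => "QQPX-R-3956-#aD8" }
}

# Build the bulk body: an action line, then the document source,
# each on its own line.
def bulk_body(documents)
  lines = documents.flat_map do |id, source|
    [JSON.generate({ "index" => { "_id" => id } }), JSON.generate(source)]
  end
  lines.join("\n") + "\n" # the bulk body must end with a newline
end

puts bulk_body(BULK_PRODUCTS)
```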
The search API expects a query, not a filter, so we wrap the term filter in a filtered query:
GET /my_store/products/_search
{
"query" : {
"filtered" : {
"query" : {
"match_all" : {}
},
"filter" : {
"term" : {
"price" : 20
}
}
}
}
}
Filter with Text
GET /my_store/products/_search
{
"query" : {
"filtered" : {
"filter" : {
"term" : {
"productID" : "XHDK-A-1293-#fJ3"
}
}
}
}
}
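The term filter above is the equivalent of the following SQL. Note that it only matches if productID is mapped as not_analyzed; an analyzed string field is split into tokens at index time, so the exact value is not found.
select * from products where productID = "XHDK-A-1293-#fJ3"
Filters can be combined. The following SQL
SELECT product FROM products WHERE (price = 20 OR productID = "XHDK-A-1293-#fJ3") AND (price != 30)
can be written with a bool filter: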
GET /my_store/products/_search
{
"query" : {
"filtered" : {
"filter" : {
"bool" : {
"should" : [
{ "term" : {"price" : 20}},
{ "term" : {"productID" : "XHDK-A-1293-#fJ3"}}
],
"must_not" : {
"term" : {"price" : 30}
}
}
}
}
}
}
Bool Filter
The bool filter is composed of three sections:
{
"bool" : {
"must" : [],
"should" : [],
"must_not" : []
}
}
must
All of these clauses must match. The equivalent of AND.
must_not
All of these clauses must not match. The equivalent of NOT.
should
At least one of these clauses must match. The equivalent of OR.
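The three sections can be modelled in a few lines of plain Ruby. This is an in-memory sketch of the semantics described above, not how Elasticsearch evaluates filters internally; all names are invented for illustration.

```ruby
# The four products indexed earlier, as plain hashes.
CATALOG = [
  { "id" => 1, "price" => 10, "productID" => "XHDK-A-1293-#fJ3" },
  { "id" => 2, "price" => 20, "productID" => "KDKE-B-9947-#kL5" },
  { "id" => 3, "price" => 30, "productID" => "JODL-X-1937-#pV7" },
  { "id" => 4, "price" => 30, "productID" => "QQPX-R-3956-#aD8" }
]

# A term clause: true when the field holds exactly the given value.
def term(field, value)
  ->(doc) { doc[field] == value }
end

# bool semantics: all must clauses, no must_not clause, and at least one
# should clause (should is only enforced when it has clauses).
def bool_match?(doc, must: [], should: [], must_not: [])
  must.all? { |clause| clause.call(doc) } &&
    must_not.none? { |clause| clause.call(doc) } &&
    (should.empty? || should.any? { |clause| clause.call(doc) })
end

# Ids of the catalog entries that pass the given bool clauses.
def matching_ids(**clauses)
  CATALOG.select { |doc| bool_match?(doc, **clauses) }.map { |doc| doc["id"] }
end

# (price = 20 OR productID = "XHDK-A-1293-#fJ3") AND (price != 30)
p matching_ids(
  should: [term("price", 20), term("productID", "XHDK-A-1293-#fJ3")],
  must_not: [term("price", 30)]
) # prints [1, 2]
```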
Single Query String
Instead of an advanced search form, users today expect a single field in which to type all their search terms.
When your only user input is a single query string, you will encounter three scenarios frequently:
Best fields
When searching for words that represent a concept, such as “brown fox,” the words mean more together than they do individually. Fields like the title and body, while related, can be considered to be in competition with each other.
Imagine that we have a website that allows users to search blog posts, such as these two documents:
PUT /my_index/my_type/1
{
"title": "Quick brown rabbits",
"body": "Brown rabbits are commonly seen."
}
PUT /my_index/my_type/2
{
"title": "Keeping pets healthy",
"body": "My quick brown fox eats rabbits on a regular basis."
}
The user types in the words “Brown fox” and clicks Search. We don’t know ahead of time if the user’s search terms will be found in the title or the body field of the post, but it is likely that the user is searching for related words. To our eyes, document 2 appears to be the better match, as it contains both words that we are looking for.
Now we run the following bool query:
{
"query": {
"bool": {
"should": [
{ "match": { "title": "Brown fox" }},
{ "match": { "body": "Brown fox" }}
]
}
}
}
dis_max query
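The bool query adds up the scores of both clauses, so document 1, which contains brown in both fields, outranks document 2. Instead of the bool query, we can use the dis_max (Disjunction Max) query, which returns documents that match any of these queries and scores each document by its best-matching clause: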
{
"query": {
"dis_max": {
"queries": [
{ "match": { "title": "Brown fox" }},
{ "match": { "body": "Brown fox" }}
]
}
}
}
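The effect on ranking can be sketched with made-up numbers. The per-clause scores below are hypothetical (real scores come from Elasticsearch and will differ), and the bool formula is a simplified model: sum the matching clauses, then scale by the fraction of clauses that matched.

```ruby
# Hypothetical per-clause scores for the query "Brown fox".
# nil means the clause did not match that document.
CLAUSE_SCORES = {
  1 => { "title" => 0.3, "body" => 0.3 }, # "brown" in title and in body
  2 => { "title" => nil, "body" => 0.5 }  # "brown" and "fox" both in body
}

# bool/should (simplified): add the scores of the matching clauses, then
# scale by the fraction of clauses that matched.
def bool_score(scores)
  matched = scores.values.compact
  matched.sum * matched.size / scores.size.to_f
end

# dis_max: take the score of the single best-matching clause.
def dis_max_score(scores)
  scores.values.compact.max || 0.0
end

# Document ids ordered from highest to lowest score.
def ranking(scorer)
  CLAUSE_SCORES.keys.sort_by { |id| -send(scorer, CLAUSE_SCORES[id]) }
end

p ranking(:bool_score)    # document 1 first
p ranking(:dis_max_score) # document 2 first
```

Under these assumed scores, bool rewards document 1 for matching in two fields, while dis_max rewards document 2 for having the single best field.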
Most fields
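A common technique for fine-tuning relevance is to index the same data into multiple fields, each with its own analysis chain. The main field may contain words in their stemmed form or with synonyms added, so that it matches as many documents as possible, while the other fields keep more exact forms of the words to boost closer matches.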
Cross fields
For some entities, the identifying information is spread across multiple fields, each of which contains just a part of the whole:
User: first_name and last_name
Address: street, city, country, and postcode
In this case, we want to find as many words as possible in any of the listed fields. We need to search across multiple fields as if they were one big field.
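User indexed as
{
"first_name": "santanu",
"last_name": "bhattacharya"
}
Address indexed as
{
"street": "sdfsdf",
"postcode": "78894556",
...
}
With most_fields and the and operator, all terms must be present in the same field, so the following query only matches a document that has both peter and smith in first_name, or both in last_name: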
{
"query": {
"multi_match": {
"query": "peter smith",
"type": "most_fields",
"operator": "and",
"fields": [ "first_name", "last_name" ]
}
}
}
{
"query": {
"multi_match": {
"query": "Poland Street W1V",
"type": "cross_fields",
"fields": [ "street", "city", "country", "postcode", "first_name", "last_name" ]
}
}
}
The cross_fields type first analyzes the query string to produce a list of terms, and then looks for each term in any of the listed fields.
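The two matching strategies can be compared with a simplified in-memory model. The sketch below assumes the and operator in both cases, a naive tokenizer, and a hypothetical user document; the method names are invented for illustration.

```ruby
# Naive tokenizer standing in for the analyzer (an assumption).
def tokens(text)
  text.downcase.scan(/\w+/)
end

# Field-centric (most_fields with operator "and"): every term must be
# found in the same single field.
def field_centric_match?(doc, query, fields)
  terms = tokens(query)
  fields.any? { |field| (terms - tokens(doc.fetch(field, ""))).empty? }
end

# Term-centric (cross_fields with operator "and"): every term must be
# found in at least one of the fields.
def term_centric_match?(doc, query, fields)
  all_tokens = fields.flat_map { |field| tokens(doc.fetch(field, "")) }
  (tokens(query) - all_tokens).empty?
end

# Hypothetical document for the "peter smith" query.
person = { "first_name" => "Peter", "last_name" => "Smith" }
fields = ["first_name", "last_name"]

p field_centric_match?(person, "peter smith", fields) # prints false
p term_centric_match?(person, "peter smith", fields)  # prints true
```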
Pagination
Suppose an empty search tells us that 14 documents in the cluster match our (empty) query, but there are
only 10 documents in the hits array. How can we see the other documents?
size
Indicates the number of results that should be returned, defaults to 10
from
Indicates the number of initial results that should be skipped, defaults to 0
If you wanted to show five results per page, then pages 1 to 3 could be requested as follows:
GET /_search?size=5
GET /_search?size=5&from=5
GET /_search?size=5&from=10
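The from value for a given page follows directly from the page size. A small sketch of that arithmetic, with hypothetical helper names:

```ruby
require 'uri'

# Translate a 1-based page number into the from/size search parameters.
def page_params(page, per_page = 10)
  page = [page.to_i, 1].max # treat anything below 1 as the first page
  { size: per_page, from: (page - 1) * per_page }
end

# Build the search path for a page; from=0 is sent explicitly for page 1.
def search_path(page, per_page = 10)
  "/_search?" + URI.encode_www_form(page_params(page, per_page))
end

(1..3).each { |page| puts "GET #{search_path(page, 5)}" }
```

For five results per page this yields from values of 0, 5 and 10, matching the three requests above (from=0 is the default, so the first request can omit it).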
How to implement it in our Rails applications
Add these gems to the Gemfile:
gem 'elasticsearch-model'
gem 'elasticsearch-persistence'
gem 'elasticsearch-rails'
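SET UP INDEX
In the User model:
class User < ActiveRecord::Base
  include Elasticsearch::Model

  index_name "users_index"

  settings index: { number_of_shards: 1 } do
    # dynamic: 'false' keeps fields that are not listed here out of the mapping
    mapping dynamic: 'false' do
      # "word_start" is a custom analyzer; it has to be defined in the index
      # analysis settings, otherwise index creation fails
      indexes :first_name, type: "string", index_analyzer: "word_start", search_analyzer: "standard"
      indexes :company, type: "string"
      indexes :no_of_products, type: "long"
      # object field with its own sub-fields
      indexes :address do
        indexes :created_at, type: "date"
      end
    end
  end
end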
# Users whose first_name is Santanu and who are older than 30, newest first
query_parameter = {
  query: {
    filtered: {
      filter: {
        bool: {
          must: [
            { term: { first_name: "Santanu" } },
            { range: { age: { gt: 30 } } }
          ]
        }
      }
    }
  },
  sort: [{ created_at: "desc" }]
}
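INDEX DOCUMENT
To create the index (nothing happens if it already exists)
User.__elasticsearch__.create_index!
To delete any existing index first and then create it again
User.__elasticsearch__.create_index! force: true
To index a single record
user = User.find(10)
user.__elasticsearch__.index_document
To delete a document from the index
user.__elasticsearch__.delete_document
To import all records, or only the records in a scope
User.import
User.import(scope: :name_of_scope)
To search
User.search(query_parameter).records
User.search(query_parameter).page(2).per(10).records # for pagination (requires the kaminari gem)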