Elasticsearch Introduction

What is Elasticsearch?

  • It is a highly scalable, open-source, full-text search engine.
  • It allows you to store and search data quickly and in near real time.
  • It is built on top of Apache Lucene.
  • It is schemaless.
  • It stores data in the form of JSON documents.
  • It has REST APIs for storing and searching data.

ES Components

  • Cluster = Server(s)
  • Node = Server
  • Index = Database
  • Type = Table
  • Document = Record (or row)
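The analogy shows up directly in the request URLs used later in this doc, which follow the pattern /index/type/id:

PUT /library/books/1

Here library is the database, books is the table, and 1 is the row's primary key.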

Type of Nodes

  • Data Node - Stores data and performs data-related operations (indexing, searching, aggregations, etc.)

  • Master Node - Maintains the health of the cluster and performs administrative tasks (creating/deleting indices, tracking which nodes are part of the cluster).

  • Coordinating Node - Receives requests from client applications, routes them, and aggregates results from data nodes.

  • By default a node is both master-eligible and a data node; roles can be changed in elasticsearch.yml, as sketched below.
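Node roles are configured per node in elasticsearch.yml. A minimal sketch, assuming ES 5.x settings (a coordinating-only node switches all roles off):

# elasticsearch.yml - dedicated master-eligible node
node.master: true
node.data: false
node.ingest: false

# elasticsearch.yml - coordinating-only node
node.master: false
node.data: false
node.ingest: false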

Installing Elasticsearch v5.6.0

  • curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.6.0.tar.gz
  • tar -xvf elasticsearch-5.6.0.tar.gz
  • cd elasticsearch-5.6.0/bin
  • ./elasticsearch

Installing Kibana v5.6.0

  • curl -L -O https://artifacts.elastic.co/downloads/kibana/kibana-5.6.0-darwin-x86_64.tar.gz
  • tar -xvf kibana-5.6.0-darwin-x86_64.tar.gz
  • cd kibana-5.6.0-darwin-x86_64/bin
  • ./kibana

Start ES and Kibana using the ./elasticsearch and ./kibana commands shown above.

ES configurations

  • elasticsearch.yml - cluster, node, network, and path settings
  • jvm.options - JVM options such as heap size (sample settings below)
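A sketch of commonly adjusted settings (the values here are illustrative, not recommendations):

# config/elasticsearch.yml
cluster.name: my-cluster
node.name: node-1
network.host: 127.0.0.1

# config/jvm.options - give the JVM heap a fixed size
-Xms2g
-Xmx2g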

Console

  • Kibana -> Dev Tools -> Console (previously called Sense)

Explore Elasticsearch Cluster

  • GET /
  • GET /_cat/health?v
  • GET /_cat/nodes?v
  • GET /_cat/indices?v

Create an index

PUT library
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 0
  }
}

Create a Document

PUT /library/books/1
{
  "title": "The quick brown fox",
  "price": 5,
  "colors": ["red", "green", "blue"]
}

Document meta fields

  • _index
  • _type
  • _id
  • _score
  • _source
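For example, fetching a document returns the meta fields alongside the stored JSON (response sketched with illustrative values; _score appears only in search hits):

GET /library/books/1

{
  "_index": "library",
  "_type": "books",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "title": "The quick brown fox",
    "price": 5,
    "colors": ["red", "green", "blue"]
  }
}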

Create documents in Bulk

  • index is the operation on each action line; alongside it we specify the document _id. The body is newline-delimited JSON: each action line is followed by its source document on its own line. A mixed-operation sketch follows this example.
POST library/books/_bulk
{ "index": { "_id": 2 } }
{ "title": "The quick brown fox jumps over the lazy dog", "price": 15, "colors": ["blue", "yellow"] }
{ "index": { "_id": 3 } }
{ "title": "The quick brown fox jumps over the lazy dog", "price": 8, "colors": ["red", "blue"] }
{ "index": { "_id": 4 } }
{ "title": "Brown fox brown dog", "price": 2, "colors": ["black", "yellow", "red", "blue"] }
{ "index": { "_id": 5 } }
{ "title": "Lazy dog", "price": 9, "colors": ["red", "blue", "green"] }

Get a Document

GET /library/books/1

Update a Document

  • By re-indexing the document (all attributes must be specified; any that are omitted are lost)
POST /library/books/1
{
  "title": "The quick fantastic fox",
  "price": 5,
  "colors": ["red", "green", "blue"]
}
  • Or by using the update API (you can specify just the attribute(s) to be updated; scripted updates are also possible, as sketched below)
POST /library/books/1/_update
{
  "doc": {
    "title": "The quick brown fox"
  }
}
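The update API also accepts scripts. A minimal sketch using Painless (in ES 5.x the script body goes under the inline key):

POST /library/books/1/_update
{
  "script": {
    "lang": "painless",
    "inline": "ctx._source.price += params.amount",
    "params": { "amount": 2 }
  }
}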

Delete a Document

DELETE /library/books/1

Basic Search (Find all documents)

  • This does not do any scoring, so all docs get the same score (1.0).
  • Get all documents in the books type.
GET library/books/_search

Find all documents having "fox" in their title

  • Get documents having fox in their title field.
GET library/books/_search
{
  "query": {
    "match": {
      "title": "fox"
    }
  }
}

Relevance

  • The relevance score of each document is represented by a positive floating-point number called the _score.
  • The higher the _score, the more relevant the document.
  • A query clause generates a _score for each document.
  • The classic scoring model is TF/IDF (term frequency/inverse document frequency); since 5.0, Elasticsearch defaults to BM25, a refinement of TF/IDF.

Term frequency

  • How often does the term appear in the field?
  • The more often, the more relevant.
  • A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention.

Inverse document frequency

  • How often does each term appear in the index?
  • The more often, the less relevant.
  • Terms that appear in many documents have a lower weight than more-uncommon terms.

Field-length norm

  • How long is the field?

  • The longer it is, the less likely it is that words in the field will be relevant.

  • A term appearing in a short title field carries more weight than the same term appearing in a long content field.

  • When a query has multiple clauses, the more clauses that match, the higher the _score.

  • The _score from each matching clause is combined to calculate the overall _score for the document (the explain example below shows the breakdown).
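To see how these factors combine for a particular document, ask for an explanation of the score; the response annotates each hit with a (verbose) scoring breakdown:

GET library/books/_search
{
  "explain": true,
  "query": {
    "match": {
      "title": "fox"
    }
  }
}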


Find all "quick" and "dog" documents (match query with multiple terms)

  • Get documents having either quick or dog in their title field.
GET library/books/_search
{
  "query": {
    "match": {
      "title": "quick dog"
    }
  }
}

Find documents with phrase "quick dog" (match_phrase query)

  • Get documents having phrase quick dog in their title field.
GET library/books/_search
{
  "query": {
    "match_phrase": {
      "title": "quick dog"
    }
  }
}

We can also do combinations of queries

  • Let's find all docs with "quick" and "lazy dog".
  • bool query allows us to combine multiple queries.
  • must clause is similar to AND in SQL; every condition inside it must match.
GET library/books/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "quick"
          }
        },
        {
          "match_phrase": {
            "title": "lazy dog"
          }
        }
      ]
    }
  }
}

Or negate parts of a query

  • Get documents whose title contains neither quick nor the phrase lazy dog.
GET library/books/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "match": {
            "title": "quick"
          }
        },
        {
          "match_phrase": {
            "title": "lazy dog"
          }
        }
      ]
    }
  }
}

Let's find all docs with "quick" OR "lazy dog".

  • Individual clauses can be boosted for different effects.
  • should clause is similar to OR in SQL; with no must clause present, at least one should clause has to match.
GET library/books/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "title": {
              "query": "quick dog"
            }
          }
        },
        {
          "match_phrase": {
            "title": {
              "query": "lazy dog",
              "score": 3
            }
          }
        }
      ]
    }
  }
}

Highlighting matching fragments

  • It tells you which parts of the title field matched.
  • You can configure it to use different emphasis markers (see the pre_tags/post_tags sketch after this query).
GET library/books/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "title": {
              "query": "quick dog",
              "score": 2
            }
          }
        },
        {
          "match_phrase": {
            "title": {
              "query": "lazy dog"
            }
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "title": {}
    }
  }
}
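By default the matching fragments are wrapped in <em> tags. A sketch of custom emphasis markers using pre_tags/post_tags:

GET library/books/_search
{
  "query": {
    "match": { "title": "dog" }
  },
  "highlight": {
    "pre_tags": ["<strong>"],
    "post_tags": ["</strong>"],
    "fields": {
      "title": {}
    }
  }
}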

Filtering

  • Filtering is often faster than querying because it doesn't have to calculate a score, and filter results can be cached.
  • Get documents with a price greater than 5 (an exact-value term filter is sketched after this example).
GET library/books/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "price": {
            "gt": 5
          }
        }
      }
    }
  }
}
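Exact-value matching also belongs in filter context. A sketch filtering on the colors.keyword sub-field (no scoring, and the result can be cached):

GET library/books/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "colors.keyword": "red"
        }
      }
    }
  }
}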

Querying & Filtering together

  • Get documents that have dog in the title and a price between 5 and 10.
GET library/books/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "dog"
          }
        }
      ],
      "filter": {
        "range": {
          "price": {
            "gte": 5,
            "lte": 10
          }
        }
      }
    }
  }
}

Analysis

  • How does full-text search actually work?

  • When documents are indexed, each document goes through an analysis step.

  • Analysis is a combination of tokenization and token filtering.

  • Analysis = Tokenization + Token filters

  • Tokenization - takes the field value and breaks it into multiple pieces called tokens.

  • Token filters - transform the tokens, massaging them into a different form (lowercasing, removing duplicates, etc.).

Tokenization breaks sentences into discrete tokens

GET /library/_analyze
{
  "tokenizer": "standard",
  "text": "Brown fox brown dog"
}
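With no token filters applied, the tokenizer alone preserves case, so the response lists the tokens Brown, fox, brown, dog, each with its offsets and position.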

And token filters manipulate those tokens

GET /library/_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Brown fox brown dog"
}

You can combine multiple token filters

GET /library/_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "unique"],
  "text": "Brown brown brown fox brown fox dog"
}

Instead of specifying a tokenizer and token filter, you can specify an analyzer.

  • Analyzer = A tokenizer + 0 or more token filters
  • This applies the standard analyzer, which combines the standard tokenizer with the lowercase token filter.
GET /library/_analyze
{
  "analyzer": "standard",
  "text": "Brown fox brown dog"
}

Understanding analysis is important: it makes your queries more relevant, and the emitted tokens determine whether a document matches a query at all.

  • The standard tokenizer does not split quick.brown_Fox, and it drops symbols like $ and @.
GET /library/_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "THE quick.brown_Fox Jumped! $19.95 @ 3.0"
}

Let's look at the letter tokenizer

  • Now quick.brown_Fox is split at the dot and the underscore,
  • but the numbers and special characters are dropped,
  • because the letter tokenizer emits tokens only for consecutive runs of letters.
GET /library/_analyze
{
  "tokenizer": "letter",
  "filter": ["lowercase"],
  "text": "THE quick.brown_Fox Jumped! $19.95 @ 3.0"
}

Another example with uax_url_email tokenizer

  • With standard tokenizer
  • This splits the email address and the URL into separate pieces.
GET /library/_analyze
{
  "tokenizer": "standard",
  "text": "elastic@example.com website https://www.elastic.co"
}
  • With uax_url_email tokenizer
  • This keeps the email address and the URL intact as single tokens.
GET /library/_analyze
{
  "tokenizer": "uax_url_email",
  "text": "elastic@example.com website https://www.elastic.co"
}
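The first request should emit pieces like elastic, example.com, https, and www.elastic.co, while the second keeps elastic@example.com and https://www.elastic.co whole.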

Aggregations

  • Aggregations can be used to explore your data and compute statistics over stored documents.

Let's find popular colors (without search results)

GET /library/_search
{
  "size": 0,
  "aggs": {
    "popular-colors": {
      "terms": {
        "field": "colors.keyword"
      }
    }
  }
}
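With size set to 0 the response contains no hits, only the aggregation: one bucket per color. A sketch of the relevant part of the response (counts are illustrative):

{
  "aggregations": {
    "popular-colors": {
      "buckets": [
        { "key": "blue", "doc_count": 4 },
        { "key": "red", "doc_count": 3 },
        { "key": "yellow", "doc_count": 2 }
      ]
    }
  }
}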

And you can search/aggregate at the same time

  • Aggregation works on the documents returned by the search query.
GET /library/_search
{
  "query": {
    "match": {
      "title": "dog"
    }
  },
  "aggs": {
    "popular-colors": {
      "terms": {
        "field": "colors.keyword"
      }
    }
  }
}

Multiple aggregations can be calculated at once and can be nested to further perform calculations.

GET /library/_search
{
  "size": 0,
  "aggs": {
    "price-statistics": {
      "terms": {
        "field": "colors.keyword"
      }
    },
    "popular-colors": {
      "terms": {
        "field": "colors.keyword"
      },
      "aggs": {
        "avg-price-per-color": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}

Index Mappings

  • ES is schemaless: when you index a document, ES infers the type of each field. You can also define a mapping explicitly to control how each field is indexed.

How to define an index mapping

  • famous-librarians is a new index
  • librarian is the type
  • text field types are analyzed for full-text search
PUT /famous-librarians
{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 0,
      "analysis": {
        "analyzer": {
          "my-desc-analyzer": {
            "type": "custom",
            "tokenizer": "uax_url_email",
            "filters": ["lowercase"]
          }
        }
      }
    }
  },
  "mappings": {
    "librarian": {
      "properties": {
        "name": {
          "type": "text"
        },
        "favorite-colors": {
          "type": "keyword"
        },
        "birth-date": {
          "type": "date",
          "format": "year_month_day"
        },
        "hometown": {
          "type": "geo_point"
        },
        "description": {
          "type": "text",
          "analyzer": "my-desc-analyzer"
        }
      }
    }
  }
}

Get the index mapping

GET /famous-librarians/_mapping
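You can verify the custom analyzer with the _analyze API; URLs and emails should come through lowercased but intact:

GET /famous-librarians/_analyze
{
  "analyzer": "my-desc-analyzer",
  "text": "Read more at https://en.wikipedia.org/wiki/Sarah_Byrd_Askew"
}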

Let's add a few documents to the famous-librarians index

PUT /famous-librarians/librarian/1
{
  "name": "Sarah Byrd Askew",
  "favorite-colors": ["yellow", "light-grey"],
  "birth-date": "1877-02-15",
  "hometown": {
    "lat": "32.349722",
    "lon": "-86.641111"
  },
  "description": "An American public librarian who poineered the establishment of libraries in the United States. https://en.wikipedia.org/wiki/Sarah_Byrd_Askew"
}
PUT /famous-librarians/librarian/2
{
  "name": "John J Beckley",
  "favorite-colors": ["red", "white"],
  "birth-date": "1757-08-07",
  "hometown": {
    "lat": "51.507222",
    "lon": "-0.1275"
  },
  "description": "An American political campaign manager and the first Librarian of the United States Congress - https://en.wikipedia.org/wiki/John_J._Beckley"
}

Search librarians

POST /famous-librarians/librarian/_search
{
  "query": {
    "match": {
      "name": "john"
    }
  }
}
POST /famous-librarians/librarian/_search
{
  "query": {
    "match": {
      "description": "https://en.wikipedia.org/wiki/Sarah_Byrd_Askew"
    }
  }
}
POST /famous-librarians/librarian/_search
{
  "query": {
    "match": {
      "description": "https://en.wikipedia.org/wiki/John_J._Beckley"
    }
  }
}
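Since hometown is mapped as a geo_point, you can also filter librarians by distance. A sketch using a geo_distance filter (the distance value is illustrative):

POST /famous-librarians/librarian/_search
{
  "query": {
    "bool": {
      "filter": {
        "geo_distance": {
          "distance": "200km",
          "hometown": {
            "lat": 32.349722,
            "lon": -86.641111
          }
        }
      }
    }
  }
}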

Next Steps
