Elasticsearch Introduction

What is Elasticsearch?

  • It is a highly scalable, open-source, full-text search engine.
  • It allows you to store and search data quickly and in near real time.
  • It is built on top of Apache Lucene.
  • It is schemaless.
  • It stores data in the form of JSON documents.
  • It has REST APIs for storing and searching data.

ES Components

  • Cluster = Server(s)
  • Node = Server
  • Index = Database
  • Type = Table
  • Document = Record (or row)
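The analogy shows up directly in the request URLs used later in this doc, which follow the pattern /index/type/id:

PUT /library/books/1

Here library is the database, books is the table, and 1 is the row's primary key.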

Type of Nodes

  • Data Node - Stores data and performs data-related operations (indexing, searching, aggregations, etc.)

  • Master Node - Maintains the health of the cluster and performs administrative tasks (creating/deleting indices, tracking which nodes are part of the cluster).

  • Coordinating Node - Receives requests from client applications, routes them, and aggregates results from data nodes.

  • By default a node is both master-eligible and a data node; roles can be changed in elasticsearch.yml, as sketched below.
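Node roles are configured per node in elasticsearch.yml. A minimal sketch, assuming ES 5.x settings (a coordinating-only node switches all roles off):

# elasticsearch.yml - dedicated master-eligible node
node.master: true
node.data: false
node.ingest: false

# elasticsearch.yml - coordinating-only node
node.master: false
node.data: false
node.ingest: false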

Installing Elasticsearch v5.6.0

  • curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.6.0.tar.gz
  • tar -xvf elasticsearch-5.6.0.tar.gz
  • cd elasticsearch-5.6.0/bin
  • ./elasticsearch

Installing Kibana v5.6.0

  • curl -L -O https://artifacts.elastic.co/downloads/kibana/kibana-5.6.0-darwin-x86_64.tar.gz
  • tar -xvf kibana-5.6.0-darwin-x86_64.tar.gz
  • cd kibana-5.6.0-darwin-x86_64/bin
  • ./kibana

Start ES and Kibana using the ./elasticsearch and ./kibana commands shown above.

ES configurations

  • elasticsearch.yml - cluster, node, network, and path settings
  • jvm.options - JVM options such as heap size (sample settings below)
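A sketch of commonly adjusted settings (the values here are illustrative, not recommendations):

# config/elasticsearch.yml
cluster.name: my-cluster
node.name: node-1
network.host: 127.0.0.1

# config/jvm.options - give the JVM heap a fixed size
-Xms2g
-Xmx2g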

Console

  • Kibana -> Dev Tools -> Console (previously called Sense)

Explore Elasticsearch Cluster

  • GET /
  • GET /_cat/health?v
  • GET /_cat/nodes?v
  • GET /_cat/indices?v

Create an index

PUT library
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 0
  }
}

Create a Document

PUT /library/books/1
{
  "title": "The quick brown fox",
  "price": 5,
  "colors": ["red", "green", "blue"]
}

Document meta fields

  • _index
  • _type
  • _id
  • _score
  • _source
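For example, fetching a document returns the meta fields alongside the stored JSON (response sketched with illustrative values; _score appears only in search hits):

GET /library/books/1

{
  "_index": "library",
  "_type": "books",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "title": "The quick brown fox",
    "price": 5,
    "colors": ["red", "green", "blue"]
  }
}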

Create documents in Bulk

  • index is the operation on each action line; alongside it we specify the document _id. The body is newline-delimited JSON: each action line is followed by its source document on its own line. A mixed-operation sketch follows this example.
POST library/books/_bulk
{ "index": { "_id": 2 } }
{ "title": "The quick brown fox jumps over the lazy dog", "price": 15, "colors": ["blue", "yellow"] }
{ "index": { "_id": 3 } }
{ "title": "The quick brown fox jumps over the lazy dog", "price": 8, "colors": ["red", "blue"] }
{ "index": { "_id": 4 } }
{ "title": "Brown fox brown dog", "price": 2, "colors": ["black", "yellow", "red", "blue"] }
{ "index": { "_id": 5 } }
{ "title": "Lazy dog", "price": 9, "colors": ["red", "blue", "green"] }

Get a Document

GET /library/books/1

Update a Document

  • By re-indexing the document (all attributes must be specified; any that are omitted are lost)
POST /library/books/1
{
  "title": "The quick fantastic fox",
  "price": 5,
  "colors": ["red", "green", "blue"]
}
  • Or by using the update API (you can specify just the attribute(s) to be updated; scripted updates are also possible, as sketched below)
POST /library/books/1/_update
{
  "doc": {
    "title": "The quick brown fox"
  }
}
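The update API also accepts scripts. A minimal sketch using Painless (in ES 5.x the script body goes under the inline key):

POST /library/books/1/_update
{
  "script": {
    "lang": "painless",
    "inline": "ctx._source.price += params.amount",
    "params": { "amount": 2 }
  }
}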

Delete a Document

DELETE /library/books/1

Basic Search (Find all documents)

  • This does not do any scoring, so all docs get the same score (1.0).
  • Get all documents in the books type.
GET library/books/_search

Find all documents having "fox" in their title

  • Get documents having fox in their title field.
GET library/books/_search
{
  "query": {
    "match": {
      "title": "fox"
    }
  }
}

Relevance

  • The relevance score of each document is represented by a positive floating-point number called the _score.
  • The higher the _score, the more relevant the document.
  • A query clause generates a _score for each document.
  • The classic scoring model is TF/IDF (term frequency/inverse document frequency); since 5.0, Elasticsearch defaults to BM25, a refinement of TF/IDF.

Term frequency

  • How often does the term appear in the field?
  • The more often, the more relevant.
  • A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention.

Inverse document frequency

  • How often does each term appear in the index?
  • The more often, the less relevant.
  • Terms that appear in many documents have a lower weight than more-uncommon terms.

Field-length norm

  • How long is the field?

  • The longer it is, the less likely it is that words in the field will be relevant.

  • A term appearing in a short title field carries more weight than the same term appearing in a long content field.

  • When a query has multiple clauses, the more clauses that match, the higher the _score.

  • The _score from each matching clause is combined to calculate the overall _score for the document (the explain example below shows the breakdown).
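To see how these factors combine for a particular document, ask for an explanation of the score; the response annotates each hit with a (verbose) scoring breakdown:

GET library/books/_search
{
  "explain": true,
  "query": {
    "match": {
      "title": "fox"
    }
  }
}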


Find all "quick" and "dog" documents (match query with multiple terms)

  • Get documents having either quick or dog in their title field.
GET library/books/_search
{
  "query": {
    "match": {
      "title": "quick dog"
    }
  }
}

Find documents with phrase "quick dog" (match_phrase query)

  • Get documents having phrase quick dog in their title field.
GET library/books/_search
{
  "query": {
    "match_phrase": {
      "title": "quick dog"
    }
  }
}

We can also do combinations of queries

  • Let's find all docs with "quick" and "lazy dog".
  • bool query allows us to combine multiple queries.
  • must clause is similar to AND in SQL; every condition inside it must match.
GET library/books/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "quick"
          }
        },
        {
          "match_phrase": {
            "title": "lazy dog"
          }
        }
      ]
    }
  }
}

Or negate parts of a query

  • Get documents whose title contains neither quick nor the phrase lazy dog.
GET library/books/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "match": {
            "title": "quick"
          }
        },
        {
          "match_phrase": {
            "title": "lazy dog"
          }
        }
      ]
    }
  }
}

Let's find all docs with "quick" OR "lazy dog".

  • Individual clauses can be boosted for different effects.
  • should clause is similar to OR in SQL; with no must clause present, at least one should clause has to match.
GET library/books/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "title": {
              "query": "quick dog"
            }
          }
        },
        {
          "match_phrase": {
            "title": {
              "query": "lazy dog",
              "score": 3
            }
          }
        }
      ]
    }
  }
}

Highlighting matching fragments

  • It tells you which parts of the title field matched.
  • You can configure it to use different emphasis markers (see the pre_tags/post_tags sketch after this query).
GET library/books/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "title": {
              "query": "quick dog",
              "score": 2
            }
          }
        },
        {
          "match_phrase": {
            "title": {
              "query": "lazy dog"
            }
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "title": {}
    }
  }
}
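By default the matching fragments are wrapped in <em> tags. A sketch of custom emphasis markers using pre_tags/post_tags:

GET library/books/_search
{
  "query": {
    "match": { "title": "dog" }
  },
  "highlight": {
    "pre_tags": ["<strong>"],
    "post_tags": ["</strong>"],
    "fields": {
      "title": {}
    }
  }
}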

Filtering

  • Filtering is often faster than querying because it doesn't have to calculate a score, and filter results can be cached.
  • Get documents with a price greater than 5 (an exact-value term filter is sketched after this example).
GET library/books/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "price": {
            "gt": 5
          }
        }
      }
    }
  }
}
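Exact-value matching also belongs in filter context. A sketch filtering on the colors.keyword sub-field (no scoring, and the result can be cached):

GET library/books/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "colors.keyword": "red"
        }
      }
    }
  }
}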

Querying & Filtering together

  • Get documents that have dog in the title and a price between 5 and 10.
GET library/books/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "dog"
          }
        }
      ],
      "filter": {
        "range": {
          "price": {
            "gte": 5,
            "lte": 10
          }
        }
      }
    }
  }
}

Analysis

  • How does full-text search actually work?

  • When documents are indexed, each document goes through an analysis step.

  • Analysis is a combination of tokenization and token filtering.

  • Analysis = Tokenization + Token filters

  • Tokenization - takes the field value and breaks it into multiple pieces called tokens.

  • Token filters - transform the tokens, massaging them into a different form (lowercasing, removing duplicates, etc.).

Tokenization breaks sentences into discrete tokens

GET /library/_analyze
{
  "tokenizer": "standard",
  "text": "Brown fox brown dog"
}
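With no token filters applied, the tokenizer alone preserves case, so the response lists the tokens Brown, fox, brown, dog, each with its offsets and position.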

And token filters manipulate those tokens

GET /library/_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Brown fox brown dog"
}

You can combine multiple token filters

GET /library/_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "unique"],
  "text": "Brown brown brown fox brown fox dog"
}

Instead of specifying a tokenizer and token filter, you can specify an analyzer.

  • Analyzer = A tokenizer + 0 or more token filters
  • This applies the standard analyzer, which combines the standard tokenizer with the lowercase token filter.
GET /library/_analyze
{
  "analyzer": "standard",
  "text": "Brown fox brown dog"
}

Understanding analysis is important: it makes your queries more relevant, and the emitted tokens determine whether a document matches a query at all.

  • The standard tokenizer does not split quick.brown_Fox, and it drops symbols like $ and @.
GET /library/_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "THE quick.brown_Fox Jumped! $19.95 @ 3.0"
}

Let's look at the letter tokenizer

  • Now quick.brown_Fox is split at the dot and the underscore,
  • but the numbers and special characters are dropped,
  • because the letter tokenizer emits tokens only for consecutive runs of letters.
GET /library/_analyze
{
  "tokenizer": "letter",
  "filter": ["lowercase"],
  "text": "THE quick.brown_Fox Jumped! $19.95 @ 3.0"
}

Another example with uax_url_email tokenizer

  • With standard tokenizer
  • This splits the email address and the URL into separate pieces.
GET /library/_analyze
{
  "tokenizer": "standard",
  "text": "elastic@example.com website https://www.elastic.co"
}
  • With uax_url_email tokenizer
  • This keeps the email address and the URL intact as single tokens.
GET /library/_analyze
{
  "tokenizer": "uax_url_email",
  "text": "elastic@example.com website https://www.elastic.co"
}
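The first request should emit pieces like elastic, example.com, https, and www.elastic.co, while the second keeps elastic@example.com and https://www.elastic.co whole.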

Aggregations

  • Aggregations can be used to explore your data and compute statistics over stored documents.

Let's find popular colors (without search results)

GET /library/_search
{
  "size": 0,
  "aggs": {
    "popular-colors": {
      "terms": {
        "field": "colors.keyword"
      }
    }
  }
}
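With size set to 0 the response contains no hits, only the aggregation: one bucket per color. A sketch of the relevant part of the response (counts are illustrative):

{
  "aggregations": {
    "popular-colors": {
      "buckets": [
        { "key": "blue", "doc_count": 4 },
        { "key": "red", "doc_count": 3 },
        { "key": "yellow", "doc_count": 2 }
      ]
    }
  }
}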

And you can search/aggregate at the same time

  • Aggregation works on the documents returned by the search query.
GET /library/_search
{
  "query": {
    "match": {
      "title": "dog"
    }
  },
  "aggs": {
    "popular-colors": {
      "terms": {
        "field": "colors.keyword"
      }
    }
  }
}

Multiple aggregations can be calculated at once and can be nested to further perform calculations.

GET /library/_search
{
  "size": 0,
  "aggs": {
    "price-statistics": {
      "terms": {
        "field": "colors.keyword"
      }
    },
    "popular-colors": {
      "terms": {
        "field": "colors.keyword"
      },
      "aggs": {
        "avg-price-per-color": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}

Index Mappings

  • ES is schemaless: when you index a document, ES infers the type of each field. You can also define a mapping explicitly to control how each field is indexed.

How to define an index mapping

  • famous-librarians is a new index
  • librarian is the type
  • text field types are analyzed for full-text search
PUT /famous-librarians
{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 0,
      "analysis": {
        "analyzer": {
          "my-desc-analyzer": {
            "type": "custom",
            "tokenizer": "uax_url_email",
            "filters": ["lowercase"]
          }
        }
      }
    }
  },
  "mappings": {
    "librarian": {
      "properties": {
        "name": {
          "type": "text"
        },
        "favorite-colors": {
          "type": "keyword"
        },
        "birth-date": {
          "type": "date",
          "format": "year_month_day"
        },
        "hometown": {
          "type": "geo_point"
        },
        "description": {
          "type": "text",
          "analyzer": "my-desc-analyzer"
        }
      }
    }
  }
}

Get the index mapping

GET /famous-librarians/_mapping
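You can verify the custom analyzer with the _analyze API; URLs and emails should come through lowercased but intact:

GET /famous-librarians/_analyze
{
  "analyzer": "my-desc-analyzer",
  "text": "Read more at https://en.wikipedia.org/wiki/Sarah_Byrd_Askew"
}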

Let's add a few documents to the famous-librarians index

PUT /famous-librarians/librarian/1
{
  "name": "Sarah Byrd Askew",
  "favorite-colors": ["yellow", "light-grey"],
  "birth-date": "1877-02-15",
  "hometown": {
    "lat": "32.349722",
    "lon": "-86.641111"
  },
  "description": "An American public librarian who poineered the establishment of libraries in the United States. https://en.wikipedia.org/wiki/Sarah_Byrd_Askew"
}
PUT /famous-librarians/librarian/2
{
  "name": "John J Beckley",
  "favorite-colors": ["red", "white"],
  "birth-date": "1757-08-07",
  "hometown": {
    "lat": "51.507222",
    "lon": "-0.1275"
  },
  "description": "An American political campaign manager and the first Librarian of the United States Congress - https://en.wikipedia.org/wiki/John_J._Beckley"
}

Search librarians

POST /famous-librarians/librarian/_search
{
  "query": {
    "match": {
      "name": "john"
    }
  }
}
POST /famous-librarians/librarian/_search
{
  "query": {
    "match": {
      "description": "https://en.wikipedia.org/wiki/Sarah_Byrd_Askew"
    }
  }
}
POST /famous-librarians/librarian/_search
{
  "query": {
    "match": {
      "description": "https://en.wikipedia.org/wiki/John_J._Beckley"
    }
  }
}
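Since hometown is mapped as a geo_point, you can also filter librarians by distance. A sketch using a geo_distance filter (the distance value is illustrative):

POST /famous-librarians/librarian/_search
{
  "query": {
    "bool": {
      "filter": {
        "geo_distance": {
          "distance": "200km",
          "hometown": {
            "lat": 32.349722,
            "lon": -86.641111
          }
        }
      }
    }
  }
}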

Next Steps
