@viztastic
Last active September 29, 2016 12:55
Quick notes to help me get my head around the basics of Elasticsearch: what is an inverted index, what is a normalised index, tokenisation vs. normalisation, etc.

Elasticsearch 101 - Inverted Index, Normalisation and Analyzers.

Examples

"The quick brown fox jumped over the lazy dog"

"Quick brown foxes leap over lazy dogs in summer"

Inverted index:

  1. Separate words and terms (Tokenization)
  2. Sort unique terms
  3. List documents containing terms.
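
As a rough sketch of the three steps above (plain Python, not Elasticsearch itself), the two example sentences can be turned into a tiny inverted index. The names `docs`, `tokenize` and `build_inverted_index` are made up for this illustration, and the regex tokenizer is an assumption that is far simpler than a real one.

```python
import re
from collections import defaultdict

# The two example sentences, keyed by a made-up document id.
docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

def tokenize(text):
    """Step 1: split the text into word tokens (no lowercasing yet)."""
    return re.findall(r"\w+", text)

def build_inverted_index(docs):
    """Steps 2 and 3: sorted unique terms, each mapped to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}

inverted_index = build_inverted_index(docs)
for term, doc_ids in inverted_index.items():
    print(term, doc_ids)
```

Note that "Quick" and "quick" end up as separate terms here, which is the kind of mismatch that normalisation is meant to remove.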

Normalised Index:

  1. Reduce everything to lowercase.
  2. Remove stop words (e.g. "the")
  3. Stem words to their root form (e.g. foxes and fox have the same stem)
  4. Draw on synonyms (e.g. jump and leap can be merged into jump)
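
The four normalisation steps can be sketched in plain Python as below. Everything here is a toy assumption: the stop word list, the stem table and the synonym table are tiny hand-written lookups for the example sentences, not what Elasticsearch ships with.

```python
# Hypothetical, hand-written lookup tables for the two example sentences.
STOP_WORDS = {"the", "in"}
STEMS = {"foxes": "fox", "dogs": "dog", "jumped": "jump"}
SYNONYMS = {"leap": "jump"}

def normalise(tokens):
    """Lowercase, drop stop words, stem, then merge synonyms."""
    terms = []
    for token in tokens:
        term = token.lower()              # 1. lowercase
        if term in STOP_WORDS:            # 2. remove stop words
            continue
        term = STEMS.get(term, term)      # 3. reduce to the root form
        term = SYNONYMS.get(term, term)   # 4. merge synonyms
        terms.append(term)
    return terms

first = normalise("The quick brown fox jumped over the lazy dog".split())
second = normalise("Quick brown foxes leap over lazy dogs in summer".split())
print(first)
print(second)
```

After normalisation the two sentences share almost all of their terms, so a search for either "fox" or "jump" would find both.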

Analysis

Is: Tokenization + Normalisation

Analyzers

Are: Tokenizer + Token Filters

For example, the Standard Analyzer is made up of:

  • Standard tokenizer (The, quick, brown, foxes...)
  • Lowercase filter (the, quick, brown, foxes...)
  • Stopwords filter (quick, brown, foxes...)

The English Analyzer has everything the 'Standard Analyzer' has, plus:

  • "English stopwords" (quick, brown, foxes...)
  • "English stemmer" (quick, brown, fox...)
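
To make "analyzer = tokenizer + token filters" concrete, here is a small Python sketch that chains one tokenizer with a list of filters. It only mirrors the description in these notes: `make_analyzer`, `standard_like`, `english_like` and the crude plural-stripping `toy_stemmer` are hypothetical names and logic, not the real Elasticsearch implementations.

```python
import re

def standard_tokenizer(text):
    """Split text into word tokens."""
    return re.findall(r"\w+", text)

def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def stop_filter(tokens, stop_words=frozenset({"the"})):
    """Drop stop words (a one-word list, just for the example)."""
    return [t for t in tokens if t not in stop_words]

def toy_stemmer(tokens):
    """Very crude stemming: strip a trailing "es" or "s"."""
    stemmed = []
    for t in tokens:
        if t.endswith("es"):
            t = t[:-2]
        elif t.endswith("s"):
            t = t[:-1]
        stemmed.append(t)
    return stemmed

def make_analyzer(tokenizer, *token_filters):
    """Build an analyzer: run the tokenizer, then each filter in order."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in token_filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze

standard_like = make_analyzer(standard_tokenizer, lowercase_filter, stop_filter)
english_like = make_analyzer(standard_tokenizer, lowercase_filter, stop_filter, toy_stemmer)

print(standard_like("The quick brown foxes"))
print(english_like("The quick brown foxes"))
```

The order of the filters matters: the stop word filter only works here because the lowercase filter has already turned "The" into "the".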

Testing this...

Now, when we run our search (e.g. GET /_search?q=+Quick +foxes), we still don't get anything. This is because the search query needs to be normalised too, not just the indexed text. Once we search for GET /_search?q=+quick +foxes, we get what we need.
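
The point about normalising the query as well can be shown with a toy index in Python. This is a model of the idea only, assuming a lowercase-only `analyze` step and a made-up `search` helper that requires every term to be present (like `+term +term`); it is not how Elasticsearch is implemented.

```python
docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

def analyze(text):
    """Toy analysis: split on whitespace and lowercase."""
    return [t.lower() for t in text.split()]

# Build the index from analyzed (lowercased) terms.
index = {}
for doc_id, text in docs.items():
    for term in analyze(text):
        index.setdefault(term, set()).add(doc_id)

def search(terms):
    """Return the ids of documents that contain every given term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

raw_hits = search(["Quick", "foxes"])                # query terms used as typed
normalised_hits = search(analyze("Quick foxes"))     # query run through the same analysis

print(raw_hits)
print(normalised_hits)
```

The raw query finds nothing because "Quick" is not a term in the lowercased index, while the normalised query matches the second sentence.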

Exact vs Full Text

To Analyze or not to Analyze

If we want a field to be matched exactly, we should set it to be 'not_analyzed':

{ "nickname": {"type": "string", "index": "not_analyzed" } }

If we want the perks of full text search, we can set it to be 'analyzed':

{ "tweet": {"type": "string", "index": "analyzed" } }

If we want the information stored, but simply not indexed (i.e. not searchable), we can set 'index' to 'no':

{ "type": "string", "index": "no" }
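
One way to picture the three "index" settings is to look at which terms would end up in the index for a given value. The Python below is a sketch under that assumption: `index_terms` is a hypothetical helper, and its "analysis" is just lowercasing and splitting on whitespace.

```python
def index_terms(value, mode):
    """Return the terms a toy index would store for a field value.

    `mode` mirrors the "index" setting used in the mappings in these notes.
    """
    if mode == "not_analyzed":
        return [value]                  # whole value kept as one exact term
    if mode == "analyzed":
        return value.lower().split()    # toy analysis: lowercase + split
    if mode == "no":
        return []                       # nothing indexed, so not searchable
    raise ValueError("unknown index mode: " + mode)

print(index_terms("Quick Brown Fox", "not_analyzed"))
print(index_terms("Quick Brown Fox", "analyzed"))
print(index_terms("Quick Brown Fox", "no"))
```

With 'not_analyzed' only the exact original string matches, with 'analyzed' each word is searchable on its own, and with 'no' there is nothing to match against.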

Types of Analyzers

If we know a certain string will be English, we can set the type of analyzer. This will be used as both the search and index analyzer:

{ "tweet": {"type": "string", "analyzer": "english" } }

This implies that the tweet is analyzed.

Elasticsearch 101 - Querying

Building Queries

GET /_search?q=STRING is not the recommended way to search.

We should pass in proper full body searches like:

  • Find all documents:
GET /_search
    {
      "query": {
        "match_all": {}
      },
      "from": 0,
      "size": 10
    }
  • More realistically, something like this: find all documents containing "car" in the "tweet" field:
GET /_search
    {
      "query": {
        "match": { "tweet": "car" }
      },
      "from": 0,
      "size": 10
    }
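
If the request body is assembled in code rather than typed by hand, a small helper keeps the structure in one place. The sketch below only builds and prints the JSON body shown above; `match_query` is a made-up helper name, and actually sending the body to a cluster is left out.

```python
import json

def match_query(field, text, start=0, size=10):
    """Build a request body for a simple match query with paging."""
    return {
        "query": {"match": {field: text}},
        "from": start,   # offset of the first hit to return
        "size": size,    # number of hits to return
    }

body = match_query("tweet", "car")
print(json.dumps(body, indent=2))
```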

Filters vs. Queries

Filters

  • Exact matching
  • Binary yes/no
  • Fast
  • Cacheable

Queries

  • Full text search
  • Relevance scoring
  • Heavier (i.e. more taxing performance wise)
  • Not cacheable

You can use either, or both.
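
The difference can be sketched in plain Python: a filter answers yes or no, while a query produces a score used for ranking. All names and the sample tweets are invented for the illustration, and counting word occurrences is only a stand-in assumption for real relevance scoring.

```python
# Made-up sample data.
tweets = [
    {"nick": "@mary", "tweet": "search engines make search easy"},
    {"nick": "@john", "tweet": "full text search is fun"},
    {"nick": "@mary", "tweet": "going to the beach"},
]

def term_filter(doc, field, value):
    """Filter: exact match, answers only yes or no."""
    return doc[field] == value

def match_score(doc, field, word):
    """Query: toy relevance score, here just the number of occurrences."""
    return doc[field].lower().split().count(word.lower())

def filtered_search(docs, word, nick):
    """Use both: apply the filter first, then score and rank what is left."""
    candidates = [d for d in docs if term_filter(d, "nick", nick)]
    scored = [(match_score(d, "tweet", word), d) for d in candidates]
    hits = [(score, d) for score, d in scored if score > 0]
    return sorted(hits, key=lambda hit: hit[0], reverse=True)

results = filtered_search(tweets, "search", "@mary")
for score, doc in results:
    print(score, doc["tweet"])
```

Filtering first means the more expensive scoring step only runs on documents that already passed the cheap yes/no test.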

Querying and Filtering

To use both, wrap the query and the filter in a "filtered" property within the query, as per below:

GET /_search
    {
      "query": {
        "filtered": {
          "query": {
            "match": { "tweet": "search" }
          },
          "filter": {
            "term": { "nick": "@mary" }
          }
        }
      }
    }

Just Filtering

GET /_search
    {
      "query": {
        "filtered": {
          "query": {
            "match_all": {}
          },
          "filter": {
            "term": { "nick": "@mary" }
          }
        }
      }
    }

which is the same as:

GET /_search
    {
      "query": {
        "filtered": {
          "filter": {
            "term": { "nick": "@mary" }
          }
        }
      }
    }

You could also specify a sorting mechanism:

GET /_search
    {
      "query": {
        "filtered": {
          "filter": {
            "term": { "nick": "@mary" }
          }
        }
      },
      "sort": {"date":"desc"}
    }

There are different types of filters. For example, the range filter can return results within the month of May:

GET /_search
    {
      "query": {
        "filtered": {
          "filter": {
            "range": {
              "date": {
                "gte": "2016-05-01",
                "lte": "2016-05-31"
              }
            }
          }
        }
      },
      "sort": {"date":"desc"}
    }
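
What the range filter plus the sort does can be mimicked on a plain Python list. This is a sketch with invented sample data and a hypothetical `range_filter` helper; both bounds are inclusive, matching "gte" and "lte".

```python
from datetime import date

# Made-up sample data.
tweets = [
    {"nick": "@mary", "date": date(2016, 4, 30)},
    {"nick": "@john", "date": date(2016, 5, 1)},
    {"nick": "@mary", "date": date(2016, 5, 17)},
    {"nick": "@john", "date": date(2016, 6, 1)},
]

def range_filter(docs, field, gte, lte):
    """Keep documents whose field value lies in [gte, lte], both ends included."""
    return [d for d in docs if gte <= d[field] <= lte]

in_may = range_filter(tweets, "date", date(2016, 5, 1), date(2016, 5, 31))

# Equivalent of "sort": {"date": "desc"}: newest first.
in_may_sorted = sorted(in_may, key=lambda d: d["date"], reverse=True)

for doc in in_may_sorted:
    print(doc["date"], doc["nick"])
```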