## elasticsearch.txt
Analyzer
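  1. character filter => adds, removes or transforms characters in the raw text, e.g. strips HTML tags from the document. An analyzer has zero or more.
  2. tokenizer => splits the text into terms. An analyzer has exactly one.
  3. token filter => adds, removes or changes terms, e.g. uppercase, synonyms, stop words. An analyzer has zero or more.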
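
  The three stages can be sketched in plain Python. This is only an illustration of the order of operations, not how Elasticsearch implements it; the function names and the regexes are assumptions made up for this note.

```python
import html
import re

def char_filter(text: str) -> str:
    """Character filter stage: strip HTML tags, then decode entities."""
    return html.unescape(re.sub(r"<[^>]+>", "", text))

def tokenizer(text: str) -> list[str]:
    """Tokenizer stage: keep runs of letters, digits and apostrophes as terms."""
    return re.findall(r"[A-Za-z0-9']+", text)

def lowercase_filter(tokens: list[str]) -> list[str]:
    """Token filter stage: lowercase every term."""
    return [t.lower() for t in tokens]

def analyze(text: str, token_filters=(lowercase_filter,)) -> list[str]:
    # Order matters: character filters, then exactly one tokenizer, then token filters.
    tokens = tokenizer(char_filter(text))
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

result = analyze("<b>I'm</b> in the mood for Red Wine!")
print(result)
```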

Analyze API

  POST _analyze
  {
    "tokenizer": "standard",
    "text": "I'm in the mood for drinking semi-dry red wine!"
  }

  POST _analyze
  {
    "filter": ["lowercase"],
    "char_filter": ["html_strip"],
    "text": "I'm in the mood for drinking semi-dry red wine!"
  }

  POST _analyze
  {
    "analyzer": "standard",
    "text": "I'm in the mood for drinking semi-dry red wine!"
  }
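
  The same request can be prepared from Python with only the standard library. A rough sketch: the helper name and the base URL http://localhost:9200 are assumptions (an unsecured local node), and the request is only built here, not sent.

```python
import json
import urllib.request

def build_analyze_request(body: dict, base_url: str = "http://localhost:9200"):
    """Build (but do not send) a POST request for the _analyze endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_analyze_request({
    "analyzer": "standard",
    "text": "I'm in the mood for drinking semi-dry red wine!",
})

# To actually send it (needs a running cluster):
# with urllib.request.urlopen(request) as response:
#     tokens = json.load(response)["tokens"]
```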

Inverted index
  The terms produced by analysis are stored in an inverted index, which maps each term to the documents that contain it.
  Each text field has its own inverted index. An index with two full-text fields therefore has two inverted indices,
  one per field.

  Below is the inverted index for the field title:

  Term        Document #1   Document #2

  best        ✓
  carbonara                 ✓
  delicious   ✓
  pasta       ✓             ✓
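
  A minimal sketch of how such a term-to-documents mapping can be built. The two title strings are made up to match the table above, and the "analysis" is just lowercasing and splitting on whitespace.

```python
from collections import defaultdict

def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)

# Hypothetical values of the title field for two documents.
titles = {1: "Best delicious pasta", 2: "Pasta carbonara"}
index = build_inverted_index(titles)

for term in sorted(index):
    print(f"{term:<10} {sorted(index[term])}")
```

  Looking up a term is then a single dictionary access, e.g. index["pasta"] gives the ids of both documents.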

Tokenizer
 1. Word tokenizer eg. standard tokenizer
    letter tokenizer => divides text into terms whenever it encounters a character that is not a letter
    lowercase tokenizer => like the letter tokenizer, but also lowercases all terms
    whitespace tokenizer => divides text into terms whenever it encounters whitespace
    uax_url_email tokenizer => like the standard tokenizer, but treats URLs and email addresses as single tokens

 2. Partial word tokenizer
    Breaks text or words into small fragments. Used for partial word matching.

    ngram tokenizer => breaks text into words when encountering certain characters, then emits N-grams of the specified lengths from each word.
    "Red Wine" (with min_gram 2, max_gram 4, token_chars letter)
    [Re, Red, ed, Wi, Win, Wine, in, ine, ne]

    edge_ngram tokenizer => breaks text into words when encountering certain characters, then emits only the N-grams anchored at the start of each word.
    "Red Wine" (same settings)
    [Re, Red, Wi, Win, Wine]
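
    The two "Red Wine" examples can be reproduced with a short sketch. It assumes words are runs of ASCII letters and gram lengths from 2 to 4; the function names are made up for this note and are not Elasticsearch APIs.

```python
import re

def ngrams(word: str, min_gram: int, max_gram: int) -> list[str]:
    """All substrings of length min_gram..max_gram, ordered by start position."""
    grams = []
    for start in range(len(word)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(word):
                grams.append(word[start:start + size])
    return grams

def edge_ngrams(word: str, min_gram: int, max_gram: int) -> list[str]:
    """Only the prefixes of length min_gram..max_gram."""
    longest = min(max_gram, len(word))
    return [word[:size] for size in range(min_gram, longest + 1)]

def tokenize(text: str, gram_fn, min_gram: int = 2, max_gram: int = 4) -> list[str]:
    # Split into words first, then emit grams for each word separately.
    words = re.findall(r"[A-Za-z]+", text)
    return [g for w in words for g in gram_fn(w, min_gram, max_gram)]

print(tokenize("Red Wine", ngrams))
print(tokenize("Red Wine", edge_ngrams))
```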

 3. Structured text tokenizer
    Used for structured text such as email addresses, zip codes, identifiers etc.

    keyword tokenizer => emits the entire input as a single term
    pattern tokenizer => splits text into terms wherever a regular expression matches
    path_hierarchy tokenizer => splits hierarchical values (eg. file system paths) and emits a term for each level of the tree
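
    A small sketch of the path tokenizer idea: one term per level of the tree, each term being the path up to that level. The function name and the sample path are assumptions for illustration.

```python
def path_hierarchy(path: str, delimiter: str = "/") -> list[str]:
    """Emit one term per level: /a, /a/b, /a/b/c for the input /a/b/c."""
    parts = [p for p in path.split(delimiter) if p]
    # Keep the leading delimiter if the input had one.
    prefix = delimiter if path.startswith(delimiter) else ""
    return [prefix + delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]

terms = path_hierarchy("/usr/local/bin")
print(terms)
```

    Indexing every ancestor path this way lets a query for /usr/local match everything stored beneath it.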

Token filters
   standard token filters
   lowercase filter
   uppercase filter
   n-gram filter
   edge-ngram token filter
   stop filter
   word_delimiter filter => splits words into subwords and performs transformations on subword groups.
   [Wi-Fi, PowerShell] => [Wi, Fi, Power, Shell]
   stemmer token filter
   keyword marker token filter (keyword_marker)
     => protects words from being modified by stemmers
   snowball token filter
   synonym filter
   trim filter
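
   The word_delimiter example can be approximated with one regular expression that splits on non-alphanumeric characters and on lower-to-upper case changes. This is a simplified sketch; the real filter has many more options, and the names below are made up for this note.

```python
import re

# A subword is: an optional capital followed by lowercase letters,
# or a run of capitals (acronym), or a run of digits.
_SUBWORD = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+")

def word_delimiter(tokens: list[str]) -> list[str]:
    """Split each token into subwords at delimiters and case changes."""
    subwords = []
    for token in tokens:
        subwords.extend(_SUBWORD.findall(token))
    return subwords

parts = word_delimiter(["Wi-Fi", "PowerShell"])
print(parts)
```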

Built-in analyzers
   standard
   whitespace
   simple
   keyword
   stop
   pattern
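
   To see how three of these differ, here is a rough Python approximation: whitespace splits on whitespace only, simple splits on non-letters and lowercases, keyword keeps the input as one term. The function names are made up, and the letter matching is ASCII-only for brevity.

```python
import re

def whitespace_analyzer(text: str) -> list[str]:
    """Split on whitespace, leave terms otherwise untouched."""
    return text.split()

def simple_analyzer(text: str) -> list[str]:
    """Split on anything that is not a letter, then lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

def keyword_analyzer(text: str) -> list[str]:
    """Emit the whole input as a single term."""
    return [text]

sample = "I'm in the mood for drinking semi-dry red wine!"
for analyzer in (whitespace_analyzer, simple_analyzer, keyword_analyzer):
    print(analyzer.__name__, analyzer(sample))
```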