@Tapan
Last active January 25, 2020 22:42
Elasticsearch
Analyzer
An analyzer processes text in three stages:
1. character filter => adds, removes, or transforms characters before tokenization, e.g. strips HTML tags from the document.
2. tokenizer => splits the text into terms.
3. token filter => zero or more filters applied to the terms in order, e.g. uppercase, synonyms, stop words.
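To make the three stages concrete, here is a minimal pure-Python sketch of the pipeline. It does not use Elasticsearch: the function names, the regex-based HTML stripping, and the tiny stop-word set are all assumptions made for illustration, and the real html_strip and stop filters behave more carefully.

```python
import re
from html import unescape

# Hypothetical stop-word set, far smaller than any real stop list.
STOP_WORDS = {"in", "the", "for"}

def html_strip(text):
    """Character filter: drop tags and decode entities (rough stand-in for html_strip)."""
    return unescape(re.sub(r"<[^>]+>", "", text))

def whitespace_tokenizer(text):
    """Tokenizer: split the text into terms on whitespace."""
    return text.split()

def lowercase_filter(tokens):
    """Token filter: lowercase every term."""
    return [token.lower() for token in tokens]

def stop_filter(tokens):
    """Token filter: drop terms found in STOP_WORDS."""
    return [token for token in tokens if token not in STOP_WORDS]

def analyze(text, char_filters, tokenizer, token_filters):
    """Run character filters, then one tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

terms = analyze(
    "<p>I'm in the mood for <b>Red Wine</b>!</p>",
    char_filters=[html_strip],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase_filter, stop_filter],
)
print(terms)
```

The order matters: the stop filter runs after the lowercase filter here, so it only needs lowercase entries.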
Analyze API
POST _analyze
{
"tokenizer": "standard",
"text": "I'm in the mood for drinking semi-dry red wine!"
}
POST _analyze
{
"filter": ["lowercase"],
"char_filter": ["html_strip"],
"text": "I'm in the mood for drinking semi-dry red wine!"
}
POST _analyze
{
"analyzer": "standard",
"text": "I'm in the mood for drinking semi-dry red wine!"
}
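The same requests can be sent from a script. The sketch below uses only the Python standard library and assumes an unsecured cluster at http://localhost:9200; the helper names build_analyze_request and analyze are made up for this example, and a real setup may need authentication or TLS.

```python
import json
import urllib.request

# Assumption: a local cluster without authentication.
ES_URL = "http://localhost:9200"

def build_analyze_request(body, base_url=ES_URL):
    """Build (but do not send) a POST request for the _analyze endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def analyze(body, base_url=ES_URL):
    """Send the request and return only the term strings from the response."""
    with urllib.request.urlopen(build_analyze_request(body, base_url)) as response:
        payload = json.load(response)
    # The response holds a "tokens" list; each entry carries the term under "token".
    return [entry["token"] for entry in payload["tokens"]]
```

With a cluster running, analyze({"analyzer": "standard", "text": "I'm in the mood for drinking semi-dry red wine!"}) returns the terms as a plain list.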
Inverted index
The result of analysis is stored in an inverted index.
Each text field has its own inverted index. An index with two full-text fields has two inverted indices,
one per field.
Below is the inverted index for the field title:
Term Document #1 Document #2
best tick mark
carbonara tick mark
delicious tick mark
pasta tick mark tick mark
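A minimal sketch of how such a term-to-documents mapping can be built. The two titles are invented for this example (they are not the documents behind the table above), and analysis is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Stand-in for a real analyzer: lowercase, then split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in sorted(index.items())}

# Hypothetical values of the title field for two documents.
titles = {1: "Best pasta", 2: "Delicious pasta carbonara"}
inverted_index = build_inverted_index(titles)

for term, doc_ids in inverted_index.items():
    print(term, doc_ids)
```

Looking up a term is then a dictionary access, which is why searching for a term does not require scanning every document.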
Tokenizer
1. Word tokenizers, e.g. the standard tokenizer
   letter tokenizer => divides text into terms whenever it encounters a character that is not a letter
   lowercase tokenizer => like the letter tokenizer, but also lowercases all terms
   whitespace tokenizer => divides text into terms whenever it encounters whitespace
   uax_url_email tokenizer => like the standard tokenizer, but treats URLs and email addresses as single tokens
2. Partial word tokenizers
   Break up text or words into small fragments. Used for partial word matching.
   ngram tokenizer => breaks text into words when encountering certain characters and then emits N-grams of the specified length.
     "Red Wine" (N-grams of length 2 to 4)
     [Re, Red, ed, Wi, Win, Wine, in, ine, ne]
   edge_ngram tokenizer => breaks text into words when encountering certain characters and then emits N-grams of each word, anchored to the start of the word.
     "Red Wine" (N-grams of length 2 to 4)
     [Re, Red, Wi, Win, Wine]
3. Structured text tokenizers
   Used for structured text such as email addresses, zip codes, identifiers, etc.
   keyword tokenizer
   pattern tokenizer
   path_hierarchy tokenizer => splits hierarchical values (e.g. file system paths) and emits a term for each component in the tree
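The partial word and path tokenizers are easy to imitate in a few lines. The following is a pure-Python sketch, not the Elasticsearch implementation: the function names are invented, words are taken to be runs of ASCII letters, and the gram lengths 2 to 4 are an assumption chosen to match the "Red Wine" example.

```python
import re

def ngram_tokenize(text, min_gram=2, max_gram=4):
    """Emit every substring of length min_gram..max_gram from each word."""
    tokens = []
    for word in re.findall(r"[A-Za-z]+", text):
        for start in range(len(word)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(word):
                    tokens.append(word[start:start + size])
    return tokens

def edge_ngram_tokenize(text, min_gram=2, max_gram=4):
    """Emit only the N-grams that begin at the start of each word."""
    tokens = []
    for word in re.findall(r"[A-Za-z]+", text):
        for size in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:size])
    return tokens

def path_hierarchy_tokenize(path, delimiter="/"):
    """Emit one term per level of a hierarchical value such as a file path."""
    tokens = []
    position = path.find(delimiter, 1)
    while position != -1:
        tokens.append(path[:position])
        position = path.find(delimiter, position + 1)
    tokens.append(path)
    return tokens

print(ngram_tokenize("Red Wine"))
print(edge_ngram_tokenize("Red Wine"))
print(path_hierarchy_tokenize("/usr/local/bin"))
```

Neither tokenizer changes case here; lowercasing would be done by a separate token filter.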
Token filters
standard token filter
lowercase filter
uppercase filter
n-gram filter
edge n-gram token filter
stop filter
word_delimiter filter => splits words into subwords and performs optional transformations on subword groups.
  [Wi-Fi, PowerShell] => [Wi, Fi, Power, Shell]
stemmer token filter
keyword marker token filter (keyword_marker) => protects words from being modified by stemmers
snowball token filter
synonym filter
trim filter
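The subword splitting done by the word delimiter filter can be approximated with one regular expression. This is a rough sketch under simplifying assumptions (ASCII only, split on non-alphanumeric characters, case changes, and digit runs); the real filter has many more options, and the names below are made up.

```python
import re

# A subword is a capitalized or lowercase run, an all-caps run, or a digit run.
_SUBWORD = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+")

def split_subwords(token):
    """Split one token on delimiters such as '-' and on lower-to-upper case changes."""
    return _SUBWORD.findall(token)

def word_delimiter(tokens):
    """Token filter: replace each token with its subwords."""
    return [part for token in tokens for part in split_subwords(token)]

print(word_delimiter(["Wi-Fi", "PowerShell"]))
```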
Built-in analyzers
standard
whitespace
simple
keyword
stop
pattern
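To compare a few of these, the sketch below imitates the whitespace, simple, keyword, and stop analyzers in plain Python. It is an approximation for illustration only: the function names are invented, and STOP_WORDS is a small hand-picked subset rather than the full default English stop list.

```python
import re

# Assumption: a small subset of English stop words, not the full default list.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to"}

def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()

def simple_analyzer(text):
    """Split on anything that is not a letter, then lowercase."""
    return [term.lower() for term in re.findall(r"[^\W\d_]+", text)]

def keyword_analyzer(text):
    """Emit the whole input as a single term."""
    return [text]

def stop_analyzer(text):
    """Behave like the simple analyzer, then drop stop words."""
    return [term for term in simple_analyzer(text) if term not in STOP_WORDS]

sentence = "I'm in the mood for drinking semi-dry red wine!"
for analyzer in (whitespace_analyzer, simple_analyzer, keyword_analyzer, stop_analyzer):
    print(analyzer.__name__, analyzer(sentence))
```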