“The quick brown fox jumped over the lazy dog”
“Quick brown foxes leap over lazy dogs in summer”
Inverted index:
- Separate words and terms (Tokenization)
- Sort unique terms
- List documents containing terms.
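The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Elasticsearch or Lucene implement it; the document ids (1 and 2) and the regex-based `tokenize` helper are assumptions made for the example.

```python
import re
from collections import defaultdict

# The two example sentences from these notes, keyed by made-up document ids.
DOCS = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

def tokenize(text):
    """Split text into word tokens, with no normalisation yet."""
    return re.findall(r"\w+", text)

def build_inverted_index(docs):
    """Map each unique term to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    # Sort the unique terms, then list the documents for each one.
    return {term: sorted(postings[term]) for term in sorted(postings)}

index = build_inverted_index(DOCS)
for term, doc_ids in index.items():
    print(f"{term:<8} {doc_ids}")
```

Without normalisation, "Quick" and "quick" (and "fox" and "foxes") end up as separate terms, each pointing at only one document.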
Normalised Index:
- Reduce everything to lowercase.
- Remove stop words (e.g. "the")
- Stem words to their root form (e.g. foxes and fox have the same stem)
- Draw on synonyms (e.g. jump and leap can be merged into jump)
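A minimal Python sketch of these four normalisation steps follows. The stop word, stem and synonym tables are tiny hand-written stand-ins chosen to cover the two example sentences; a real stemmer or synonym filter works from rules and larger dictionaries.

```python
import re

# Toy lookup tables (assumptions for illustration only).
STOP_WORDS = {"the", "in"}
STEMS = {"foxes": "fox", "dogs": "dog", "jumped": "jump"}
SYNONYMS = {"leap": "jump"}

def normalise(text):
    """Lowercase, drop stop words, stem, then merge synonyms."""
    terms = []
    for token in re.findall(r"\w+", text):
        term = token.lower()              # reduce everything to lowercase
        if term in STOP_WORDS:            # remove stop words
            continue
        term = STEMS.get(term, term)      # stem words to their root form
        term = SYNONYMS.get(term, term)   # merge synonyms into one term
        terms.append(term)
    return terms

print(normalise("The quick brown fox jumped over the lazy dog"))
print(normalise("Quick brown foxes leap over lazy dogs in summer"))
```

After normalisation both sentences share the terms quick, brown, fox, jump, over, lazy and dog, so a search for any of them can match both documents.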
Analysis
Is: Tokenization + Normalisation
Analyzers
Are: Tokenizer + Token Filters
E.g. Standard Analyzer is composed of:
- Standard tokenizer (The, quick, brown, foxes...)
- Lowercase filter (the, quick, brown, foxes...)
- Stopwords filter (quick, brown, foxes...)
E.g. English Analyzer has everything the 'Standard Analyzer' has plus:
- "English stemmer" (quick, brown, foxes → fox...)
- "English stopwords" ("the" removed: quick, brown, fox)
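The "tokenizer + token filters" structure can be sketched as a small pipeline in Python. All function names here are hypothetical stand-ins, not Elasticsearch APIs, and the stop word list and stemmer table are deliberately tiny assumptions.

```python
import re

def standard_tokenizer(text):
    """Stand-in tokenizer: split on word boundaries."""
    return re.findall(r"\w+", text)

def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def make_stop_filter(stop_words):
    """Return a filter that drops the given stop words."""
    def stop_filter(tokens):
        return [t for t in tokens if t not in stop_words]
    return stop_filter

def toy_english_stemmer(tokens):
    """Stand-in stemmer backed by a lookup table, not a real algorithm."""
    stems = {"foxes": "fox", "dogs": "dog"}
    return [stems.get(t, t) for t in tokens]

def make_analyzer(tokenizer, filters):
    """An analyzer is a tokenizer followed by token filters, applied in order."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze

stop_filter = make_stop_filter({"the"})
standard = make_analyzer(standard_tokenizer, [lowercase_filter, stop_filter])
english = make_analyzer(
    standard_tokenizer, [lowercase_filter, stop_filter, toy_english_stemmer]
)

print(standard("The quick brown foxes"))  # ['quick', 'brown', 'foxes']
print(english("The quick brown foxes"))   # ['quick', 'brown', 'fox']
```

Building the analyzer from a list of filters mirrors the notes: the English variant is the standard chain with one more filter appended.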
Now, if we run our search with the raw query (e.g. GET /_search?q=+Quick +foxes), we still don't get anything. This is because the index now only holds normalised terms (quick, fox), so we also need to normalise our search query (and not just our index). Once we search for GET /_search?q=+quick +fox we get what we need.
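The effect can be reproduced with a toy index in Python. This is a sketch under simplifying assumptions: `analyze` is a stand-in for the index analyzer with hand-written stop word and stem tables, and `search_all_terms` mimics a query where every term is required (the `+` prefix).

```python
import re

DOCS = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}
STOP_WORDS = {"the", "in"}                # toy stop word list
STEMS = {"foxes": "fox", "dogs": "dog"}   # toy stemmer table

def analyze(text):
    """Lowercase, drop stop words and stem; used for documents and queries."""
    terms = []
    for token in re.findall(r"\w+", text):
        term = token.lower()
        if term in STOP_WORDS:
            continue
        terms.append(STEMS.get(term, term))
    return terms

def build_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def search_all_terms(index, terms):
    """Return the ids of documents that contain every given term."""
    if not terms:
        return set()
    hits = None
    for term in terms:
        ids = index.get(term, set())
        hits = ids if hits is None else hits & ids
    return hits

index = build_index(DOCS)
raw_hits = search_all_terms(index, ["Quick", "foxes"])             # query not normalised
normalised_hits = search_all_terms(index, analyze("Quick foxes"))  # same analysis as the index
print(raw_hits)         # set()
print(normalised_hits)  # {1, 2}
```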
If we want a field to be matched exactly, we should set it to be 'not_analyzed':
{ "tweet": {"type": "string", "index": "not_analyzed" } }
If we want the perks of full text search, we can set it to be 'analyzed':
{ "nickname": {"type": "string", "index": "analyzed" } }
If we want the information stored, but simply not indexed (i.e. not searchable), we can set 'index' to 'no':
{ "type": "string", "index": "no" }
If we know a certain string will be English, we can set the analyzer for it. This will be used as both the search analyzer and the index analyzer:
{ "tweet": {"type": "string", "analyzer": "english" } }
This implies that the tweet is analyzed.
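To see the options side by side, the field settings from these notes can be assembled into one "properties" block with a short Python script. The field names (exact_value, full_text, stored_only, english_text) and the `string_field` helper are hypothetical, and the syntax is the legacy "string" type used throughout these notes; newer Elasticsearch versions use "text" and "keyword" types instead.

```python
import json

def string_field(index=None, analyzer=None):
    """Build one legacy "string" field mapping, omitting unset options."""
    field = {"type": "string"}
    if index is not None:
        field["index"] = index
    if analyzer is not None:
        field["analyzer"] = analyzer
    return field

# Hypothetical field names, one per option discussed above.
mapping = {
    "properties": {
        "exact_value": string_field(index="not_analyzed"),  # matched exactly
        "full_text": string_field(index="analyzed"),        # full text search
        "stored_only": string_field(index="no"),            # not searchable
        "english_text": string_field(analyzer="english"),   # analyzed as English
    }
}

print(json.dumps(mapping, indent=2))
```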