#ElasticSearch
#Features
- ES is based on Apache Lucene
- Written in Java
- Focused on scalability and distributed from the ground up
- Designed to take, analyze and search data from any source
- JSON-Schema
- Communication is done through an HTTP REST API:
- curl -X <REST Verb> <Node>:<Port>/<Index>/<Type>/<ID>
- curl -X GET http://uvinum.marcos.vm:9200/person/employee/123
#Terminology
RealTime Engine
: A change to an index is propagated through the entire cluster. For large clusters the delay is about 1 second.

Cluster
: Collection of Nodes. A cluster consists of one or more Nodes, depending on the scale. It provides indexing and searching capabilities across all nodes. By default it is named "elasticsearch".

Node
: A single server taking part in a cluster. It stores the searchable data: all of it if there is only one node, otherwise part of it. By default it joins the Cluster named "elasticsearch".

Index
: Collection of Documents (of the same or different Types). Identified by a lowercase name, which is used when indexing, searching, updating and deleting documents. Can have one or more Types defined and as many Documents as you want.

Type
: Class/category of similar Documents, e.g. product, account, user. Consists of a name and a Mapping (the mapping does not need to be explicitly defined). Stored in the metadata field _type (searching by type implies filtering by this field).

Mapping
: Describes the fields of a Document of a given Type and their data types (string, integer...). Also includes HOW fields should be indexed and stored. If no mapping is defined, Dynamic Mapping defines it based on the fields it finds.

Document
: Basic unit of information that can be indexed. Made of key-value fields. Can be a single User, Order, Product... JSON based.
Similarity with a Relational Database:
Index -> Database, Type -> Table, Document -> Row
Shards
: Pieces into which Indexes can be divided. Each Shard is a fully functional and independent index. The number of Shards can be specified when creating an Index (default is 5). Shards allow horizontal scaling: when an Index contains a lot of data and its Node does not have enough storage space, another Shard can be created on a different Node to provide the necessary space. They also allow distribution and parallelization, which improves performance.

Replicas
: A copy of a Shard (default is 1). A replica never resides on the same Node as its original Shard. Provides high availability when a Shard or Node fails and allows scaling search volume, because search queries can be executed on all replicas in parallel.
So by default each Index has 5 Shards and 1 Replica for each of its Shards (5 Replicas).
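The arithmetic behind those defaults can be checked with a small sketch. The function name `total_shards` and its parameter names are made up for this illustration; only the default values (5 and 1) come from the notes above.

```python
def total_shards(primary_shards: int = 5, replicas_per_shard: int = 1) -> dict:
    """Count the shards an index occupies across the cluster.

    Hypothetical helper: the defaults mirror the ones quoted in these notes.
    """
    # Every primary shard gets `replicas_per_shard` copies.
    replica_shards = primary_shards * replicas_per_shard
    return {
        "primary": primary_shards,
        "replica": replica_shards,
        "total": primary_shards + replica_shards,
    }


print(total_shards())      # defaults: 5 primaries, 1 replica each
print(total_shards(3, 2))  # 3 primaries, 2 replicas each
```

With the defaults, an index occupies 10 shards in total, so a single-node cluster cannot place the replicas, since a replica never lives on the same Node as its original Shard.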
#Basic Requests
We should use an HTTP Client to perform the requests. (Ex: Postman)
Request: PUT "http://127.0.0.1:9200/ecommerce"
Result: Creates an index ecommerce
Request: PUT "http://127.0.0.1:9200/ecommerce/product/1001"
Body:
{
"name": "JellyTime",
"price": "100.00",
"description": "Peanut Butter JellyTime !",
"status": "active",
"quantity": 1,
"categories":[
{"name": "Toys"}
],
"tags": ["jellytime", "toys", "yellow"]
}
Result: Creates a Document in the index ecommerce, of the type product, with id 1001
We can replace the Document by sending another PUT request to the same URL with a changed body. The body must contain all the fields, because it replaces the whole Document.
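If you prefer a script over Postman, the PUT request above could be built roughly like this with Python's standard library. This is only a sketch: `build_index_request` is a hypothetical helper, and the request is built but not sent, so it runs without a server.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build (but do not send) the PUT request that indexes one Document.

    Hypothetical helper; the URL layout follows the
    <Node>:<Port>/<Index>/<Type>/<ID> pattern used in these notes.
    """
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_index_request(
    "http://127.0.0.1:9200",
    "ecommerce",
    "product",
    1001,
    {"name": "JellyTime", "price": "100.00", "status": "active", "quantity": 1},
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would actually send it to a running node.
```

Sending the same request again with a different body replaces the whole Document, as described above.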
Request: POST "http://127.0.0.1:9200/ecommerce/product/1001/_update"
Body:
{
"doc":{
"price": 50.00
}
}
Result: It will update ONLY the specified fields of the current Document.
Request: DELETE "http://127.0.0.1:9200/ecommerce"
Result: Deletes the index ecommerce
Request: DELETE "http://127.0.0.1:9200/ecommerce/product/1001"
Result: Deletes the Document 1001 of the type product inside the index ecommerce
Request: GET "http://127.0.0.1:9200/_cat/indices?v"
Result: Shows information about the indices in our Cluster
By default if not specified, ES creates Documents with the data provided
Request: POST "http://127.0.0.1:9200/ecommerce/product/_bulk"
Body:
{"index":{"_id":"1"}}
{
"name":"Stainless Steel Cleaner Vision",
"price":"108.11",
"description":"Nullam orci pede, venenatis non, sodales sed, tincidunt eu, felis. Fusce posuere felis sed lacus. Morbi sem mauris, laoreet ut, rhoncus aliquet, pulvinar sed, nisl. Nunc rhoncus dui vel sem. Sed sagittis. Nam congue, risus semper porta volutpat, quam pede lobortis ligula, sit amet eleifend pede libero quis orci. Nullam molestie nibh in lectus. Pellentesque at nulla. Suspendisse potenti. Cras in purus eu magna vulputate luctus. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Vivamus vestibulum sagittis sapien. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Etiam vel augue.",
"status":"active",
"quantity":58,
"categories":[
{
"name":"Electronics"
},
{
"name":"Sport"
}
],
"tags":[
"sweater"
]
}
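Result: Creates a Document with id 1 (the _id given in the action line) of the type product inside the index ecommerce, with the values specified for each field.
Note that the bulk API expects newline-delimited JSON: the action line and the document source must each be on a single line, and the body must end with a newline. The document is expanded above only for readability.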
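Because the bulk body is newline-delimited JSON, it is easier to generate than to write by hand. A minimal sketch follows; `build_bulk_body` is a hypothetical helper and the two sample documents are shortened versions of the ones in these notes.

```python
import json


def build_bulk_body(documents):
    """Build a _bulk body from a mapping of document id -> source dict.

    Hypothetical helper. Each document produces two lines: an action line
    and the source on a single line. The body ends with a newline.
    """
    lines = []
    for doc_id, source in documents.items():
        lines.append(json.dumps({"index": {"_id": str(doc_id)}}))
        lines.append(json.dumps(source))  # json.dumps never emits raw newlines
    return "\n".join(lines) + "\n"


body = build_bulk_body(
    {
        1: {"name": "Stainless Steel Cleaner Vision", "price": "108.11", "quantity": 58},
        2: {"name": "JellyTime", "price": "100.00", "quantity": 1},
    }
)
print(body)
```

The resulting string is what would go in the body of the POST to `/ecommerce/product/_bulk`.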
#Mapping
The mapping contains Fields (title, category) and each field has a Data Type (string, long, double). The mapping also contains meta fields (_id, _type, _uid, _index).
When no mapping is defined, ES creates the mapping type and fields automatically as we add indexes and/or documents.
Mapping can be defined when creating an index, but also on existing ones by issuing a PUT request.
ES may not guess the mapping for your data correctly, BE AWARE!
If a type and field mapping already exists for an index, we CANNOT update that mapping. Instead we should create a new index and reindex our data into it.
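To see why the guessed mapping can be wrong, here is a toy imitation of dynamic mapping. It is NOT the real ES algorithm (which also detects dates, among other things); the function names and the exact type names returned are assumptions made for this illustration.

```python
def guess_field_type(value):
    """Toy type guess based only on the JSON value (assumed rules, not ES's)."""
    if isinstance(value, bool):  # must come before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # Guess from the first element; an empty array gives no hint.
        return guess_field_type(value[0]) if value else "unknown"
    return "unknown"


def guess_mapping(document):
    """Guess a type for every top-level field of a document."""
    return {field: guess_field_type(value) for field, value in document.items()}


product = {"name": "JellyTime", "price": "100.00", "quantity": 1, "tags": ["toys"]}
print(guess_mapping(product))
```

In this sketch the price is guessed as a string because it was sent as `"100.00"`, not `100.00`, which is the kind of surprise an explicit mapping avoids.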
- Core data types:
    - String:
        - Full Text string ('Los productos son pam, pem y pim'):
            - (+) The analyzer converts the string into a list of individual terms before indexing.
            - (+) Allows searching for individual words.
            - (-) Not used for sorting and rarely used for aggregations.
        - Keyword string ('tag', 'OKEY', 'marcos.segovia@uvinum.com'...):
            - (+) Used for filtering.
            - (+) Used for sorting and aggregations.
            - (-) Not analyzed, so the exact string is added to the index.
    - Numeric:
        - long: signed 64-bit integer
        - integer: signed 32-bit integer
        - short: signed 16-bit integer
        - byte: signed 8-bit integer
        - double: double-precision 64-bit floating point
        - float: single-precision 32-bit floating point
    - Date:
        - string: formatted dates (2016-01-01)
        - long number: milliseconds since the epoch
        - integer: seconds since the epoch
        - Internally it is always stored as a long number in milliseconds.
    - Boolean:
        - False values: false, "false", "off", "no", "0", "", 0, 0.0
        - True values: anything else.
    - Binary (Base64-encoded binary value): not searchable
- Complex data types:
    - Object: Documents may contain objects with inner objects. ES flattens them into a list of key-value pairs. For example, the JSON
      { "customer": { "id": 1, "age": 24, "address": { "city": "Barcelona", "country": "Spain" } } }
      will result in
      "customer.id":1, "customer.age":24, "customer.address.city":"Barcelona", "customer.address.country":"Spain"
    - Array:
        - Array of strings
        - Array of integers
        - Array of arrays
        - Array of objects (flattens the hierarchy!):
          [{"name":"Marcos", "age": 24}, {"name": "Xavi", "age": 25}]
          will result in
          {"name":["Marcos", "Xavi"], "age": [24, 25]}
    - Nested (used to index objects and maintain the hierarchy)
- Geo data types:
    - Geo Point (4 types of format): latitude and longitude pairs
    - Geo Shape: Array of arrays
- Specialized data types:
    - IPv4: Indexed as long values
    - Completion: Used for auto-complete functionality.
    - Token Count
    - Attachment (plugin required, installed on every Node): Content is stored as a Base64 string and lets ES index common formats like PDF, XLS, PPT...
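The flattening of objects and arrays of objects described in the list above can be imitated in a few lines. This is a sketch of the idea, not of ES internals; `flatten_document` and the field name `people` are made up for the example.

```python
def flatten(value, prefix=""):
    """Yield (dotted_key, leaf_value) pairs for every leaf in the structure."""
    if isinstance(value, dict):
        for key, inner in value.items():
            path = f"{prefix}.{key}" if prefix else key
            yield from flatten(inner, path)
    elif isinstance(value, list):
        # Array items share the key of the array itself: the hierarchy is lost.
        for item in value:
            yield from flatten(item, prefix)
    else:
        yield prefix, value


def flatten_document(document):
    """Collect leaves per dotted key; keys seen once keep a plain value."""
    collected = {}
    for key, leaf in flatten(document):
        collected.setdefault(key, []).append(leaf)
    return {k: v[0] if len(v) == 1 else v for k, v in collected.items()}


customer = {"customer": {"id": 1, "age": 24,
                         "address": {"city": "Barcelona", "country": "Spain"}}}
people = {"people": [{"name": "Marcos", "age": 24}, {"name": "Xavi", "age": 25}]}

print(flatten_document(customer))
print(flatten_document(people))
```

After flattening the array of objects, nothing records that Marcos is the one aged 24. Keeping that association is what the Nested type is for.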
Documents have associated metadata fields, which can be customized when defining the mapping.
- Identity:
    - _index: Allows matching Documents by the index they belong to
    - _type: Makes searching by type name faster
    - _id: Document id, is not indexed
    - _uid: Indexed as a combination of type#id
- Document Source:
    - _source: The source JSON. It is stored so that it can be returned in fetch requests. (Can be disabled to save storage space)
    - _size (plugin required): Size of the _source field in bytes
- Indexing:
    - _all: Concatenates the values of all fields into one string. Allows searching for values without knowing the field.
    - _field_names: Indexes the names of every field. Used by exists and missing queries.
- Routing:
    - _parent: Used to establish a parent-child relationship between documents.
    - _routing: Routes a Document to a particular shard. Used to define which shard stores which documents.
- Other:
    - _meta: Custom metadata, for application use.
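As a rough illustration of what _routing does: ES picks the shard as a hash of the routing value modulo the number of primary shards, and the routing value defaults to the Document id. The sketch below uses CRC32 as a stand-in hash, which is an assumption (ES uses its own hash function), so the shard numbers will not match a real cluster. `pick_shard` is a hypothetical name.

```python
import zlib


def pick_shard(routing_value: str, primary_shards: int = 5) -> int:
    """Map a routing value to a shard number in [0, primary_shards).

    Stand-in hash: CRC32 is used only for illustration.
    """
    hashed = zlib.crc32(routing_value.encode("utf-8"))
    return hashed % primary_shards


for doc_id in ["1001", "1002", "1003"]:
    print(doc_id, "-> shard", pick_shard(doc_id))
```

The same routing value always lands on the same shard, which is also why the number of primary shards is fixed when the Index is created: changing it would send existing ids to different shards.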
#Searching
##Relevancy & Scoring
A score is calculated for each document that matches the requested query. The higher the score, the more relevant the document.
Queries may apply to 2 different contexts:
- Query Context: Documents match according to the score achieved
- Filter Context: Documents match if they satisfy the filtering requirements
##Ways of Searching
###Query String
The query is defined in the request URI itself. All fields are searched by default. Suited to simple queries.
Request: GET http://127.0.0.1:9200/ecommerce/product/_search?q=jellytime
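When the search text contains spaces or special characters, the query string has to be URL-encoded. A small sketch using only the standard library; `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_search_url(base_url, index, doc_type, query):
    """Build a query-string search URL with the text passed in the q parameter."""
    return f"{base_url}/{index}/{doc_type}/_search?" + urlencode({"q": query})


simple = build_search_url("http://127.0.0.1:9200", "ecommerce", "product", "jellytime")
spaced = build_search_url("http://127.0.0.1:9200", "ecommerce", "product", "peanut butter")
print(simple)
print(spaced)
```

`urlencode` turns the space into `+`, so the second URL stays valid.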
###Query DSL
A query is defined in the body as JSON. Common way to make advanced queries.
Request: GET http://127.0.0.1:9200/ecommerce/product/_search
Body:
{
"query": {
"match": {
"name": "jellytime"
}
}
}
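Building the body from a dictionary instead of typing JSON by hand avoids syntax slips such as trailing commas. A minimal sketch; `match_query` is a hypothetical helper name.

```python
import json


def match_query(field, text):
    """Return a Query DSL body with a single match query on one field."""
    return {"query": {"match": {field: text}}}


body = json.dumps(match_query("name", "jellytime"), indent=2)
print(body)
```

The printed string is valid JSON and can be used as the body of the `_search` request above.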
##Types of Queries
###Leaf Queries
Look for particular values in particular fields. The simplest unit of a query.
###Compound Queries
Wrap leaf queries or other compound queries. Useful for combining multiple queries.
###Full text Queries
Run full text queries on full text fields, for example searching the body of an email in an index of emails. We can apply a specific analyzer to get different behaviour depending on our needs.
###Term level Queries
Look for exact matches of values. Usually used for years, numbers, etc.
A document that has a field with the value pasta! would not be findable with a term level query on that field for pasta. On the other hand, submitting a full text query with the default analyzer will result in finding the document.
###Joining Queries
ES provides 2 forms of join designed for horizontal scalability:
- Nested query: To query each specific object inside Nested fields
- has_child/has_parent query: To query documents that have a parent-child relationship, so we can apply clauses to the parent or to the child.
###Geo Queries
##Queries Context
There are two query contexts:
###Query Context
Queries in query context affect the relevance score of Documents, depending on how well they match.
###Filter Context
Queries in filter context do not affect relevance score. They are used to exclude Documents from the results if they do not satisfy the requirements.
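Both contexts can appear in the same request. One common way, sketched here, is a `bool` query: clauses under `must` run in query context and contribute to the score, while clauses under `filter` only include or exclude Documents. The helper name `search_body` and the choice of fields (taken from the sample product) are assumptions for the example.

```python
def search_body(text, status, min_quantity):
    """Combine a scored match with non-scoring filters in one bool query."""
    return {
        "query": {
            "bool": {
                # Query context: how well the name matches affects the score.
                "must": [{"match": {"name": text}}],
                # Filter context: yes/no checks, no effect on the score.
                "filter": [
                    {"term": {"status": status}},
                    {"range": {"quantity": {"gte": min_quantity}}},
                ],
            }
        }
    }


body = search_body("jellytime", "active", 1)
print(body)
```

Only the `match` clause influences ordering; the `term` and `range` clauses just narrow the result set.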
##Aggregations
Aggregations are a way of grouping and extracting statistics from your data.
Similar to GROUP BY in MySQL.
ES lets you execute searches and return the hits as normal, while also returning aggregated results in the same response. The aggregated results are separate from the search hits, so we can get different groups of data in a single request.
There are a few types of aggregations:
- Metric
- Bucket
- Pipeline
Metric aggregations work on values extracted from the aggregated documents. Most of them output a single numeric value, but some return multiple numeric values.
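As an illustration of a bucket aggregation with a metric inside it, the sketch below builds a request body that groups products by status and averages the quantity per group. The aggregation names `by_status` and `avg_quantity` and the helper name are made up; the fields come from the sample product used in these notes.

```python
def average_quantity_by_status(text=None):
    """Build a search body with a terms bucket and a nested avg metric.

    Hypothetical helper: `text` is an optional full text filter on the name.
    """
    query = {"match": {"name": text}} if text else {"match_all": {}}
    return {
        "query": query,
        "aggs": {
            "by_status": {  # bucket: one group per distinct status
                "terms": {"field": "status"},
                "aggs": {
                    # metric: a single number computed per bucket
                    "avg_quantity": {"avg": {"field": "quantity"}}
                },
            }
        },
    }


body = average_quantity_by_status()
print(body)
```

The response to such a request would contain the normal hits plus a separate section with one bucket per status, roughly what `GROUP BY status` with `AVG(quantity)` gives in SQL.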