marcossegovia/Elastic_search_course.md Secret

## Elastic_search_course.md

      
#ElasticSearch
#Features

ES is based on Apache Lucene
Written in Java
Focused on scalability and distributed from the ground up
Designed to take, analyze and search data from any source
JSON-Schema

Communication is done through HTTP REST API

curl -X <REST Verb> <Node>:<Port>/<Index>/<Type>/<ID>
curl -X GET http://uvinum.marcos.vm:9200/person/employee/123
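
The URL pattern above can also be assembled in code. A minimal Python sketch; the helper name `document_url` is hypothetical, and the host is the one from the curl example:

```python
from urllib.parse import quote


def document_url(node: str, port: int, index: str, doc_type: str, doc_id: str) -> str:
    """Build http://<Node>:<Port>/<Index>/<Type>/<ID>."""
    # Percent-encode each path segment so unusual ids cannot break the path.
    path = "/".join(quote(part, safe="") for part in (index, doc_type, doc_id))
    return f"http://{node}:{port}/{path}"


url = document_url("uvinum.marcos.vm", 9200, "person", "employee", "123")
```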

#Terminology

RealTime Engine (near real time): A change to an index is propagated through its entire cluster and becomes searchable shortly afterwards.
Even for large clusters we are talking about a delay of about 1 second.
Cluster: Collection of Nodes. A cluster consists of one or more Nodes, depending on the scale. It provides indexing and searching capabilities across all Nodes. Named "elasticsearch" by default.
Node: A single server that is part of a cluster. Stores the searchable data: all of it if there is only one Node, otherwise part of it. By default joins the Cluster named "elasticsearch".
Index: Collection of Documents (of the same or different Types). Identified by a lowercase name, which is used when indexing, searching, updating and deleting documents. Can have one or more Types defined. Can have as many Documents as you want.
Type: Class/Category of similar Documents, e.g. product, account, user. Consists of a name and a Mapping (the mapping does not need to be explicitly defined). Stored in the metadata field _type (searching a Type implies filtering by this field).
Mapping: Describes the fields of a Document of a given Type and their data types: string, integer... Also includes HOW fields should be indexed and stored. If no mapping is defined, Dynamic Mapping defines it based on the fields it finds.
Document: Basic unit of information that can be indexed. Key-Value fields. Can be a single User, Order, Product... JSON based.


Similarity with a Relational Database: Index -> Database, Type -> Table, Document -> Row


Shards: Pieces into which an Index can be divided. Each Shard is a fully functional and independent index. The number of Shards can be specified when creating an Index (default is 5). Shards allow horizontal scaling: when an Index contains a lot of data and a single Node does not have enough storage space, its Shards can be spread across different Nodes to provide the necessary space. They also allow distribution and parallelization, which means performance.
Replicas: Copies of a Shard (default is 1 per Shard). A replica never resides on the same Node as its original Shard. Replicas provide high availability when a Shard or Node fails and allow scaling search volume, because search queries can be executed on all replicas in parallel.


So by default each Index has 5 Shards and 1 Replica for each of them (5 Replicas).
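
The arithmetic behind those defaults, as a small Python sketch (the function name is only for illustration; the defaults of 5 and 1 are the ones stated in these notes):

```python
def total_shards(primary_shards: int = 5, replicas_per_shard: int = 1) -> int:
    """Primary shards plus one copy of every primary per configured replica."""
    return primary_shards * (1 + replicas_per_shard)


default_total = total_shards()  # 5 primaries + 5 replicas
```

With 3 primaries and 2 replicas per shard the same formula gives 9 shards to place on the Nodes.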

#Basic Requests
We should use an HTTP Client to perform the requests. (Ex: Postman)

Request: PUT "http://127.0.0.1:9200/ecommerce"
Result: Creates an index ecommerce

Request: PUT "http://127.0.0.1:9200/ecommerce/product/1001"
Body:
{
	"name": "JellyTime",
	"price": "100.00",
	"description": "Peanut Butter JellyTime !",
	"status": "active",
	"quantity": 1,
	"categories":[
		{"name": "Toys"}
	],
	"tags": ["jellytime", "toys", "yellow"]
}
Result: Creates a Document in the index ecommerce of the type product with id 1001

We can replace the whole Document by sending another PUT request to the same URL with the new body, including all of its fields.


Request: POST "http://127.0.0.1:9200/ecommerce/product/1001/_update"
Body:
{
	"doc":{
		"price": 50.00
	}
}
Result: It will update ONLY the specified fields of the current Document.
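
To make the difference with a full replace concrete, here is a local Python model of what a partial update does to a Document. It is only a sketch of the merge semantics (fields in "doc" overwrite, inner objects are merged recursively, everything else is kept) and not ES code; `apply_partial_update` is a hypothetical name.

```python
import copy


def apply_partial_update(document: dict, partial: dict) -> dict:
    """Return a copy of document with the fields of partial merged in."""
    merged = copy.deepcopy(document)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Inner objects are merged field by field instead of replaced.
            merged[key] = apply_partial_update(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


product = {"name": "JellyTime", "price": "100.00", "status": "active", "quantity": 1}
updated = apply_partial_update(product, {"price": 50.00})
```

Only `price` changes in `updated`; `name`, `status` and `quantity` are untouched.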

Request: DELETE "http://127.0.0.1:9200/ecommerce"
Result: Deletes the index ecommerce
Request: DELETE "http://127.0.0.1:9200/ecommerce/product/1001"
Result: Deletes the Document 1001 of the type product inside the index ecommerce

Request: GET "http://127.0.0.1:9200/_cat/indices?v"
Result: Shows information about the indices in our Cluster

Data Bulking

The bulk API performs many operations in a single request. Each Document is preceded by an action line; the index action creates the Document with the data provided. Each JSON object must fit on a single line and the body must end with a newline.
Request: POST "http://127.0.0.1:9200/ecommerce/product/_bulk"
Body:
{"index":{"_id":"1"}}
{"name":"Stainless Steel Cleaner Vision","price":"108.11","description":"Nullam orci pede, venenatis non, sodales sed, tincidunt eu, felis. Fusce posuere felis sed lacus. Morbi sem mauris, laoreet ut, rhoncus aliquet, pulvinar sed, nisl. Nunc rhoncus dui vel sem. Sed sagittis. Nam congue, risus semper porta volutpat, quam pede lobortis ligula, sit amet eleifend pede libero quis orci. Nullam molestie nibh in lectus. Pellentesque at nulla. Suspendisse potenti. Cras in purus eu magna vulputate luctus. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Vivamus vestibulum sagittis sapien. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Etiam vel augue.","status":"active","quantity":58,"categories":[{"name":"Electronics"},{"name":"Sport"}],"tags":["sweater"]}
Result: Creates a Document with id 1 of the type product inside the index ecommerce with the values specified for each field.
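
Writing a bulk body by hand is error-prone because it is newline-delimited JSON, not one JSON document. A minimal Python sketch that generates it; `bulk_body` is a hypothetical helper and the second sample product is made up for illustration:

```python
import json


def bulk_body(documents: dict) -> str:
    """Build a bulk body: one action line, then one source line, per document."""
    lines = []
    for doc_id, source in documents.items():
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        # json.dumps without indent keeps every document on a single line.
        lines.append(json.dumps(source))
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n"


body = bulk_body({
    "1": {"name": "Stainless Steel Cleaner Vision", "quantity": 58},
    "2": {"name": "Hypothetical Product", "quantity": 3},
})
```

The resulting string can be sent as the body of the POST request shown above.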
#Mapping
The mapping contains Fields (title, category) and each field has a Data Type (string, long, double). The mapping also contains meta fields (_id, _type, _uid, _index).
When no mapping is defined, ES creates the mapping type and fields automatically as we add indexes and/or documents.
A mapping can be defined when creating an index, but also added to existing ones by issuing a PUT request.

ES may not guess the mapping for your data correctly, BE AWARE!
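
A rough Python model of why the guess can be wrong. This is a deliberately simplified sketch, not the real Dynamic Mapping algorithm (which also does things like date detection, and whose exact type names vary by ES version); the type names used here follow the ones listed in these notes.

```python
def guess_field_type(value) -> str:
    """Guess a field type from a JSON value, the way a naive dynamic mapper might."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # An array takes the type of its first element in this simplified model.
        return guess_field_type(value[0]) if value else "unknown"
    return "unknown"


def guess_mapping(document: dict) -> dict:
    return {field: guess_field_type(value) for field, value in document.items()}


guessed = guess_mapping({"name": "JellyTime", "price": "100.00", "quantity": 1})
```

Because the sample Document sends the price as the JSON string "100.00", it is guessed as a string and not as a number, which is probably not what we want for sorting or range queries.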

If a field mapping already exists for an index we CANNOT change it; instead we should create a new index with the right mapping and reindex our data into that new index.
Data Types


Core data types:


String:

Full Text string: + Analyzer converts the string into a list of individual terms before indexing. + Allows searching for individual words. - Not used for sorting and rarely used for aggregations. Ex: 'Los productos son pam, pem y pim'
Keyword string: + Used for filtering + Used for sorting and aggregations - Not analyzed, so the exact string is added to the index ('tag', 'OKEY', 'marcos.segovia@uvinum.com'...)


Numeric:

long: signed 64-bit integer
integer: signed 32-bit integer
short: signed 16-bit integer
byte: signed 8-bit integer
double: double-precision 64-bit floating point
float: single-precision 32-bit floating point


Date:

string: formatted dates (2016-01-01)
long number: milliseconds since the epoch
integer: seconds since the epoch


Internally the date is always stored as a long number in milliseconds since the epoch.
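
As an illustration of that internal representation, a short Python sketch that converts the accepted inputs to milliseconds since the epoch. The function names are hypothetical, and UTC is assumed for dates without a time zone:

```python
from datetime import datetime, timezone


def date_string_to_millis(value: str) -> int:
    """Parse a yyyy-MM-dd date and return milliseconds since the epoch (UTC assumed)."""
    parsed = datetime.strptime(value, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(parsed.timestamp()) * 1000


def seconds_to_millis(seconds: int) -> int:
    """Convert a seconds-since-epoch value to milliseconds."""
    return seconds * 1000


millis = date_string_to_millis("2016-01-01")
```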


Boolean:

False values: false, "false", "off", "no", "0", "", 0, 0.0
True values: Anything else.
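
The rule above fits in a few lines of Python. This is only a sketch of the list given in these notes, not ES code:

```python
# The values these notes list as false; anything else counts as true.
FALSE_VALUES = (False, "false", "off", "no", "0", "", 0, 0.0)


def to_boolean(value) -> bool:
    """Interpret a JSON value as a boolean following the list above."""
    return value not in FALSE_VALUES
```

For example, `to_boolean("off")` is false while `to_boolean("yes")` and `to_boolean(1)` are true.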


Binary (binary value Base64) - Not searchable


Complex data types:

Object: Documents may contain objects with inner objects. ES flattens them into a list of key-value pairs, so for example:

{
    "customer": {
        "id": 1,
        "age": 24,
        "address": {
            "city": "Barcelona",
            "country": "Spain"
        }
    }
}
Will result in
"customer.id":1,
"customer.age":24,
"customer.address.city":"Barcelona",
"customer.address.country":"Spain"

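The flattening of inner objects into dotted keys can be reproduced with a short recursive function. A Python sketch for illustration only (`flatten` is a hypothetical helper, not something ES exposes):

```python
def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested objects into a single level of dotted keys."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into inner objects, carrying the path built so far.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


flat = flatten({
    "customer": {
        "id": 1,
        "age": 24,
        "address": {"city": "Barcelona", "country": "Spain"},
    }
})
```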

Array:

Array of strings
Array of integers
Array of arrays
Array of objects (flattens the hierarchy !):

[{"name":"Marcos", "age": 24}, {"name": "Xavi", "age": 25}]

Will result in
{"name":["Marcos", "Xavi"], "age": [24, 25]}


Nested (Used to index objects and maintain the hierarchy)
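
Why the hierarchy matters: once an array of objects is flattened, the link between the name and the age of each person is lost. A small Python model of the problem (all function names are hypothetical, and the matching logic is a simplification of what a real query does):

```python
def flatten_object_array(objects: list) -> dict:
    """Collect the values of each field across all objects of the array."""
    columns = {}
    for obj in objects:
        for key, value in obj.items():
            columns.setdefault(key, []).append(value)
    return columns


def flattened_match(columns: dict, **wanted) -> bool:
    """Match if every wanted value appears somewhere in its field."""
    return all(value in columns.get(field, []) for field, value in wanted.items())


def nested_match(objects: list, **wanted) -> bool:
    """Match only if a single object satisfies all the wanted values."""
    return any(all(obj.get(f) == v for f, v in wanted.items()) for obj in objects)


people = [{"name": "Marcos", "age": 24}, {"name": "Xavi", "age": 25}]
columns = flatten_object_array(people)
```

Searching for name "Marcos" AND age 25 matches the flattened form, even though no such person exists, while the nested form correctly rejects it.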


Geo data types:

Geo Point (4 types of format): latitude and longitude pairs
Geo Shape: Array of arrays


Specialized data types:

IPv4: Indexed as long values
Completion: Used for auto-complete functionality.
Token Count: Counts the number of tokens in a string.
Attachment (plugin required and installed in every Node): Content is stored as Base64 string and lets ES index common formats like PDF, XLS, PPT...


Meta Fields

Documents have metadata fields associated with them. They can be customized when defining the mapping.


Identity:

_index: The index the Document belongs to. Allows matching Documents by index
_type: Makes searching by type name faster
_id: Document id, is not indexed
_uid: Indexed as a combination of type#id


Document Source:

_source: The original source JSON, stored so it can be returned by fetch requests. (Can be disabled to save storage space)
_size(plugin required): Size of the _source field in bytes


Indexing:
_all: concatenates all the values of the fields in one string. Allows searching for values without knowing the field.
_field_names: indexes the names of every field. Used by exists and missing queries.


Routing:

_parent: Used to establish parent-child relationship between documents.
_routing: Routes a Document to a particular shard. Used to define which shard stores which documents.
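
Routing follows the formula shard = hash(routing) % number_of_primary_shards, where the routing value defaults to the Document id. A Python sketch of the idea; ES uses its own hash function, so `zlib.crc32` here is only a stand-in and the shard numbers will not match a real cluster:

```python
import zlib


def pick_shard(routing: str, primary_shards: int = 5) -> int:
    """Map a routing value to a shard number in [0, primary_shards)."""
    # crc32 is an assumption for illustration; ES hashes differently.
    return zlib.crc32(routing.encode("utf-8")) % primary_shards


shard = pick_shard("1001")
```

The same routing value always lands on the same shard, which is also why the number of primary shards cannot change without reindexing: a different divisor would send existing Documents elsewhere.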


Other:

_meta: Custom metadata. For application use.


#Searching
##Relevancy & Scoring
A score is calculated for each document that matches the requested query. The higher the score, the more relevant the document.
Queries may apply in 2 different contexts:

Query Context: Documents match according to the score achieved
Filter Context: Documents match only if they satisfy the filtering requirements

##Ways of Searching
###Query String
The query is defined in the request URI itself. All fields are searched by default. Meant for simple queries.
Request: GET http://127.0.0.1:9200/ecommerce/product/_search?q=jellytime
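
When the query text contains spaces or a colon it has to be URL-encoded. A minimal Python sketch that builds the search URL; `search_url` is a hypothetical helper:

```python
from urllib.parse import urlencode


def search_url(base: str, index: str, doc_type: str, query: str) -> str:
    """Build a query-string search URL with the q parameter encoded."""
    return f"{base}/{index}/{doc_type}/_search?{urlencode({'q': query})}"


url = search_url("http://127.0.0.1:9200", "ecommerce", "product", "jellytime")
```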
###Query DSL
A query is defined in the body as JSON. Common way to make advanced queries.
Request: GET http://127.0.0.1:9200/ecommerce/product/_search
Body:
{
  "query": {
    "match": {
      "name": "jellytime"
    }
  }
}
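
Building the body from a data structure and serializing it avoids hand-written JSON mistakes such as trailing commas. A Python sketch, with `match_query` as a hypothetical helper name:

```python
import json


def match_query(field: str, text: str) -> dict:
    """Build the body of a match query on a single field."""
    return {"query": {"match": {field: text}}}


body = json.dumps(match_query("name", "jellytime"), indent=2)
```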
##Types of Queries
###Leaf Queries
Look for particular values in particular fields. Simple unit of query.
###Compound Queries
Wrap leaf queries or other Compound queries. Useful for combining multiple queries.
###Full text Queries
Run full text queries on full text fields, for example searching the body of an email in an index of emails. We can apply a specific analyzer to get different behaviour depending on our needs.
###Term level Queries
Look for exact matches of values. Usually used for years, numbers, etc.
For example, if an analyzed field contains Pasta!, the default analyzer indexes it as the term pasta. A term level query for Pasta! will not find the document, because it looks for that exact value among the indexed terms. On the other hand, a full text query for Pasta! is analyzed in the same way before searching and will find the document.
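
A toy Python model of the difference. The analyzer here (lowercase, split on anything that is not a letter or digit) is only an assumption standing in for the default analyzer, and `TinyIndex` is a made-up class, not an ES API:

```python
import re
from collections import defaultdict


def analyze(text: str) -> list:
    """Toy analyzer: lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Minimal inverted index: term -> ids of the documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id: int, text: str) -> None:
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def term_query(self, value: str) -> set:
        # Term level: the value is looked up exactly as given, without analysis.
        return set(self.postings.get(value, set()))

    def match_query(self, text: str) -> set:
        # Full text: the query text goes through the same analyzer first.
        hits = set()
        for term in analyze(text):
            hits |= self.postings.get(term, set())
        return hits


index = TinyIndex()
index.add(1, "Pasta!")
term_hits = index.term_query("Pasta!")
match_hits = index.match_query("Pasta!")
```

In this model `term_hits` is empty while `match_hits` contains document 1.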
###Joining Queries
ES provides 2 forms of join designed for horizontal scalability

Nested query: To query each specific object inside Nested fields
has_child/has_parent query: To query documents that have a parent-child relationship, so we can apply clauses to the parent or to the child.

###Geo Queries
##Queries Context
There are two query contexts:
###Query Context
Queries in query context affect the relevance score of Documents, depending on how well they match.
###Filter Context
Queries in filter context do not affect relevance score. They are used to exclude Documents from the results if they do not satisfy the requirements.
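
Both contexts can be combined in one request with a bool query: clauses under "must" run in query context and are scored, clauses under "filter" run in filter context and only include or exclude. A Python sketch of such a body, assuming an ES version where bool accepts a filter clause (2.0 or later); the field names come from the sample product and `search_body` is a hypothetical helper:

```python
def search_body(text: str, status: str) -> dict:
    """Score by name match, but only among documents with the given status."""
    return {
        "query": {
            "bool": {
                # Query context: contributes to the relevance score.
                "must": [{"match": {"name": text}}],
                # Filter context: yes/no check, no effect on the score.
                "filter": [{"term": {"status": status}}],
            }
        }
    }


body = search_body("jellytime", "active")
```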
##Aggregations
Aggregations are a way of grouping and extracting statistics from your data.

Similar to GROUP BY in MySQL

ES can execute a search and return the hits as usual while also returning aggregation results in the same response. These aggregation results are kept separate from the search hits, so we can get different groups of data in a single request.
There are a few types of aggregations:

Metric
Bucket
Pipeline

Metric aggregations work on values extracted from the aggregated documents. Most of them output a single numeric value, while others return several numeric values at once.
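
A Python sketch of a request body with both kinds of metric aggregation: `avg` returns a single value and `stats` returns several (count, min, max, avg, sum). The aggregation names and the `metrics_body` helper are made up for illustration, and `quantity` is the numeric field of the sample products:

```python
def metrics_body(field: str) -> dict:
    """Build a search body that asks only for metric aggregations on one field."""
    return {
        # size 0: we only want the aggregation results, not the hits.
        "size": 0,
        "aggs": {
            "average_value": {"avg": {"field": field}},
            "value_stats": {"stats": {"field": field}},
        },
    }


body = metrics_body("quantity")
```

The body would be sent to the same _search endpoint used for queries.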