@marcossegovia
Last active November 10, 2016 12:46

#ElasticSearch

#Features

  • ES based on Apache Lucene
  • Written in Java
  • Focused on scalability and distributed from the ground up
  • Designed to take, analyze and search data from any source
  • JSON-Schema

Communication is done through HTTP REST API

#Terminology

  • Near real-time engine: A change to an index is propagated through the entire cluster. Even on large clusters, the delay before a change becomes searchable is about 1 second.
  • Cluster: A collection of one or more Nodes, depending on the scale. A Cluster provides indexing and searching capabilities across all its Nodes. Named "elasticsearch" by default.
  • Node: A single server that is part of a Cluster and stores searchable data. It stores all the data if it is the only Node; otherwise it stores part of the data. By default it joins the Cluster named "elasticsearch".
  • Index: A collection of Documents (of the same or different Types). Identified by a lowercase name, which is used when indexing, searching, updating and deleting Documents. Can have one or more Types defined and as many Documents as you want.
  • Type: A class/category of similar Documents, e.g. product, account, user. Consists of a name and a Mapping (the Mapping does not need to be explicitly defined). Stored in the metadata field _type; searching within a Type implies filtering by this field.
  • Mapping: Describes the fields of a Document of a given Type and their data types (string, integer...). Also defines HOW fields should be indexed and stored. If no Mapping is defined, Dynamic Mapping derives one from the fields it finds.
  • Document: The basic unit of information that can be indexed. Made of key-value fields and expressed as JSON. Can be a single User, Order, Product...

Similarity with a Relational Database: Index -> Database, Type -> Table, Document -> Row

  • Shards: The pieces an Index can be divided into. Each Shard is a fully functional and independent index. The number of Shards can be specified when creating an Index (default is 5). Shards allow scaling horizontally: when an Index holds more data than a single Node can store, its Shards can be spread across other Nodes to provide the necessary space. They also allow distribution and parallelization of operations, which improves performance.
  • Replicas: Copies of a Shard (default is 1 per Shard). A Replica never resides on the same Node as its original Shard. Replicas provide high availability when a Shard or Node fails, and allow scaling search volume because searches can be executed on all Replicas in parallel.

So by default each Index has 5 Shards and 1 Replica for each of its Shards (5 Replicas).
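
The arithmetic behind those defaults can be sketched in a few lines of Python. The function names below are hypothetical helpers for these notes; `number_of_shards` and `number_of_replicas` are the index settings that would be sent in the body of the PUT request that creates the index.

```python
def shard_layout(primaries: int = 5, replicas_per_primary: int = 1) -> dict:
    """Count the shard copies one index occupies across the cluster."""
    replica_shards = primaries * replicas_per_primary
    return {
        "primary_shards": primaries,
        "replica_shards": replica_shards,
        "total_shards": primaries + replica_shards,
    }


def index_settings(primaries: int = 5, replicas_per_primary: int = 1) -> dict:
    """Request body that sets the shard and replica counts explicitly."""
    return {
        "settings": {
            "number_of_shards": primaries,
            "number_of_replicas": replicas_per_primary,
        }
    }


print(shard_layout())          # layout with the defaults described above
print(index_settings(3, 2))    # 3 primaries, 2 replicas of each
```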

#Basic Requests

We should use an HTTP Client to perform the requests. (Ex: Postman)


Request: PUT "http://127.0.0.1:9200/ecommerce"

Result: Creates an index ecommerce


Request: PUT "http://127.0.0.1:9200/ecommerce/product/1001"

Body:

{
	"name": "JellyTime",
	"price": "100.00",
	"description": "Peanut Butter JellyTime !",
	"status": "active",
	"quantity": 1,
	"categories":[
		{"name": "Toys"}
	],
	"tags": ["jellytime", "toys", "yellow"]
}

Result: Creates a Document with id 1001 of the type product in the index ecommerce

We can replace the whole Document by sending another PUT request to the same URL with a new body. The body must contain all the fields, because it replaces the previous version entirely.
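
For scripting instead of Postman, the same request can be prepared with Python's standard library. This is a minimal sketch: `build_index_request` is a hypothetical helper, and the request is only built here, not sent, so it runs without a live node.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:9200"  # same local node as in the requests above


def build_index_request(index, doc_type, doc_id, document):
    """Prepare a PUT that creates (or fully replaces) one Document."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


product = {
    "name": "JellyTime",
    "price": "100.00",
    "description": "Peanut Butter JellyTime !",
    "status": "active",
    "quantity": 1,
    "categories": [{"name": "Toys"}],
    "tags": ["jellytime", "toys", "yellow"],
}

request = build_index_request("ecommerce", "product", 1001, product)
print(request.get_method(), request.full_url)
# Against a running node it could be sent with: urllib.request.urlopen(request)
```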


Request: POST "http://127.0.0.1:9200/ecommerce/product/1001/_update"

Body:

{
	"doc":{
		"price": 50.00
	}
}

Result: It will update ONLY the specified fields of the current Document.
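
The effect of a partial update can be modelled locally. The sketch below is an illustration of the merge behaviour, not Elasticsearch's own code: fields in "doc" overwrite the stored values, inner objects are merged recursively, and everything else is left untouched. `apply_partial_update` is a hypothetical name.

```python
import copy


def apply_partial_update(source: dict, partial: dict) -> dict:
    """Local model of an _update with a "doc" body: merge partial into source."""
    merged = copy.deepcopy(source)  # leave the stored version untouched
    for field, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(field), dict):
            # Inner objects are merged field by field
            merged[field] = apply_partial_update(merged[field], value)
        else:
            # Plain values and arrays are replaced
            merged[field] = value
    return merged


stored = {"name": "JellyTime", "price": "100.00", "status": "active", "quantity": 1}
updated = apply_partial_update(stored, {"price": 50.00})
print(updated)
```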


Request: DELETE "http://127.0.0.1:9200/ecommerce"

Result: Deletes the index ecommerce

Request: DELETE "http://127.0.0.1:9200/ecommerce/product/1001"

Result: Deletes the Document 1001 of the type product inside the index ecommerce


Request: GET "http://127.0.0.1:9200/_cat/indices?v"

Result: Shows information about the indices in our Cluster


Data Bulking

The bulk API performs several operations in a single request. The body is newline-delimited JSON: each action line is followed by the Document source on a single line, and the body must end with a newline. The index action creates the Document with the data provided, or replaces it if it already exists.

Request: POST "http://127.0.0.1:9200/ecommerce/product/_bulk"

Body:

{"index":{"_id":"1"}}
{"name":"Stainless Steel Cleaner Vision","price":"108.11","description":"Nullam orci pede, venenatis non, sodales sed, tincidunt eu, felis. Fusce posuere felis sed lacus. Morbi sem mauris, laoreet ut, rhoncus aliquet, pulvinar sed, nisl. Nunc rhoncus dui vel sem. Sed sagittis. Nam congue, risus semper porta volutpat, quam pede lobortis ligula, sit amet eleifend pede libero quis orci. Nullam molestie nibh in lectus. Pellentesque at nulla. Suspendisse potenti. Cras in purus eu magna vulputate luctus. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Vivamus vestibulum sagittis sapien. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Etiam vel augue.","status":"active","quantity":58,"categories":[{"name":"Electronics"},{"name":"Sport"}],"tags":["sweater"]}

Result: Creates a Document with id 1 of the type product inside the index ecommerce with the values specified for each field.
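
Writing a bulk body by hand is error-prone, so it may help to generate it. The sketch below assumes a hypothetical helper `build_bulk_body` that takes a dictionary of id to Document and emits one action line plus one source line per Document, ending with the final newline.

```python
import json


def build_bulk_body(documents: dict) -> str:
    """Serialize {id: source} pairs as newline-delimited JSON for _bulk."""
    lines = []
    for doc_id, source in documents.items():
        lines.append(json.dumps({"index": {"_id": str(doc_id)}}))  # action line
        lines.append(json.dumps(source))                           # source on ONE line
    return "\n".join(lines) + "\n"  # trailing newline included


bulk_body = build_bulk_body({
    1: {"name": "Stainless Steel Cleaner Vision", "price": "108.11", "quantity": 58},
    2: {"name": "JellyTime", "price": "100.00", "quantity": 1},
})
print(bulk_body)
```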

#Mapping

A mapping contains fields (title, category), each with a data type (string, long, double), plus meta fields (_id, _type, _uid, _index).

When no mapping is defined, ES creates the mapping type and its fields automatically as we add indices and/or documents.

A mapping can be defined when creating an index, but also added to an existing one by issuing a PUT request.

ES may not guess the mapping for your data correctly, BE AWARE!
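
To see why the guess can go wrong, here is a deliberately simplified model of dynamic mapping in Python. It is an assumption-laden sketch: the real rules also include things like date detection, and the exact numeric types chosen vary by version. `guess_field_type` and `guess_mapping` are hypothetical names.

```python
def guess_field_type(value) -> str:
    """Simplified stand-in for dynamic mapping: pick a type from the JSON value."""
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # An array takes the type of its first element
        return guess_field_type(value[0]) if value else "unknown"
    return "unknown"


def guess_mapping(document: dict) -> dict:
    return {field: guess_field_type(value) for field, value in document.items()}


product = {"name": "JellyTime", "price": "100.00", "quantity": 1, "tags": ["toys"]}
print(guess_mapping(product))
```

Because the price was sent as the string "100.00", it is guessed as a string rather than a number, which is the kind of surprise an explicit mapping avoids.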

If a type and field mapping already exists for an index, we CANNOT update that mapping. Instead, we should create a new index with the desired mapping and reindex our data into it.

Data Types

  • Core data types:

    • String:

      • Full text string: + The analyzer converts the string into a list of individual terms before indexing. + Allows searching for individual words. - Not used for sorting and rarely used for aggregations. Example: 'Los productos son pam, pem y pim'
      • Keyword string: + Used for filtering. + Used for sorting and aggregations. - Not analyzed, so the exact string is added to the index ('tag', 'OKEY', 'marcos.segovia@uvinum.com'...)
    • Numeric:

      • long: signed 64-bit integer
      • integer: signed 32-bit integer
      • short: signed 16-bit integer
      • byte: signed 8-bit integer
      • double: double-precision 64-bit floating point
      • float: single-precision 32-bit floating point
    • Date:

      • string: formatted dates (2016-01-01)
      • long number: milliseconds since the epoch
      • integer: seconds since the epoch

      Internally it is always stored as a long number of milliseconds

    • Boolean:

      • False values: false, "false", "off", "no", "0", "", 0, 0.0
      • True values: Anything that is not a false value.
    • Binary (binary value Base64) - Not searchable

  • Complex data types:

    • Object: Documents may contain objects with inner objects. ES flattens them into a list of key-value pairs. For example, this JSON:
    {
        "customer": {
            "id": 1,
            "age": 24,
            "address": {
                "city": "Barcelona",
                "country": "Spain"
            }
        }
    }

    Will result in

    "customer.id":1,
    "customer.age":24,
    "customer.address.city":"Barcelona",
    "customer.address.country":"Spain"
    
    • Array:
      • Array of strings
      • Array of integers
      • Array of arrays
      • Array of objects (flattens the hierarchy !):
      [{"name":"Marcos", "age": 24}, {"name": "Xavi", "age": 25}]
      
      Will result in
      {"name":["Marcos", "Xavi"], "age": [24, 25]}
      
    • Nested (Used to index objects and maintain the hierarchy)
  • Geo data types:

    • Geo Point (4 types of format): latitude and longitude pairs
    • Geo Shape: Array of arrays
  • Specialized data types:

    • IPv4: Indexed as long values
    • Completion: Used for auto-complete functionality.
    • Token Count
    • Attachment (plugin required and installed in every Node): Content is stored as Base64 string and lets ES index common formats like PDF, XLS, PPT...
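
The flattening described for the Object and Array-of-objects types can be reproduced with a short Python sketch. This is an illustration of the resulting key-value shape, not how Lucene stores it; `flatten` is a hypothetical helper, and for simplicity it unwraps paths that end up with a single value.

```python
def flatten(document: dict) -> dict:
    """Flatten inner objects into dotted paths; arrays of objects lose their grouping."""
    collected = {}

    def walk(value, path):
        if isinstance(value, dict):
            for key, inner in value.items():
                walk(inner, f"{path}.{key}" if path else key)
        elif isinstance(value, list):
            for item in value:
                walk(item, path)  # every element lands under the same path
        else:
            collected.setdefault(path, []).append(value)

    walk(document, "")
    # Paths with one value are shown as a scalar, the rest as lists
    return {p: v[0] if len(v) == 1 else v for p, v in collected.items()}


print(flatten({"customer": {"id": 1, "address": {"city": "Barcelona"}}}))
print(flatten({"users": [{"name": "Marcos", "age": 24}, {"name": "Xavi", "age": 25}]}))
```

In the second call the names and ages end up in two independent lists, so the link between "Marcos" and 24 is gone. That is the problem the Nested type addresses.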

Meta Fields

Documents have associated metadata fields. They can be customized when defining the mapping.

  • Identity:

    • _index: Allows matching Documents by their index
    • _type: Makes searching by type name faster
    • _id: Document id; not indexed
    • _uid: Indexed as a combination of type#id
  • Document Source:

    • _source: The source JSON. It is stored so that it can be returned by fetch requests. (Can be disabled to save storage space)
    • _size(plugin required): Size of the _source field in bytes
  • Indexing:

    • _all: Concatenates the values of all fields into one string. Allows searching for values without knowing the field.
    • _field_names: Indexes the names of every field. Used by exists and missing queries.

  • Routing:

    • _parent: Used to establish parent-child relationship between documents.
    • _routing: Routes a Document to a particular shard. Used to define which shard stores which documents.
  • Other:

    • _meta: Custom metadata. For application use.
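
How _routing decides the shard can be sketched as a hash taken modulo the number of primary shards, with the Document id used as the routing value when none is given. In the sketch below `zlib.crc32` is only a stand-in hash chosen for illustration (Elasticsearch uses its own hash function), and `pick_shard` is a hypothetical name.

```python
import zlib


def pick_shard(routing_value: str, number_of_primary_shards: int = 5) -> int:
    """shard = hash(routing) % number_of_primary_shards, with a stand-in hash."""
    digest = zlib.crc32(routing_value.encode("utf-8"))  # NOT the hash ES uses
    return digest % number_of_primary_shards


for doc_id in ("1001", "1002", "1003"):
    print(doc_id, "-> shard", pick_shard(doc_id))
```

The same routing value always maps to the same shard, which is why the number of primary shards cannot simply be changed after the index has data in it.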

#Searching

##Relevancy & Scoring

A score is calculated for each document that matches the query. The higher the score, the more relevant the document.

Queries may apply to 2 different contexts:

  • Query Context: Documents are matched and ranked by the score they achieve
  • Filter Context: Documents match only if they satisfy the filter requirements

##Ways of Searching

###Query String

The query is defined in the URI of the request itself. All fields are searched by default. Suited to simple queries.

Request: GET http://127.0.0.1:9200/ecommerce/product/_search?q=jellytime
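
Query strings need URL encoding as soon as they contain spaces or colons, so building the URI programmatically can be safer. A minimal sketch with Python's standard library, where `build_search_url` is a hypothetical helper:

```python
from urllib.parse import urlencode


def build_search_url(index, doc_type, query, base_url="http://127.0.0.1:9200"):
    """Build a query-string search URI with the q parameter safely encoded."""
    return f"{base_url}/{index}/{doc_type}/_search?{urlencode({'q': query})}"


print(build_search_url("ecommerce", "product", "jellytime"))
# Restricting the search to one field uses the field:value syntax
print(build_search_url("ecommerce", "product", "name:jelly time"))
```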

###Query DSL

The query is defined in the request body as JSON. This is the common way to write advanced queries.

Request: GET http://127.0.0.1:9200/ecommerce/product/_search

Body:

{
  "query": {
    "match": {
      "name": "jellytime"
    }
  }
}
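
When queries are generated from code, building the body as a dictionary and serializing it guarantees valid JSON. A small sketch, with `match_query` as a hypothetical helper:

```python
import json


def match_query(field: str, text: str) -> dict:
    """Query DSL body for a match query on a single field."""
    return {"query": {"match": {field: text}}}


body = json.dumps(match_query("name", "jellytime"), indent=2)
print(body)
```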

##Types of Queries

###Leaf Queries

Look for particular values in particular fields. The simplest unit of a query.

###Compound Queries

Wrap leaf queries or other compound queries. Useful for combining multiple queries.

###Full text Queries

Run full text queries against full text fields, for example searching for the body of an email in an index of emails. We can apply a specific analyzer to get different behaviour depending on our needs.

###Term level Queries

Look for exact matches of values, usually on years, numbers, etc. A document whose field contains pasta! would not be found by a term level query for pasta on that field. On the other hand, a full text query using the default analyzer would find the document.
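
The pasta! example can be modelled in a few lines. The sketch below is a toy: `analyze` is a rough stand-in for the default analyzer (lowercase, split on non-word characters), and the two match functions are hypothetical helpers that compare against a field stored without analysis and with analysis respectively.

```python
import re


def analyze(text: str) -> list:
    """Rough stand-in for the default analyzer: lowercase and split into words."""
    return re.findall(r"\w+", text.lower())


def term_match(stored_value: str, query: str) -> bool:
    """Term level query on a not-analyzed field: the exact string must match."""
    return stored_value == query


def full_text_match(stored_value: str, query: str) -> bool:
    """Full text query: both sides are analyzed, then terms are compared."""
    indexed_terms = set(analyze(stored_value))
    return any(term in indexed_terms for term in analyze(query))


print(term_match("pasta!", "pasta"))       # exact comparison
print(full_text_match("pasta!", "pasta"))  # comparison after analysis
```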

###Joining Queries

ES provides 2 forms of join designed for horizontal scalability

  • Nested query: Queries each specific object inside Nested fields
  • has_child/has_parent query: Queries documents that have a parent-child relationship, so we can apply clauses to the parent or to the child.
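
As an illustration of the first form, this sketch builds the body of a nested query. It assumes the categories field of the product Documents has been mapped with the Nested type, which the default dynamic mapping would not do; `nested_query` is a hypothetical helper.

```python
import json


def nested_query(path: str, field: str, text: str) -> dict:
    """Body of a nested query matching one field inside a Nested object."""
    return {
        "query": {
            "nested": {
                "path": path,  # the Nested field to look into
                "query": {"match": {f"{path}.{field}": text}},
            }
        }
    }


# Assumption: "categories" is mapped as Nested in the ecommerce index
print(json.dumps(nested_query("categories", "name", "Toys"), indent=2))
```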

###Geo Queries

##Queries Context

There are two query contexts

###Query Context

Queries in query context affect the relevance score of Documents, depending on how well they match.

###Filter Context

Queries in filter context do not affect relevance score. They are used to exclude Documents from the results if they do not satisfy the requirements.
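
Both contexts can appear in the same request. A common way to combine them is a bool query, where clauses under must run in query context and clauses under filter run in filter context. The sketch below builds such a body; `scored_and_filtered` is a hypothetical helper and the field names come from the product example earlier in these notes.

```python
import json


def scored_and_filtered(text: str, status: str) -> dict:
    """bool query: 'must' contributes to the score, 'filter' only includes or excludes."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"name": text}}],         # query context
                "filter": [{"term": {"status": status}}],    # filter context
            }
        }
    }


print(json.dumps(scored_and_filtered("jellytime", "active"), indent=2))
```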

##Aggregations

Aggregations are a way of grouping and extracting statistics from your data.

Similar to GROUP BY in MySQL

ES allows you to execute a search and get its hits as usual while also returning aggregated results in the same response. These aggregated results are separate from the hits, so we can get different groups of data in a single request.

There are a few types of aggregations:

  1. Metric
  2. Bucket
  3. Pipeline

Metric aggregations work on values extracted from the aggregated documents. Most of them output a single numeric value, while others return several numeric values.
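
To make the Bucket and Metric ideas concrete, the sketch below pairs a request body (a terms bucket on status with an avg metric on quantity inside it) with a local Python model of what that combination computes. The aggregation names by_status and avg_quantity and the helper `terms_with_avg` are made up for illustration.

```python
from collections import defaultdict

# Request body: "size": 0 skips the hits and returns only the aggregations
AGG_REQUEST_BODY = {
    "size": 0,
    "aggs": {
        "by_status": {
            "terms": {"field": "status"},                              # bucket
            "aggs": {"avg_quantity": {"avg": {"field": "quantity"}}},  # metric
        }
    },
}


def terms_with_avg(documents, bucket_field, metric_field):
    """Local model: group documents by one field, average another per group."""
    buckets = defaultdict(list)
    for doc in documents:
        buckets[doc[bucket_field]].append(doc[metric_field])
    return {
        key: {"doc_count": len(values), "avg": sum(values) / len(values)}
        for key, values in buckets.items()
    }


docs = [
    {"status": "active", "quantity": 58},
    {"status": "active", "quantity": 2},
    {"status": "inactive", "quantity": 10},
]
print(terms_with_avg(docs, "status", "quantity"))
```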
