@danielpsf
Last active June 10, 2019 19:32
ElasticSearch Definitive Guide's notes

Notes on Elasticsearch: The Definitive Guide

Chapter 1. You know, for search

  • RESTful web service on top of Apache Lucene
  • Has many clients, which use either the TransportClient or HTTP
    • The TransportClient is scheduled to be removed in Elasticsearch 8.0
  • Has two kinds of query mechanisms
    • Query string
    • Query DSL
  • An index (noun) can be thought of as a SQL database
  • To index (verb) is the act of inserting data into an index

Example of commands to build an index, index data and then query it

HR has requested an employee directory for Megacorp that has to:

  • Enable data to contain multi value tags, numbers, and full text.
  • Retrieve the full details of any employee.
  • Allow structured search, such as finding employees over the age of 30.
  • Allow simple full-text search and more-complex phrase searches.
  • Return highlighted search snippets from the text in the matching documents.
  • Enable management to build analytic dashboards over the data.

Add an employee

PUT /megacorp/employee/1
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}
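
From a script, the same indexing request could be assembled as sketched below. This uses only Python's standard library; the base URL `http://localhost:9200` and the helper name `build_index_request` are assumptions made for illustration, and the request is built but never sent.

```python
import json
import urllib.request

# Assumed address of a local node; adjust for your own cluster.
BASE_URL = "http://localhost:9200"


def build_index_request(index, doc_type, doc_id, document):
    """Build (but do not send) the PUT request that indexes one document."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


employee = {
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
    "interests": ["sports", "music"],
}
request = build_index_request("megacorp", "employee", 1, employee)
# urllib.request.urlopen(request) would send it to a running node.
```

The path mirrors the console command: index name, then type, then document id.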

Fetch employee by id

GET /megacorp/employee/1

Fetch all employees

GET /megacorp/employee/_search

Fetch all employees filtered by last name using a query string

GET /megacorp/employee/_search?q=last_name:Smith

Fetch all employees filtered by last name using the query DSL

GET /megacorp/employee/_search
{
  "query": {
    "match": {
      "last_name": "Smith"
    }
  }
}
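
The two styles differ mainly in where the query lives: in the URL for a query string, in a JSON body for the query DSL. A small sketch of that difference follows; the helper names are hypothetical and nothing is sent to a cluster.

```python
import json
from urllib.parse import urlencode

SEARCH_PATH = "/megacorp/employee/_search"


def query_string_search(field, value):
    """Return the search path with the query carried in the ?q= parameter."""
    return SEARCH_PATH + "?" + urlencode({"q": f"{field}:{value}"})


def dsl_search(field, value):
    """Return the search path and a JSON body holding a match query."""
    body = {"query": {"match": {field: value}}}
    return SEARCH_PATH, json.dumps(body)


lite_url = query_string_search("last_name", "Smith")
dsl_path, dsl_body = dsl_search("last_name", "Smith")
```

Note that `urlencode` percent-encodes the colon, so the URL ends in `q=last_name%3ASmith`. The DSL form is more verbose but leaves room for the nested clauses that later examples rely on.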

Fetch all employees filtered by last name and aged 30 or older

GET /megacorp/employee/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "last_name": "Smith"
          }
        }
      ],
      "filter": {
        "range": {
          "age": {
            "gte": 30
          }
        }
      }
    }
  }
}
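
The query above uses `gte`, so an employee who is exactly 30 is included; the requirement "over the age of 30" would strictly call for `gt`. The sketch below builds both variants. The function name and its `inclusive` flag are hypothetical helpers, not part of any client library.

```python
def last_name_and_min_age(last_name, age, inclusive=True):
    """Build a bool query: a match on last_name plus a range filter on age."""
    operator = "gte" if inclusive else "gt"
    return {
        "query": {
            "bool": {
                # must: the clause has to match the document
                "must": [{"match": {"last_name": last_name}}],
                # filter: a yes/no condition applied to the candidates
                "filter": {"range": {"age": {operator: age}}},
            }
        }
    }


thirty_or_older = last_name_and_min_age("Smith", 30)
strictly_over_thirty = last_name_and_min_age("Smith", 30, inclusive=False)
```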

Fetch all employees whose about field mentions rock, climbing, or both

GET /megacorp/employee/_search
{
  "query": {
    "match": {
      "about": "rock climbing"
    }
  }
}

Fetch all employees whose about field contains the exact phrase rock climbing

GET /megacorp/employee/_search
{
  "query": {
    "match_phrase": {
      "about": "rock climbing"
    }
  }
}
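
The difference between match and match_phrase can be illustrated with a toy model. This is a deliberately simplified sketch, not how Elasticsearch analyzes or scores text: it lowercases, splits on whitespace, and returns only true or false. The second sample sentence is a made-up employee.

```python
def tokens(text):
    """Lowercase and split on whitespace; a stand-in for real analysis."""
    return text.lower().split()


def toy_match(text, query):
    """True if the text contains at least one of the query terms."""
    words = set(tokens(text))
    return any(term in words for term in tokens(query))


def toy_match_phrase(text, query):
    """True if the query terms appear next to each other, in order."""
    words, phrase = tokens(text), tokens(query)
    return any(
        words[i:i + len(phrase)] == phrase
        for i in range(len(words) - len(phrase) + 1)
    )


about_fields = [
    "I love to go rock climbing",     # the about text indexed earlier
    "I like to collect rock albums",  # hypothetical second employee
]
```

Under this model both sentences satisfy `toy_match(..., "rock climbing")`, because each contains "rock", but only the first satisfies `toy_match_phrase`.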

Fetch all employees whose about field contains the phrase rock climbing, with the matches highlighted

GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    },
    "highlight": {
        "fields" : {
            "about" : {}
        }
    }
}

Fetch all interests across all employees

GET /megacorp/employee/_search
{
  "aggs": {
    "all_interests": {
      "terms": {
        "field": "interests.keyword"
      }
    }
  }
}

Fetch the average age per interest across all employees

GET /megacorp/employee/_search
{
  "aggs": {
    "all_interests": {
      "terms": {
        "field": "interests.keyword"
      },
      "aggs": {
        "avg_age": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}
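
Conceptually, the nested aggregation first buckets documents by interest and then averages the age inside each bucket. A plain-Python sketch of that idea follows; the employees other than the first are invented, and the output shape only loosely imitates a real response.

```python
from collections import defaultdict

# Hypothetical employees; only the first one appears in these notes.
employees = [
    {"age": 25, "interests": ["sports", "music"]},
    {"age": 32, "interests": ["music"]},
    {"age": 35, "interests": ["forestry"]},
]


def average_age_per_interest(docs):
    """Bucket documents by each interest, then average the ages per bucket."""
    buckets = defaultdict(list)
    for doc in docs:
        # A document with several interests lands in several buckets.
        for interest in doc["interests"]:
            buckets[interest].append(doc["age"])
    return {
        interest: {"doc_count": len(ages), "avg_age": sum(ages) / len(ages)}
        for interest, ages in buckets.items()
    }


result = average_age_per_interest(employees)
```

Here `result["music"]` holds two documents with an average age of 28.5, because the first two employees both list music.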

Chapter 2. Life Inside a Cluster

Scaling

Most databases (MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, etc.) benefit most from vertical scaling (a bigger machine), and scaling them horizontally (more machines) often means tweaking the client application at least a little to make it work as expected. Elasticsearch, on the other hand, is built from the ground up to be scalable and highly available, which means your application doesn't need to handle any of the cumbersome tasks that normal databases require to scale out or in (horizontally).

Cluster

An Elasticsearch node can play several roles at once. A cluster is one or more nodes that share the same cluster.name property.

A master node is in charge of managing cluster-wide operations, such as creating or deleting an index, or adding or removing a node from the cluster.

A master node is not in charge of document-level changes or searches, which means that having a single master node does not necessarily create a bottleneck.

Users can talk to any node in the cluster, including the master. Every node knows where each document lives and can forward the request directly to the nodes that hold the data. Whichever node receives the request takes on the burden of gathering the responses from the node or nodes holding the data and then returning the final response to the client.
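
The forwarding behavior can be pictured with a toy model in which the receiving node fans the request out and merges the answers. Everything here (node names, document layout, function names) is hypothetical and leaves out shards, replicas, and networking entirely.

```python
# Hypothetical layout: which documents each node holds.
NODES = {
    "node-1": {"1": {"last_name": "Smith"}},
    "node-2": {"2": {"last_name": "Smith"}, "3": {"last_name": "Fir"}},
}


def search_on_node(node, last_name):
    """Work done locally by one node that holds data."""
    return [
        doc_id
        for doc_id, doc in NODES[node].items()
        if doc["last_name"] == last_name
    ]


def coordinate(receiving_node, last_name):
    """The node that received the request fans it out and merges results."""
    hits = []
    for node in NODES:  # forward to every node holding data
        hits.extend(search_on_node(node, last_name))
    return {"handled_by": receiving_node, "hits": sorted(hits)}


via_node_1 = coordinate("node-1", "Smith")
via_node_2 = coordinate("node-2", "Smith")
```

Whichever node the client contacts, the merged hits are the same; only the node doing the gathering changes.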

Health

Retrieving the cluster health is as easy as querying data.

GET /_cluster/health

Among the fields in the result below, the most interesting is status.

{
  "cluster_name": "elasticsearch",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 8,
  "active_shards": 8,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 5,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 61.53846153846154
}

Statuses

Status   Description
green    All primary and replica shards are active
yellow   All primary shards are active, but not all replica shards are active
red      Not all primary shards are active
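
The status rules reduce to two checks, sketched below with hypothetical function names. The percentage helper assumes the reported value is active shards divided by all shards, which is consistent with the sample response (8 active, 5 unassigned).

```python
def cluster_status(unassigned_primaries, unassigned_replicas):
    """Map counts of inactive shards to green, yellow, or red."""
    if unassigned_primaries > 0:
        return "red"
    if unassigned_replicas > 0:
        return "yellow"
    return "green"


def active_percent(active_shards, inactive_shards):
    """Share of shards that are active, as a percentage."""
    total = active_shards + inactive_shards
    return 100.0 * active_shards / total if total else 100.0


# Sample response: 8 active shards and 5 unassigned ones, which the
# yellow status implies are all replicas.
status = cluster_status(unassigned_primaries=0, unassigned_replicas=5)
percent = active_percent(active_shards=8, inactive_shards=5)
```

With those inputs the status is yellow and the percentage rounds to 61.54, in line with active_shards_percent_as_number in the response.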

Add an Index

An index is nothing more than a logical namespace that points to one or more physical shards.

A shard is a low-level worker unit that holds just a slice of all the data in the index. Each shard is a single instance of Apache Lucene, meaning it is a complete search engine in its own right.

The number of primary shards in an index is fixed at the time the index is created, but the number of replica shards can be changed at any time.

To create an index without letting Elasticsearch assume the default configuration (five primary shards in Elasticsearch versions before 7.0), you can use the command below

PUT /blogs
{
   "settings" : {
      "number_of_shards" : 3,
      "number_of_replicas" : 1
   }
}
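
Because number_of_replicas counts copies per primary shard, these settings produce six shards in total: three primaries plus one replica of each. A short sketch of that arithmetic, using hypothetical helper names:

```python
import json


def index_settings(primaries, replicas):
    """Build the settings body sent when creating an index."""
    return {
        "settings": {
            "number_of_shards": primaries,
            "number_of_replicas": replicas,
        }
    }


def total_shards(primaries, replicas):
    """Each primary has `replicas` copies: primaries * (1 + replicas)."""
    return primaries * (1 + replicas)


body = json.dumps(index_settings(3, 1))
shard_count = total_shards(3, 1)
```

On a single-node cluster the three replicas have nowhere to go, since a replica is not placed on the same node as its primary, which is why such a cluster reports yellow rather than green.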

To start a second node on the same machine, edit {elasticsearch_home}/config/elasticsearch.yml and add the property node.max_local_storage_nodes: INT_GREATER_THAN_1 (any integer greater than 1), as recommended in the documentation.
