Elastic

The Elastic Stack

Intro

Welcome to Friday afternoon L&D. This afternoon is going to be on the Elastic Stack. For those of you who don't know, the Elastic Stack is what was previously referred to as ELK (ElasticSearch, Logstash and Kibana). The name was changed because it's more of a stack of applications now as opposed to 3 specific applications.

This session is going to be unstructured, there is no presentation; it's going to be switching between notes, the browser, some code, config files and me rambling.

If you have questions PLEASE shout out when the question enters your brain, don't wait until the end or a break in a sentence, and there is no such thing as an irrelevant question...learning things like this can be thought of as a foundation of bricks; if there is a brick missing this could lead to misunderstandings further down the line.

Second Intro

As mentioned Elastic is a stack of applications. It is made up of Kibana, ElasticSearch, Beats and Logstash.

Kibana allows us to visualise our data, ElasticSearch allows us to store, search and analyse our data. Lastly, Beats and Logstash ingest the data from a source.

Data Pipeline tranquility :O)

https://www.elastic.co/products

We're going to be looking into these things from a high level; some topics will then be looked at in a bit more depth.

I've got three demos, one article and a ton of notes, graphs and other cool things to show you.

P.S. I'll try and keep the DYKWIM to a minimum.

ElasticSearch

My go-to NoSQL database for streaming data pipelines, and it has been for nearly a year now.

Elasticsearch is a distributed, JSON-based search and analytics engine designed for horizontal scalability, maximum reliability, and easy management.

It provides lightning fast search and discovery. It's scalable: you start small and increase in size when needed.

It supports aggregations, log analysis, geolocation data and machine learning.

The things we store in ES are referred to as Documents. It's a Document, a JSON Document, a payload of JSON.

Documents are stored in an Index. For example, you can have an index for customer data, another index for a product catalog, and yet another index for order data. An index is identified by a name (that must be all lowercase) and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it. A Document can be thought of as a Row, and the fields in that Document can be thought of as the columns.

ES is distributed by nature, it runs on multiple machines to make up a Cluster and each machine is called a Node. Every Node performs indexing operations in order to index all documents that have been added to ES. All nodes participate in search and analysis operations. Any search query that you run will be run on multiple nodes in parallel. Each Node has a unique ID and name.

A collection of Nodes is called a Cluster. Any index of documents that is created is stored within that cluster, and the documents in it are searchable across the cluster. A cluster has a unique name. The way you scale the number of nodes in a cluster is to have multiple nodes join the same cluster by specifying the name of the cluster. Nodes in the cluster automatically find each other by sending each other messages; the nodes have to be on the same network.

ES comes with a powerful RESTful API out-of-the-box which can be used for CRUD, monitoring and other operations.
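
For example, cluster health and node info come straight off that API; a quick sketch with curl, assuming a local node on the default port:

curl -X GET "localhost:9200/_cluster/health?pretty"
curl -X GET "localhost:9200/_cat/nodes?v"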

Powerful query DSL which allows users to write complex queries.

It's very easy to get up and running; it can run from the command line, Docker, or K8s!

Compatible with lots of the big data tools.

Schema free, the data doesn’t need a schema!

ES and Kibana are available as a service on AWS.

ES has a ton of use cases, on a shopping site it can be used to index your product catalog, inventory or provide auto-complete for users.

On a video hosting site it could be used to index the title of every video clip, the categories they fall under, tags that are associated with every video.

It can be used as an analysis engine, mining log data to extract insights.

As a price alerting platform, it can alert when a threshold is reached.

ES provides near real-time search. There is very low latency between the time that a document is indexed and when it becomes available for searching; this time is around 1 second.
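
That latency is controlled by the index refresh interval, which defaults to one second and can be tuned per index. A quick sketch, reusing the rabs index from later in these notes:

PUT rabs/_settings
{
    "index" : {
        "refresh_interval" : "5s"
    }
}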

It supports data aggregation on huge datasets to find trends and patterns.
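
As a small taste of aggregations, here's a sketch that averages ticket prices across the kibana_sample_data_flights sample index (which we use again further down):

GET /kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "avg_ticket_price": {
      "avg": { "field": "AvgTicketPrice" }
    }
  }
}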

It comes with a Hadoop connector to provide seamless communication between Elasticsearch and Hadoop.

ElasticSearch APIs

ElasticSearch is made up of several APIs which allow interaction with ElasticSearch over HTTP. The main ones are as follows:

  • Document APIs - Uploading documents, deleting documents, etc
  • Search APIs - For performing searches against the documents stored in ElasticSearch
  • Indices APIs - The indices APIs are used to manage individual indices, index settings, aliases, mappings, and index templates.

I'm going to go into a couple of examples from the Indices, Document and Search APIs.

Demo

Open Postman and demo the basic endpoints.

Indices APIs

One of the first things you need to do when you get ES up and running is create a home for your data, and that's the Index. If you're running pipelines/applications that feed into ES there's a good chance that the Index will be created automatically for you.

If not, we can use curl.

curl -X PUT "localhost:9200/rabs"

The default number of shards created is 5 and the default number of replicas is 1. We can specify these when creating the Index.

curl -X PUT "localhost:9200/rabs" -H 'Content-Type: application/json' -d'
{
    "settings" : {
        "index" : {
            "number_of_shards" : 3, 
            "number_of_replicas" : 2 
        }
    }
}
'

We can create type mappings for our fields. Say we start tracking the location of our customers; we could specify that the field is a geo_point type when creating the index.

curl -X PUT "localhost:9200/rabs" -H 'Content-Type: application/json' -d'
{
    "settings" : {
        "number_of_shards" : 4
    },
    "mappings" : {
        "_bet" : {
            "properties" : {
                "location" : { "type" : "geo_point" }
            }
        }
    }
}
'

Now's a good time to talk about Shards and Replicas.

Sharding and Replicas

When we create an Index we get the option to specify settings for our Index.

"index" : {
  "number_of_shards" : 2,
  "number_of_replicas" : 1
}

Here we are specifying that our Index has 2 Shards and 1 Replica.

So what are Shards and Replicas?

Shards can be thought of as partitions, so if we specify 2 Shards the data will be split into 2 partitions.

The basic idea is that Elasticsearch will store each shard on a separate data node to increase the scalability.

During document search Elasticsearch will aggregate all documents from all available shards to consolidate the results so as to fulfil a user search request. It is totally transparent to the user.

So the concept is that an Index can be divided into multiple shards and each shard can be hosted on a separate data node. The placement of shards is taken care of by Elasticsearch itself.

If we don't specify the number of shards in the index creation URL, Elasticsearch will create five shards per index by default.

If we create an Index with 1 Replica, Elasticsearch will create one copy (replica) of each shard and place each replica on a separate data node from the shard it was copied from.

So once this happens we have a Primary Shard and a Replica Shard.

During a high volume of search activity, Elasticsearch can provide query results either from primary shards or from replica shards placed on different data nodes. This is how Elasticsearch increases the query throughput because each search query may go to different data nodes.
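
To see how the shards and replicas of an index have actually been allocated across the nodes, the _cat API gives a quick view; a sketch for the rabs index:

GET _cat/shards/rabs?v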

The Meetup Photo Stream

I've built a Kotlin application which streams data from the Meetup Photo Stream, performs a tiny bit of enrichment and dumps it in ElasticSearch so we can then view where the photos are being taken using the GeoLocation visualization.

First of all we create the Index.

PUT http://localhost:9200/meetup_event_photos

{
  "mappings": {
    "_photo": {
      "properties": {
        "photo_album.group.location": {
          "type": "geo_point"
        }
      }
    }
  }
}

Once we've done this the Index will be created in ES.

If we ran the application before creating the Index, the photo_album.group.location field would have been a String type which can't be used by the GeoLocation map.
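
If in doubt, the mapping can be checked to confirm the field came through as a geo_point rather than a String; a quick sketch:

GET meetup_event_photos/_mapping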

Now the application can be started.

Points to get across here:

  • Basic Kibana Navigation
  • Creating an Index Pattern
  • The Discover page
  • Time Ranges
  • How to navigate the Discover Page
  • The Visualisation section

Document APIs

The Document APIs consist of the Single document APIs and the Multi-document APIs.

Single document APIs

Index API The index API adds or updates a typed JSON document in a specific index, making it searchable. The following example inserts a JSON document into the "rabs" index, under a type called _bet, with an id of 1236152731:

PUT rabs/_bet/1236152731  
{
  "user": "dash8789",
  "placement_timestamp": "2009-11-15T14:12:12",
  "number_of_Selections": 3
}

If that operation is a success we'll get a payload back like this.

{
    "_shards" : {
        "total" : 2,
        "failed" : 0,
        "successful" : 2
    },
    "_index" : "rabs",
    "_type" : "_bet",
    "_id" : "1236152731",
    "_version" : 1,
    "_seq_no" : 0,
    "_primary_term" : 1,
    "result" : "created"
}

In some instances, we might not know the id of the Document; in that case we can substitute the PUT for a POST request.

POST rabs/_bet/
{
  "user": "lukeynumber7",
  "placement_timestamp": "2019-01-01T14:12:12",
  "number_of_Selections": 4
}

If we do this we'll get the Id back in the response.

{
    "_shards" : {
        "total" : 2,
        "failed" : 0,
        "successful" : 2
    },
    "_index" : "rabs",
    "_type" : "_bet",
    "_id" : "W0tpsmIBdwcYyG50zbta",
    "_version" : 1,
    "_seq_no" : 0,
    "_primary_term" : 1,
    "result": "created"
}

Get API The get API allows us to get a typed JSON document from the index based on its id. The following example gets a JSON document from an index called rabs, under a type called _bet, with the id W0tpsmIBdwcYyG50zbta:

GET rabs/_bet/W0tpsmIBdwcYyG50zbta

The response includes the document.

{
    "_index" : "rabs",
    "_type" : "_bet",
    "_id" : "W0tpsmIBdwcYyG50zbta",
    "_version" : 1,
    "_seq_no" : 10,
    "_primary_term" : 1,
    "found": true,
    "_source" : {
	  "user": "lukeynumber7",
	  "placement_timestamp": "2019-01-01T14:12:12",
	  "number_of_Selections": 4
    }
}

There are times when we won't require the _source and just need to make sure the Document exists; we can add a flag on the request to achieve this.

GET rabs/_bet/W0tpsmIBdwcYyG50zbta?_source=false

We can specify which fields are returned and which fields are left out from the _source if the source is actually returned.

GET rabs/_bet/W0tpsmIBdwcYyG50zbta?_source_includes=*.user&_source_excludes=number_of_Selections

Other operations include delete, update, multi get, bulk operations.
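
As a rough sketch of a couple of those, here's a delete and a partial update against the document indexed above (same index, type and id):

DELETE rabs/_bet/1236152731

POST rabs/_bet/1236152731/_update
{
  "doc": { "number_of_Selections": 5 }
}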

Search API

The Search API allows you to execute a search query and get back search hits that match the query. The query can either be provided using a simple query string as a parameter, or using a request body.

The examples below use the kibana_sample_data_flights index, which ships with Kibana's sample data.

Here we are returning all the documents from the kibana_sample_data_flights index. The q=* parameter instructs ES to match all documents in the index.

GET /kibana_sample_data_flights/_search?q=*

Here we are sorting by OriginCityName in ascending order.

GET /kibana_sample_data_flights/_search?q=*&sort=OriginCityName:asc&pretty

The result of a Search is shown below.

{
  "took" : 24,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 13059,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "kibana_sample_data_flights",
        "_type" : "_doc",
        "_id" : "bS4HGGgBqdWvd8XgnyjR",
        "_score" : null,
        "_source" : {
          "FlightNum" : "6YG50L0",
          "DestCountry" : "JP",
          "OriginWeather" : "Cloudy",
          "OriginCityName" : "Abu Dhabi",
          "AvgTicketPrice" : 1056.2176972274638,

We can perform searches across multiple indices.

GET /deposits,withdrawals/_search?q=amount:1000

There are a ton of flags we can add to the URI search. A URI search isn't the best way to search, but it's useful for a quick curl. It's advised when carrying out more advanced searches to use the Request Body search.
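
A quick sketch of some of those URI flags, run against the sample flights index (field names taken from the sample data):

GET /kibana_sample_data_flights/_search?q=DestCountry:JP&size=5&from=10&sort=AvgTicketPrice:desc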

When using the Request Body Search we combine the Search API and the Query DSL. The Query DSL is included in the body.

GET /rabs/_search
{
  "query": {
    "term": { "betId": "12312ABSVA" }
  }
}

Here is the response.

{
    "took": 1,
    "timed_out": false,
    "_shards":{
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
    },
    "hits":{
        "total" : 1,
        "max_score": 1.3862944,
        "hits" : [
            {
                "_index" : "rabs",
                "_type" : "_bet",
                "_id" : "0",
                "_score": 1.3862944,
                "_source" : {
                  "betId: "12312ABSVA",
	          "user": "lukeynumber7",
	          "placement_timestamp": "2019-01-01T14:12:12",
	          "number_of_Selections": 4
                }
            }
        ]
    }
}

The Python Pipeline

The Python Pipeline simulates deposits happening every second with an amount between £1 and £1000. We can then do some cool things with time series.

{
  'user_currency_amount': 563,
  'currency_code': 'GBP',
  'payment_type': 'deposit',
  'timestamp': datetime.datetime(2019, 1, 30, 9, 17, 57, 638066)
}
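
One of those cool things is bucketing the deposits over time; a sketch of a date_histogram aggregation, assuming the documents above land in the deposits index:

GET /deposits/_search
{
  "size": 0,
  "aggs": {
    "deposits_per_minute": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "1m"
      },
      "aggs": {
        "total_deposited": { "sum": { "field": "user_currency_amount" } }
      }
    }
  }
}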

Query DSL

I'm going to talk about the Query DSL for a little bit. Elasticsearch provides a full Query DSL (Domain Specific Language) based on JSON to define queries.

The Query DSL consists of two types of clauses: leaf query clauses and compound query clauses.

A leaf query uses a match, term or range which searches for a particular value in a given field.

A compound query combines leaf queries and other compound query clauses to execute a search.

The clauses behave differently depending on if they are used in a query context or a filter context.

A query clause used in a query context answers the question "How well does this document match this query clause?" Apart from deciding whether the document matches, a score is calculated to indicate how well the document matches relative to other documents.

A query clause in a filter context answers the question "Does this document match this query clause?" It's either yes or no, with no score. We would use a filter context for filtering structured data, e.g. "Is the status of this bet PLACED?" or "Does the deposit fall into the range of 500 to 1000 pounds?"
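
The two contexts are usually combined in a bool query, where must clauses run in query context (scored) and filter clauses run in filter context (yes/no). A rough sketch using the deposit fields from the Python pipeline:

GET /deposits/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "currency_code": "GBP" } }
      ],
      "filter": [
        { "term": { "payment_type": "deposit" } },
        { "range": { "user_currency_amount": { "gte": 500, "lt": 1000 } } }
      ]
    }
  }
}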

Match

Below is a match_all query that is used in the Request Body.

A match_all query is used when you want to search through all the documents in a given Index.

GET /kibana_sample_data_flights/_search
{
  "query": {
    "match_all": {}
  }
}

Focussing on the query part: the query tells us what the query definition is, and inside it, match_all is the query we want to run.

The match_all query is simply a search for all documents in the specified index.

We can pass in other values in the body such as size. Size is 10 by default.

GET /kibana_sample_data_flights/_search
{
  "query": {
    "match_all": {}
  },
  "size": 1
}

We can specify a from parameter which specifies which document index to start from. The size parameter is then used to return the number of documents starting at the from value. from is set to 0 by default.

GET /kibana_sample_data_flights/_search
{
  "query": {
    "match_all": {}
  },
  "from": 10,
  "size": 10
}

By default all the fields of the documents found are returned. We can change that by specifying the fields we want returned in the _source field.

GET /kibana_sample_data_flights/_search
{
  "query": {
    "match_all": {}
  },
  "_source": ["OriginWeather", "DestWeather"]
}

So the above queries return all the documents in the specified index. This is because the match_all query is simply a search for all documents in the specified index, as mentioned earlier. Say we want to return a specific document.

We can use the match query to do this; think of it as a basic fielded search query. Here we are searching for records that have the value "Sunny" for the "DestWeather" field.

GET /kibana_sample_data_flights/_search
{
  "query": {
    "match": { "DestWeather": "Sunny"}
  }
}

This example matches phrase prefixes, so it will return all records where the DestWeather starts with "S" (Sunny, Snowy, etc).

GET /kibana_sample_data_flights/_search
{
  "query": {
    "match_phrase_prefix": { "DestWeather": "S"}
  }
}

We can carry out a multi match that allows us to query multiple fields for the same value.

GET /kibana_sample_data_flights/_search
{
  "query": {
    "multi_match": {
      "query": "Sunny",
      "fields": ["DestWeather", "OriginWeather"]
    }
  }
}

So the records returned will have Sunny as either the Destination or Origin weather.

Term

If you want to find an exact term in a query, you use a term query. So here we are querying for an exact term, which is a Bet ID of 123ABC.

GET /rabs/_search
{
  "query": {
    "term": { "betId": "123ABC" }
  }
}

Interestingly enough though, when searching for text, the match query will perform better, as the term query is looking for an exact match. So here we are searching the Kibana Sample Data for a Flight ID of ABC123. This query will return Documents that have a Flight ID of ABC123 or abc123.

GET /kibana_sample_data_flights/_search
{
  "query": {
    "match": { "FlightId": "ABC123"}
  }
}

Range

We use a Range Query when we want to return Documents that fall into a particular range. So here we are searching for deposits that are at least 500 pounds but less than 1000.

Query:

GET /deposits/_search
{
  "query": {
    "range": {
      "amount": { "gte": 500, "lt": 1000 }
    }
  }
}

We have the following properties available:

  • gte ---> Greater than or equal to
  • gt ---> Greater than
  • lte ---> Less than or equal to
  • lt ---> Less than

Logstash Pipeline

Logstash is an open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it and then sends it to your favourite stash.

Every single Logstash pipeline has 3 stages. These 3 stages are more or less the same stages of ETL.

The data comes into the pipeline, the data is processed and the data is sent out of the pipeline.

The data thats coming in can come from numerous different places including Kafka, Redis, MySQL, HTTP etc

The data is then processed: it can be validated, formatted, and matched against patterns.

The data is then sent on, this could be to ES, a log file, Kafka, HDFS etc.
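
A minimal pipeline config sketching those three stages might look like this (the Kafka topic, field name and index name are just placeholders):

input {
  kafka {
    bootstrap_servers => "localhost:9092"
    topics => ["deposits"]
    codec => "json"
  }
}

filter {
  mutate {
    convert => { "user_currency_amount" => "integer" }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "deposits"
  }
}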

I wrote an article which I'm going to briefly walk through to show you what's going on.

Kibana Sample Dashboard

Kibana comes with 3 sample dashboards. I'm going to walk through the Web Logs one and show some of the cool charts we can build.
