iamitshri/elastic-search-research.md

## elastic-search-research.md

      
            
          
Elasticsearch use case analysis


Popularly known as the ELK stack (Elasticsearch, Logstash, Kibana)


Who uses ELK?

To be Added


Cloud Providers (Managed Infrastructure for Search Implementations)

Azure: https://azuremarketplace.microsoft.com/en-us/marketplace/apps/elastic.elasticsearch?tab=Reviews
Google Cloud: https://console.cloud.google.com/marketplace/details/click-to-deploy-images/elasticsearch
AWS Elasticsearch Service: https://aws.amazon.com/elasticsearch-service/
Elastic Cloud: https://www.elastic.co/cloud
Amazon CloudSearch (it uses Apache Solr internally)


Other options

Elasticsearch Docker containers: https://hub.docker.com/_/elasticsearch
Apache Solr


Is it open source?

Yes (the ELK stack is open source)
X-Pack is a paid feature


PROS & CONS

Elasticsearch

PROS:


Spring framework's official support

https://github.com/spring-projects/spring-data-elasticsearch


Faster Time to market

Fuzzy search
Feature rich search support
Auto complete
Aggregation
Sorting, paging, selective field retrieval
Regex based search
Ability to tweak scoring and ranking algorithms


Fully managed cloud implementations are available
Community support (Q&A on Stack Overflow, blogs, documentation)
Plenty of ways to learn this skill

Pluralsight, Lynda, Udemy, YouTube


CONS:


New infrastructure cost

It could take some time to tune the cluster for our search needs.


There is a learning curve

Staff training
ES-specific, JSON-based query language.


We still have to write Elasticsearch-specific code for:

Indexing data
Background job that upserts documents in the index in response to user activity


In-house development of database-based search features:

PROS:

When people leave, finding Java and SQL skills in the market is easier than finding Elasticsearch skills (just a guess)
Existing infrastructure is enough

CONS:

We will have to code features that ES provides out of the box:

Auto completion, paging, sorting, regex support, aggregation, etc.


Existing logic has to change each time we support a new requirement.
Search could get slow.

We will have to tune our code to make sure we don’t breach the SLA or degrade the user experience


Development & maintenance of a growing search-related codebase


Steps towards adopting Elasticsearch


Think through a search use case
Indexing: getting data into ES

Each item in the index is a document, so decide the shape of the JSON that represents a document
Map each field to the appropriate Elasticsearch data type
Create a full indexing job and an incremental indexing job that adds, updates and deletes entries
Mapping: deciding the shape of the JSON and the ES data type of each field
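
To make the mapping step concrete, here is a minimal sketch of a mapping body for the `submission` index. The field names are taken from the queries later in these notes; every data type, the `epoch_millis` date format and the choice of `keyword` for names are assumptions, not the actual EMA mapping. It uses the Elasticsearch 6.x layout with a single mapping type called `submission`.

```python
import json

def submission_mapping():
    """Build a hypothetical mapping body for PUT /submission (ES 6.x layout)."""
    evaluator = {
        "properties": {
            "employeeId": {"type": "keyword"},
            # Assumption: keyword, so an exact term query on "Jennifer" matches as typed
            "firstName": {"type": "keyword"},
            "lastName": {"type": "keyword"},
        }
    }
    evaluation = {
        # "nested" is required for nested queries/aggregations on path "evaluations"
        "type": "nested",
        "properties": {
            "evaluationId": {"type": "long"},
            "evaluatorId": {"type": "keyword"},
            "status": {"type": "keyword"},
            "evaluator": evaluator,
        },
    }
    properties = {
        "submissionId": {"type": "long"},
        "studentId": {"type": "keyword"},
        "taskId": {"type": "long"},
        "status": {"type": "keyword"},
        # Assumption: dates are stored as epoch milliseconds
        "dateCreated": {"type": "date", "format": "epoch_millis"},
        "dateUpdated": {"type": "date", "format": "epoch_millis"},
        "evaluations": evaluation,
    }
    return {"mappings": {"submission": {"properties": properties}}}

print(json.dumps(submission_mapping(), indent=2))
```

The printed JSON could be sent as the body when creating the index. Deciding `nested` versus plain object fields up front matters, because changing it later means reindexing.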


EMA: Evaluation management platform that enables evaluators to grade submissions easily.
The following items are searchable in EMA:

Assessment and task properties
EMA user properties such as roles, permissions, assigned tasks
Submission search
Evaluation search


Getting Set Up Locally

Install Elasticsearch 6.2.2
Install Kibana 6.2.2
As an alternative, you can use a Docker image that packs everything you need

https://github.com/elastic/elasticsearch-docker
https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html


Top features in Elasticsearch
Aggregation
Auto completion
Full-text search
Paging, sorting
Scoring & ranking of search results


Infrastructural action items:

Run it locally

Understand shard, index and cluster management and the challenges that come with them
Understand resource requirements

Memory
Compute
Disk space


Cost comparison
Project the cost of using ES as data size increases
Backup policy
Time to reindex everything
How to run it without downtime (availability)
Running the managed service in AWS vs. running Elasticsearch on EC2 instances


Document Management: Indexing & updating the documents:

The only dev task we need to do is to create:
- A job that builds the index.
- A job that scans the database tables to selectively insert, update or delete documents in the index.


Options to continuously feed data to the index as new data arrives:

A service that feeds data to the index?
A scheduled service that checks the database for new or updated entries.
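
As a rough illustration of what such a scheduled job could send, the sketch below turns a list of changed rows into a request body for the Elasticsearch `_bulk` API (newline-delimited JSON: an action line, then the document for index operations). The `changes` rows and the `bulk_payload` helper are hypothetical; how changed rows are detected in the database is left out.

```python
import json

INDEX = "submission"     # index name used throughout these notes
DOC_TYPE = "submission"  # mapping type, as in GET /submission/submission/_mapping (ES 6.x)

def bulk_payload(changes):
    """Turn (operation, id, document) tuples into a newline-delimited _bulk body."""
    lines = []
    for op, doc_id, doc in changes:
        meta = {"_index": INDEX, "_type": DOC_TYPE, "_id": str(doc_id)}
        if op == "delete":
            # Delete actions have no document line
            lines.append(json.dumps({"delete": meta}))
        else:
            # "index" adds the document or replaces the one with the same _id
            lines.append(json.dumps({"index": meta}))
            lines.append(json.dumps(doc))
    # The _bulk API requires the body to end with a newline
    return "\n".join(lines) + "\n"

# Hypothetical rows that the job found changed since its last run
changes = [
    ("index", 4818, {"submissionId": 4818, "taskId": 263, "status": "64"}),
    ("delete", 4819, None),
]
print(bulk_payload(changes), end="")
```

The resulting text would be POSTed to `/_bulk` with the `Content-Type: application/x-ndjson` header. Using the database primary key as `_id` keeps the job idempotent, since re-sending a row simply overwrites the same document.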


Customize the document structure as per use case
Logstash

Data ingestion tool provided by Elastic


Some practical examples for querying data:


Query Use cases

Nested query

Find submissions by evaluator name, employee ID, full name


Search Term

Starting with the search term
Ending with the search term
Containing the search term


Find by ID (number)
Find by date range
Find by regex search pattern
Fuzzy Searches
Aggregation
Auto completion


Boolean Query

field1 AND field2
field1 OR field2


Paging related

Sort ascending or descending
Size of the result
Specific page in the result set
Get only certain fields in the result set
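
Several of the use cases above map onto standard Query DSL clauses (`prefix`, `wildcard`, `fuzzy`, `bool`). The helpers below are a hedged sketch of that mapping; the helper names are made up for illustration, and `studentId` is used only because it appears elsewhere in these notes.

```python
import json

def starts_with(field, term):
    """Match values that begin with the term."""
    return {"prefix": {field: term}}

def ends_with(field, term):
    """Match values that end with the term (leading wildcards can be slow)."""
    return {"wildcard": {field: "*" + term}}

def containing(field, term):
    """Match values that contain the term anywhere."""
    return {"wildcard": {field: "*" + term + "*"}}

def fuzzy(field, term):
    """Match values within a small edit distance of the term."""
    return {"fuzzy": {field: {"value": term, "fuzziness": "AUTO"}}}

def all_of(*clauses):
    """field1 AND field2: every clause must match."""
    return {"bool": {"must": list(clauses)}}

def any_of(*clauses):
    """field1 OR field2: at least one clause must match."""
    return {"bool": {"should": list(clauses), "minimum_should_match": 1}}

body = {"query": any_of(starts_with("studentId", "0009"),
                        containing("studentId", "714"))}
print(json.dumps(body, indent=2))
```

These are term-level clauses, so they compare against the indexed terms; on analyzed text fields the stored terms are usually lowercased. Clauses on fields under `evaluations` would additionally need to be wrapped in a `nested` query.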


Date filtering


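Time is saved in UTC, as epoch milliseconds.
For example, a local time of 2018-10-15 12:59:23 is saved as 1539633563000.
With the Mountain time zone at a 7-hour offset from UTC, 12:59:23 local becomes 19:59:23 UTC, which is the value to use in queries.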
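
The conversion in the note above can be checked with a few lines of Python. This is only a sketch: the fixed 7-hour offset is taken from the example numbers, and a real job would more likely use a proper time zone database.

```python
from datetime import datetime, timedelta, timezone

# Assumption: local clock is 7 hours behind UTC, matching the example in the note
LOCAL = timezone(timedelta(hours=-7))

def to_epoch_millis(local_text, tz=LOCAL):
    """Parse 'YYYY-MM-DD HH:MM:SS' local time and return epoch milliseconds."""
    local = datetime.strptime(local_text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=tz)
    return int(local.timestamp() * 1000)

def to_utc_text(epoch_millis):
    """Format epoch milliseconds as UTC in the MM/dd/yyyy HH:mm:ss query format."""
    utc = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    return utc.strftime("%m/%d/%Y %H:%M:%S")

millis = to_epoch_millis("2018-10-15 12:59:23")
print(millis)               # 1539633563000
print(to_utc_text(millis))  # 10/15/2018 19:59:23
```

The second value is the string form that the range queries in these notes pass as `gte` and `lte`.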

Search in Date range 
GET submission/_search
{
    "query": {
        "range" : {
            "dateCreated" : {
                "gte": "10/15/2018 19:59:23",
                "lte": "10/15/2018 19:59:24",
                "format": "MM/dd/yyyy HH:mm:ss"
            }
        }
    }
}


Get all the documents  
GET /submission/_search?pretty
{
  "query": {
    "match_all": {}
  }
}


Get the mapping 
GET /submission/submission/_mapping
GET /submission/


Delete the index 
DELETE /submission

curl -XDELETE "http://localhost:9200/submission"


Create the index 
curl -XPUT "http://localhost:9200/submission/" -H 'Content-Type: application/json' -d'
{}'

PUT /submission/
{}


Nested Query Example 
curl -XGET "http://localhost:9200/submission/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "_source": false,
  "query": {
    "nested": {
      "path": "evaluations",
      "inner_hits": {},
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "evaluations.evaluationId": "56818"
              }
            },
            {
              "term": {
                "evaluations.evaluator.firstName": {
                  "value": "Jennifer"
                }
              }
            }
          ]
        }
      }
    }
  }
}'


Example: Combining Nested Query with other query 
curl -XGET "http://localhost:9200/submission/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "evaluations",
            "query": {
              "bool": {
                "must": [
                  {
                    "match": {
                      "evaluations.evaluationId": "56818"
                    }
                  },
                  {
                    "term": {
                      "evaluations.evaluator.firstName": {
                        "value": "Jennifer"
                      }
                    }
                  }
                ]
              }
            }
          }
        },
        {
          "range": {
            "dateCreated": {
              "gte": "10/15/2018 19:59:22",
              "lte": "10/15/2018 19:59:24",
              "format": "MM/dd/yyyy HH:mm:ss"
            }
          }
        }
      ]
    }
  }
}'


Aggregations 
curl -XGET "http://localhost:9200/submission/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "user_defined": {
      "nested": {
        "path": "evaluations"
      },
      "aggs": {
        "user_defined_string": {
          "stats": {
            "field": "evaluations.evaluationId"
          }
        }
      }
    }
  }
}'


Number of submissions by taskId, grouped by submission status
GET /submission/_search
{
  "size": 0,
  "query": {
    "match": {
      "taskId": "107"
    }
  },
  "aggs": {
    "number_of_submission": {
      "terms": {
        "field": "status"
      }
    }
  }
}


Number of submissions by evaluator ID, grouped by submission status
GET /submission/_search
{
  "size": 0,
  "query": {
    "match": {
      "evaluations.evaluatorId": "E00104876"
    }
  },
  "aggs": {
    "number_of_submission": {
      "terms": {
        "field": "status"
      }
    }
  }
}


Number of submissions by student ID and task ID, grouped by submission status
GET submission/_search?pretty
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "studentId": "000067181"
          }
        },
        {
          "match": {
            "taskId": "183"
          }
        }
      ]
    }
  },
  "aggs": {
    "submission_status": {
      "terms": {
        "field": "status",
        "size": 10
      }
    }
  }
}


Specifying fields in the result, size of result and the offset 
GET submission/_search?pretty
{
  "_source": [
    "evaluations.evaluationId",
    "evaluations.evaluator.employeeId",
    "evaluations.status"
  ],
  "from": 0,
  "size": 10,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "studentId": "000971426"
          }
        },
        {
          "match": {
            "taskId": "263"
          }
        }
      ]
    }
  }
}
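
The `from` and `size` values in the request above correspond to a page number in the usual way: `from = (page - 1) * size`. A small sketch of a helper that builds such a request body follows; `page_request` is a hypothetical name, not part of any client library.

```python
import json

def page_request(query, page, page_size=10, fields=None):
    """Build a search body for a 1-based page number.

    Note: by default Elasticsearch rejects requests where from + size
    exceeds 10,000 (the index.max_result_window setting).
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    body = {
        "from": (page - 1) * page_size,  # offset of the first hit on this page
        "size": page_size,
        "query": query,
    }
    if fields:
        # Return only the listed fields instead of the whole document
        body["_source"] = list(fields)
    return body

query = {"match": {"studentId": "000971426"}}
body = page_request(query, page=3, fields=["evaluations.evaluationId"])
print(json.dumps(body, indent=2))
```

For page 3 with the default page size of 10, the body carries `"from": 20`.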


EMA existing search features support

Find submissions by student ID
Find submissions by submission ID
Find submissions by evaluator first name
Find submissions by evaluator last name
Find submissions by submission status

Find submissions submitted within:

	- the last 24 hours (1 day)
	- the last 72 hours (3 days)
	- the last 7 days
	- the last 30 days


GET submission/_search?pretty
{
  "from": 0,
  "size": 10,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "studentId": "000971426"
          }
        },
        {
          "match": {
            "submissionId": "4818"
          }
        },
        {
          "match": {
            "evaluations.evaluator.lastName": "Widick"
          }
        },
        {
          "match": {
            "evaluations.evaluator.firstName": "Lariann"
          }
        },
        {
          "match": {
            "taskId": "263"
          }
        },
        {
          "match": {
            "status": "64"
          }
        },
        {
          "range": {
           
            "dateUpdated": {
              "gte": "17/09/2018 05:54:37",
              "lte": "17/09/2018 05:54:37",
              "format": "dd/MM/yyyy HH:mm:ss"
            }
          }
        }
      ]
    }
  }
}
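
For the "submitted within the last N hours or days" filters, the fixed timestamps in the range clause above could be replaced with Elasticsearch date math (`now-24h`, `now-7d`, and so on). The sketch below shows one way to build that clause; the window labels and the choice of `dateCreated` as the "submitted" field are assumptions.

```python
import json

# Window label -> Elasticsearch date math expression for the lower bound
WINDOWS = {
    "24h": "now-24h",
    "72h": "now-72h",
    "7d": "now-7d",
    "30d": "now-30d",
}

def submitted_within(window, field="dateCreated"):
    """Build a range clause matching documents dated within the given window."""
    if window not in WINDOWS:
        raise ValueError("unknown window: " + window)
    return {"range": {field: {"gte": WINDOWS[window], "lte": "now"}}}

body = {
    "query": {
        "bool": {
            "must": [
                {"match": {"studentId": "000971426"}},
                submitted_within("72h"),
            ]
        }
    }
}
print(json.dumps(body, indent=2))
```

Because `now` is resolved by Elasticsearch in UTC at query time, the caller does not need to convert local time for these relative windows.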


Evaluator workload: number of submissions handled per task

Number of submissions by task and evaluator

Number of submissions by task

Number of submissions by task, evaluator and student

Submissions per evaluator per task during a date range

Evaluations per task, broken out by evaluation status

Stats: average time spent per submission per task

Stats: time spent by evaluator per submission per task

Most evaluations

Most submissions
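
As a starting point for the workload reports listed above, here is a hedged sketch of the "submissions by task and evaluator" aggregation. It assumes `evaluations` is mapped as a nested field and that `taskId` and `evaluations.evaluatorId` are aggregatable (numeric or keyword); the aggregation names are arbitrary.

```python
import json

def submissions_by_task_and_evaluator(top=10):
    """Build a search body that buckets submissions by task, then by evaluator."""
    by_evaluator = {
        "terms": {"field": "evaluations.evaluatorId", "size": top},
        "aggs": {
            # Step back to the parent documents so the count is per
            # submission, not per nested evaluation
            "submissions": {"reverse_nested": {}}
        },
    }
    return {
        "size": 0,  # only aggregation results are needed, no hits
        "aggs": {
            "by_task": {
                "terms": {"field": "taskId", "size": top},
                "aggs": {
                    "evaluations": {
                        "nested": {"path": "evaluations"},
                        "aggs": {"by_evaluator": by_evaluator},
                    }
                },
            }
        },
    }

print(json.dumps(submissions_by_task_and_evaluator(), indent=2))
```

Adding a `range` query on a date field to the same body would restrict the counts to a date range, and swapping the inner `terms` field for `evaluations.status` would give the per-status breakdown.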