@iamitshri
Created February 1, 2019 18:34
Elasticsearch research

Elasticsearch use case analysis


PROS & CONS

Elasticsearch
PROS:
  • Official Spring support (Spring Data Elasticsearch)
  • Faster time to market, since these features come built in:
    • Fuzzy search
    • Feature-rich search support
    • Auto complete
    • Aggregation
    • Sorting, paging, selective field retrieval
    • Regex-based search
    • Ability to tweak scoring and ranking algorithms
  • Fully managed cloud offerings are available
  • Community support (Q&A on Stack Overflow, blogs, documentation)
  • Plenty of ways to learn this skill
    • Pluralsight, Lynda, Udemy, YouTube
CONS:
  • New infrastructure cost
    • It could take some time to tune the cluster for our search needs.
  • There is a learning curve
    • Staff training
    • ES-specific, JSON-based query language (Query DSL)
  • We still have to write Elasticsearch-specific code for:
    • Indexing data
    • A background job that upserts documents in the index in response to user activity

In-house development of database-backed search features:

PROS:

  • When people leave, hiring for Java and SQL skills is probably easier than hiring for Elasticsearch skills (just a guess)
  • Existing infrastructure is enough

CONS:

  • We will have to code features that ES provides out of the box:
    • Auto completion, paging, sorting, regex support, aggregation, etc.
  • We have to change existing logic each time a new requirement comes in.
  • Search could get slow.
    • We will have to tune our code so we don't breach the SLA or degrade the user experience
  • Development and maintenance of a growing search-related codebase

Steps towards using Elasticsearch

  • Think through a search use case
  • Indexing: getting data into ES
    • Each item in the index is a document, so decide the shape of the JSON that represents a document
    • Mapping: map each field of that JSON to an Elasticsearch data type
    • Create a full indexing job and an incremental indexing job that adds, updates, and deletes entries
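
As a minimal sketch of the mapping step, the Python snippet below builds a create-index request body for the submission index. The field names come from the queries later in these notes; the field types, the nested mapping for evaluations, and the typeless mapping layout (Elasticsearch 7 and later) are assumptions, not the project's actual mapping.

```python
import json


def submission_index_body():
    """Build a create-index body for a hypothetical `submission` index."""
    evaluator = {
        "properties": {
            # keyword = exact-value field, assumed so term queries match as typed
            "employeeId": {"type": "keyword"},
            "firstName": {"type": "keyword"},
            "lastName": {"type": "keyword"},
        }
    }
    evaluation = {
        # nested keeps each evaluation as its own hidden document
        "type": "nested",
        "properties": {
            "evaluationId": {"type": "long"},
            "evaluatorId": {"type": "keyword"},
            "status": {"type": "keyword"},
            "evaluator": evaluator,
        },
    }
    return {
        "mappings": {
            "properties": {
                "submissionId": {"type": "long"},
                "studentId": {"type": "keyword"},
                "taskId": {"type": "keyword"},
                "status": {"type": "keyword"},
                # dates assumed to be stored as epoch milliseconds
                "dateCreated": {"type": "date", "format": "epoch_millis"},
                "dateUpdated": {"type": "date", "format": "epoch_millis"},
                "evaluations": evaluation,
            }
        }
    }


# The JSON text that would be sent as the body of PUT /submission
REQUEST_BODY = json.dumps(submission_index_body(), indent=2)
```

On Elasticsearch 6.x the properties block would sit under a mapping type name instead of directly under mappings.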

EMA: Evaluation management platform that enables evaluators to grade submissions easily. The following items are searchable in EMA:

  • Assessment and task properties
  • EMA user properties such as roles, permissions, and assigned tasks
  • Submission search
  • Evaluation search

Getting Set Up Locally


  • Top features in Elasticsearch
    • Aggregation
    • Auto completion
    • Full-text search
    • Paging, sorting
    • Scoring & ranking of search results

Infrastructural action items:

  • Run it locally
    • Understand shard, index, and cluster management and the challenges that come with them
    • Understand resource requirements
      • Memory
      • Compute
      • Disk space
  • Cost comparison
  • Project the cost of ES as data size increases
  • Backup policy
  • Time to reindex everything
  • How to run it without downtime (availability)
  • Running the managed service in AWS vs. running Elasticsearch on EC2 instances

Document management: indexing & updating the documents:

The only dev work we need to do is to create:
  • A job that builds the index.
  • A job that scans the database tables to selectively insert, update, or delete documents in the index.
  • Options to continuously feed data to the index as new data arrives:
    • A service that feeds data to the index?
    • A scheduled service that checks the database for new or updated entries.
  • Customize the document structure per use case
  • Logstash
    • Data ingestion tool provided by Elastic
Some practical examples for querying data:

  • Query Use cases

    • Nested query
      • Find submissions by evaluator name, employee ID, or full name
    • Search term
      • Starting with search term
      • Ending with search term
      • Containing search term
    • Find by ID (number)
    • Find by date range
    • Find by regex search pattern
    • Fuzzy searches
    • Aggregation
    • Auto completion
  • Boolean Query

    • field1 AND field2
    • field1 OR field2
  • Paging related

    • Sort ascending or descending
    • Size of result
    • Specific page in the result set
    • Get only certain fields in the result set
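
Several of the use cases above map onto term-level query types (prefix, wildcard, fuzzy, regexp). The helpers below are a rough sketch of how those clauses could be built; the helper names are made up for illustration, and the target fields are assumed to be keyword fields because these queries match the indexed terms as they are.

```python
def starts_with(field, term):
    """Values that begin with the search term."""
    return {"prefix": {field: {"value": term}}}


def ends_with(field, term):
    """Values that end with the search term."""
    # A leading wildcard is valid but can be slow on a large index.
    return {"wildcard": {field: {"value": "*" + term}}}


def contains(field, term):
    """Values that contain the search term anywhere."""
    return {"wildcard": {field: {"value": "*" + term + "*"}}}


def fuzzy(field, term):
    """Values within a small edit distance of the search term."""
    return {"fuzzy": {field: {"value": term, "fuzziness": "AUTO"}}}


def matches_regex(field, pattern):
    """Values matching a regular expression (the whole value must match)."""
    return {"regexp": {field: {"value": pattern}}}


def sorted_search(clause, sort_field, descending=False):
    """Wrap one clause in a search body with a single sort key."""
    order = "desc" if descending else "asc"
    return {"query": clause, "sort": [{sort_field: {"order": order}}]}
```

For example, sorted_search(starts_with("studentId", "0009"), "dateCreated", descending=True) would list matching submissions newest first.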

Date filtering

Times are stored in UTC as epoch milliseconds. For example, 2018-10-15 12:59:23 Mountain time, taken as UTC-7, becomes 19:59:23 UTC and is stored as 1539633563000.
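
The conversion can be checked with a few lines of Python. This is a sketch that uses a fixed UTC-7 offset, which is what the numbers above imply; the function names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-7 offset, matching the example above (an assumption; a real
# job would more likely use a named time zone with daylight saving rules).
MOUNTAIN_FIXED = timezone(timedelta(hours=-7))


def to_epoch_millis(local_text, tz=MOUNTAIN_FIXED):
    """Convert 'YYYY-MM-DD HH:MM:SS' local time to epoch milliseconds."""
    local = datetime.strptime(local_text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=tz)
    return int(local.timestamp()) * 1000


def to_range_text(millis):
    """Format epoch milliseconds as UTC text in MM/dd/yyyy HH:mm:ss form."""
    utc = datetime.fromtimestamp(millis // 1000, tz=timezone.utc)
    return utc.strftime("%m/%d/%Y %H:%M:%S")
```

to_range_text produces the same text form that the range queries in these notes pass together with "format": "MM/dd/yyyy HH:mm:ss".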

Search in Date range
GET submission/_search
{
    "query": {
        "range" : {
            "dateCreated" : {
                "gte": "10/15/2018 19:59:23",
                "lte": "10/15/2018 19:59:24",
                "format": "MM/dd/yyyy HH:mm:ss"
            }
        }
    }
}
Get all the documents
GET /submission/_search?pretty
{
  "query": {
    "match_all": {}
  }
}
Get the mapping
GET /submission/submission/_mapping
GET /submission/
Delete the index
DELETE /submission

curl -XDELETE "http://localhost:9200/submission"
Create the index
curl -XPUT "http://localhost:9200/submission/" -H 'Content-Type: application/json' -d'
{}'

PUT /submission/
{}

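Nested Query Example (term matches the exact indexed value, so evaluations.evaluator.firstName is assumed to be a keyword field)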
curl -XGET "http://localhost:9200/submission/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "_source": false,
  "query": {
    "nested": {
      "path": "evaluations",
      "inner_hits": {},
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "evaluations.evaluationId": "56818"
              }
            },
            {
              "term": {
                "evaluations.evaluator.firstName": {
                  "value": "Jennifer"
                }
              }
            }
          ]
        }
      }
    }
  }
}'
Example: Combining Nested Query with other query
curl -XGET "http://localhost:9200/submission/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "evaluations",
            "query": {
              "bool": {
                "must": [
                  {
                    "match": {
                      "evaluations.evaluationId": "56818"
                    }
                  },
                  {
                    "term": {
                      "evaluations.evaluator.firstName": {
                        "value": "Jennifer"
                      }
                    }
                  }
                ]
              }
            }
          }
        },
        {
          "range": {
            "dateCreated": {
              "gte": "10/15/2018 19:59:22",
              "lte": "10/15/2018 19:59:24",
              "format": "MM/dd/yyyy HH:mm:ss"
            }
          }
        }
      ]
    }
  }
}'
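
When these bodies are generated from application code, the nested clause and the range clause can be built separately and then combined. The Python below is one way to sketch that; the function names are hypothetical, and the output mirrors the request above.

```python
def nested_evaluation_clause(evaluation_id, first_name):
    """Nested clause: both conditions must hold within the same evaluation."""
    return {
        "nested": {
            "path": "evaluations",
            "query": {
                "bool": {
                    "must": [
                        {"match": {"evaluations.evaluationId": evaluation_id}},
                        {"term": {"evaluations.evaluator.firstName": {"value": first_name}}},
                    ]
                }
            },
        }
    }


def created_between(start, end, fmt="MM/dd/yyyy HH:mm:ss"):
    """Range clause on dateCreated, with the date format stated explicitly."""
    return {"range": {"dateCreated": {"gte": start, "lte": end, "format": fmt}}}


def all_of(*clauses):
    """Search body in which every clause must match (bool/must)."""
    return {"query": {"bool": {"must": list(clauses)}}}
```

all_of(nested_evaluation_clause("56818", "Jennifer"), created_between("10/15/2018 19:59:22", "10/15/2018 19:59:24")) reproduces the body of the curl request.
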
Aggregations
curl -XGET "http://localhost:9200/submission/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "user_defined": {
      "nested": {
        "path": "evaluations"
      },
      "aggs": {
        "user_defined_string": {
          "stats": {
            "field": "evaluations.evaluationId"
          }
        }
      }
    }
  }
}'
Number of submissions by taskId, grouped by submission status
GET /submission/_search
{
  "size": 0,
  "query": {
    "match": {
      "taskId": "107"
    }
  },
  "aggs": {
    "number_of_submission": {
      "terms": {
        "field": "status"
      }
    }
  }
}
Number of submissions by evaluator ID, grouped by submission status
GET /submission/_search
{
  "size": 0,
  "query": {
    "match": {
      "evaluations.evaluatorId": "E00104876"
    }
  },
  "aggs": {
    "number_of_submission": {
      "terms": {
        "field": "status"
      }
    }
  }
}
Number of submissions by student and task ID, grouped by submission status
GET submission/_search?pretty
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "studentId": "000067181"
          }
        },
        {
          "match": {
            "taskId": "183"
          }
        }
      ]
    }
  },
  "aggs": {
    "submission_status": {
      "terms": {
        "field": "status",
        "size": 10
      }
    }
  }
}
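
The status-count requests above share one shape: filter the submissions, then bucket them by status. A small Python sketch of that pattern follows, together with a helper that flattens the buckets of the response. The function names are hypothetical, and the status field is assumed to be aggregatable (keyword or numeric).

```python
def status_counts_body(filters, bucket_limit=10):
    """Count submissions per status among those matching every filter."""
    must = [{"match": {field: value}} for field, value in filters.items()]
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": {"bool": {"must": must}},
        "aggs": {
            "submission_status": {
                "terms": {"field": "status", "size": bucket_limit}
            }
        },
    }


def buckets_to_counts(response, agg_name="submission_status"):
    """Turn the terms buckets of a search response into {status: count}."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}
```

status_counts_body({"studentId": "000067181", "taskId": "183"}) yields the same body as the last request above.
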
Specifying fields in the result, size of result and the offset
GET submission/_search?pretty
{
  "_source": [
    "evaluations.evaluationId",
    "evaluations.evaluator.employeeId",
    "evaluations.status"
  ],
  "from": 0,
  "size": 10,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "studentId": "000971426"
          }
        },
        {
          "match": {
            "taskId": "263"
          }
        }
      ]
    }
  }
}
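
For "specific page in the result set", the from offset is just the page number turned into a count of skipped hits. A minimal sketch with a hypothetical helper and 1-based page numbers:

```python
def page_window(page, page_size=10):
    """Translate a 1-based page number into from/size search parameters."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    # from + size is capped by index.max_result_window (10,000 by default),
    # so very deep pages need a different approach such as search_after.
    return {"from": (page - 1) * page_size, "size": page_size}
```

The result can be merged into any of the search bodies in these notes, for example {**page_window(3), "query": {"match_all": {}}}.
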
Support for existing EMA search features
  • Find submissions by student ID
  • Find submissions by submission ID
  • Find submissions by evaluator first name
  • Find submissions by evaluator last name
  • Find submissions by submission status

Find submissions submitted within:

	- < 24 hrs (1 day ago)
	- < 72 hrs (3 days ago)
	- < 7 days (7 days ago)
	- < 30 days (30 days ago)
GET submission/_search?pretty
{
  "from": 0,
  "size": 10,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "studentId": "000971426"
          }
        },
        {
          "match": {
            "submissionId": "4818"
          }
        },
        {
          "match": {
            "evaluations.evaluator.lastName": "Widick"
          }
        },
        {
          "match": {
            "evaluations.evaluator.firstName": "Lariann"
          }
        },
        {
          "match": {
            "taskId": "263"
          }
        },
        {
          "match": {
            "status": "64"
          }
        },
        {
          "range": {
            "dateUpdated": {
              "gte": "17/09/2018 05:54:37",
              "lte": "17/09/2018 05:54:37",
              "format": "dd/MM/yyyy HH:mm:ss"
            }
          }
        }
      ]
    }
  }
}
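
The "submitted within" filters (24 hours, 72 hours, 7 days, 30 days) do not need computed timestamps, because range queries on date fields accept date math relative to now. A sketch follows, assuming dateCreated is the field that records when a submission was made; the window labels and the function name are made up for illustration.

```python
# Window label -> Elasticsearch date math expression for the lower bound
WINDOWS = {
    "24h": "now-24h",
    "72h": "now-72h",
    "7d": "now-7d",
    "30d": "now-30d",
}


def submitted_within(window, field="dateCreated"):
    """Range clause selecting documents dated inside the given window."""
    if window not in WINDOWS:
        raise ValueError(f"unknown window: {window}")
    return {"range": {field: {"gte": WINDOWS[window], "lte": "now"}}}
```

The clause can replace the fixed dateUpdated range in the must list of the request above.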

Evaluator workload: number of submissions handled per task

Number of submissions by task and evaluator

Number of submissions by task

Number of submissions by task, evaluator, and student

Submissions per evaluator per task during a date range

Evaluations per task, broken out by evaluation status

Stats: average time spent per submission per task

Stats: time spent by evaluator per submission per task

Most evaluations

Most submissions
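
As a starting point for the workload reports, "submissions per evaluator per task during a date range" could be expressed as nested bucket aggregations. The sketch below assumes evaluations is mapped as nested and that taskId and evaluations.evaluatorId are aggregatable (keyword or numeric); the aggregation names and the function name are made up for illustration.

```python
def evaluator_workload_body(start, end, fmt="MM/dd/yyyy HH:mm:ss"):
    """Submissions per evaluator per task, limited to a creation date range."""
    return {
        "size": 0,
        "query": {
            "range": {"dateCreated": {"gte": start, "lte": end, "format": fmt}}
        },
        "aggs": {
            "by_task": {
                "terms": {"field": "taskId"},
                "aggs": {
                    "evaluations": {
                        # step into the nested evaluation documents
                        "nested": {"path": "evaluations"},
                        "aggs": {
                            "by_evaluator": {
                                "terms": {"field": "evaluations.evaluatorId"},
                                "aggs": {
                                    # step back out so doc_count counts
                                    # submissions rather than evaluations
                                    "submissions": {"reverse_nested": {}}
                                },
                            }
                        },
                    }
                },
            }
        },
    }
```

In the response, each by_evaluator bucket's doc_count is the number of evaluations, and its submissions.doc_count is the number of distinct submissions those evaluations belong to.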