@rupeshtiwari
Last active June 13, 2024 02:06
OpenSearch Basics

What is OpenSearch and Amazon OpenSearch Service

OpenSearch is a distributed, community-driven, Apache 2.0-licensed, 100% open-source search and analytics suite used for a broad set of use cases like real-time application monitoring, log analytics, and website search. OpenSearch provides a highly scalable system for providing fast access and response to large volumes of data with an integrated visualization tool, OpenSearch Dashboards, that makes it easy for users to explore their data. OpenSearch is powered by the Apache Lucene search library, and it supports a number of search and analytics capabilities such as k-nearest neighbors (KNN) search, SQL, Anomaly Detection, Machine Learning Commons, Trace Analytics, full-text search, and more.

Amazon OpenSearch Service is an AWS-managed service that lets you run and scale OpenSearch clusters without having to worry about managing, monitoring, and maintaining your infrastructure, or having to build in-depth expertise in operating OpenSearch clusters.

Elasticsearch is a distributed database that runs on multiple servers (nodes) and scales horizontally. Documents are stored in JSON format. Elasticsearch supports many data types, such as text, numbers, geospatial data, and IP addresses. It stores data in a data structure called an inverted index, which maps each term to the documents that contain it. Because the data is organized by the terms you search for, queries stay fast even over vast amounts of stored data.

image

image

We have 3 documents. Elasticsearch tokenizes these documents and finds all unique words, such as "be" and "left". For each word it records which documents contain it, how many times it occurs, and at which positions. For example, "be" is present in document ID 1, occurs 2 times, and sits at positions 2 and 6.
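The bookkeeping described above can be sketched in a few lines of Python. This is a toy illustration, not how Lucene stores its postings. The three sample documents are assumptions chosen so that "be" lands at positions 2 and 6 of document 1, as in the text; the original documents are only visible in the image.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [1-based positions of the term]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive tokenizer: lowercase and split on whitespace.
        for position, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index


# Hypothetical sample documents (assumed, not taken from the source image).
docs = {
    1: "to be or not to be",
    2: "turn left at the lake",
    3: "nothing left to be done",
}

index = build_inverted_index(docs)
postings = index["be"]

print(postings[1])            # [2, 6]  positions of "be" in document 1
print(len(postings[1]))       # 2       term frequency in document 1
print(sorted(index["left"]))  # [2, 3]  documents that contain "left"
```

A search for a term is then a dictionary lookup instead of a scan over every document, which is why the structure answers queries quickly.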

image

Nodes, Indexes and Shards

  • A node is a computer (server) running Elasticsearch
  • An Index is a logical group of one or more physical shards. Each shard is a Lucene index (a self-contained index)

image

  • There are two types of shards: primary and replicas. Replica shards are for redundancy and serving data queries.
  • The shards, data and queries are distributed among nodes to facilitate availability and scalability in a multi-node (multiple servers) cluster. The shards and data are automatically re-balanced when a node is added or removed
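As a rough sketch of how primaries and replicas might be spread over nodes, the toy allocator below places shard copies round-robin and never puts two copies of the same shard on one node. The function name `allocate_shards` and the node names are hypothetical. The real allocator weighs many more factors, so treat this only as an illustration of the idea.

```python
def allocate_shards(nodes, num_primaries, num_replicas=1):
    """Spread shard copies round-robin; copies of one shard land on different nodes."""
    if num_replicas >= len(nodes):
        raise ValueError("need more nodes than replicas per shard")
    layout = {node: [] for node in nodes}
    for shard in range(num_primaries):
        for copy in range(num_replicas + 1):
            # Offsetting by the copy number keeps a replica off its primary's node.
            node = nodes[(shard + copy) % len(nodes)]
            role = "primary" if copy == 0 else "replica"
            layout[node].append((shard, role))
    return layout


# Hypothetical three-node cluster with 3 primaries and 1 replica each.
three_nodes = allocate_shards(["node-a", "node-b", "node-c"], num_primaries=3)
for node, shards in three_nodes.items():
    print(node, shards)

# "Re-balancing" in this toy model is simply recomputing the layout with a new node.
four_nodes = allocate_shards(["node-a", "node-b", "node-c", "node-d"], num_primaries=3)
print(four_nodes["node-d"])
```

If one node fails in the three-node layout, every shard still has a copy on a surviving node, which is the redundancy that replicas provide.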

image

Index Template

  • An index template is a set of settings applied to an index when it is created. It is like a blueprint for creating an index
  • An index template contains settings such as the number of shards and replicas, data mappings, and priority
  • The data mapping of an index template defines the schema of the documents stored in the index. The mapping can be set to dynamic, so that the schema is derived while the data is being ingested. This is also called Schema on Write. If the mapping is set to strict, the index behaves like an RDBMS table and rejects incoming documents that contain fields not defined in the mapping properties. The following is an example console command to create an index template:
PUT _index_template/template_1
{
  "index_patterns": ["te*", "bar*"],
  "template": {
    "settings": {
      "number_of_shards": 1
    },
    "mappings": {
      "_source": {
        "enabled": true
      },
      "properties": {
        "host_name": {
          "type": "keyword"
        },
        "created_at": {
          "type": "date",
          "format": "EEE MMM dd HH:mm:ss Z yyyy"
        }
      }
    },
    "aliases": {
      "mydata": { }
    }
  },
  "priority": 500,
  "composed_of": ["component_template1", "runtime_component_template"], 
  "version": 3,
  "_meta": {
    "description": "my custom"
  }
}
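The difference between a dynamic and a strict mapping can be sketched with a small Python model. This is a simplified illustration, not the server's behavior in detail: the helper names `infer_type` and `index_document` are hypothetical, the type guesses are much cruder than the real dynamic-mapping rules, and the `status_code` field is an assumed example. The `host_name` and `created_at` fields come from the template above.

```python
def infer_type(value):
    """Very rough type guess for a new field (assumed simplification)."""
    if isinstance(value, bool):  # check bool first, since bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    return "text"


def index_document(mapping, doc, dynamic="true"):
    """Return the mapping after indexing doc, or raise if a strict mapping is violated."""
    updated = dict(mapping)
    for field, value in doc.items():
        if field in updated:
            continue
        if dynamic == "strict":
            raise ValueError(f"field [{field}] is not defined in the mapping")
        # Dynamic mode: extend the schema while the data is ingested.
        updated[field] = infer_type(value)
    return updated


mapping = {"host_name": "keyword", "created_at": "date"}
doc = {"host_name": "web-01", "status_code": 200}

print(index_document(mapping, doc, dynamic="true"))

try:
    index_document(mapping, doc, dynamic="strict")
except ValueError as err:
    print("rejected:", err)
```

With the dynamic setting the unknown `status_code` field is added to the schema, while the strict setting rejects the whole document.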

Index alias in Elasticsearch

image

  • An index alias is a name that refers to a group of one or more indices. Documents can be inserted through the alias, but only the index marked as the write index accepts them
  • An alias can be specified to include all the indices matching an index pattern (like mylogs-*). The following command creates an alias named “logs” that groups all indices starting with “logs-”:
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "logs-*",
        "alias": "logs"
      }
    }
  ]
}
  • Combining an index alias with index lifecycle management lets an index roll over automatically into a new index once it reaches a threshold age or size. The data is thereby split across multiple indices, which helps with efficiency and tiered storage. Splitting data into multiple indices also lets queries run in parallel across the resources of a multi-node cluster
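The write-index and rollover behavior described in this section can be modeled with a small in-memory class. This is a sketch under stated assumptions: `RolloverAlias` and its `max_docs` threshold are hypothetical stand-ins for a lifecycle policy, and a document count is used in place of a real age or size threshold.

```python
class RolloverAlias:
    """Toy alias: reads span every index, writes go only to the write index."""

    def __init__(self, alias, max_docs):
        self.alias = alias
        self.max_docs = max_docs  # assumed stand-in for a size threshold
        self.write_index = f"{alias}-000001"
        self.indices = {self.write_index: []}

    def index(self, doc):
        # Roll over before writing if the current write index is full.
        if len(self.indices[self.write_index]) >= self.max_docs:
            self._rollover()
        self.indices[self.write_index].append(doc)

    def _rollover(self):
        generation = len(self.indices) + 1
        self.write_index = f"{self.alias}-{generation:06d}"
        self.indices[self.write_index] = []

    def search(self, predicate):
        # A query against the alias fans out to all indices behind it.
        return [d for docs in self.indices.values() for d in docs if predicate(d)]


logs = RolloverAlias("logs", max_docs=2)
for i in range(5):
    logs.index({"id": i})

print(list(logs.indices))   # ['logs-000001', 'logs-000002', 'logs-000003']
print(logs.write_index)     # logs-000003
print(logs.search(lambda d: d["id"] % 2 == 0))
```

Clients keep using the single name `logs` throughout, while the data behind it is split into several smaller indices.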

Data streams in Elasticsearch

  • A data stream is an abstraction on top of indices, designed for append-only time-series documents. Clients send documents to the data stream rather than to an individual index. The data stream stores the data in backing indices, which are hidden
  • A new backing index is created whenever a configured index lifecycle policy threshold (such as a threshold age or size) is reached. Data can be queried from all backing indices but can be written only to the latest one
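A minimal in-memory model of these two points is shown below. It is a sketch, not the actual implementation: the `DataStream` class and its methods are hypothetical, rollover is triggered by hand instead of by a lifecycle policy, and timestamps are plain ISO-8601 strings in one fixed format so that string comparison orders them correctly.

```python
class DataStream:
    """Toy data stream: append-only writes to the newest backing index."""

    def __init__(self, name):
        self.name = name
        self.backing_indices = [[]]  # oldest first; the last one is the write index

    def append(self, doc):
        # Documents in a data stream carry a timestamp field.
        if "@timestamp" not in doc:
            raise ValueError("data stream documents need an @timestamp field")
        self.backing_indices[-1].append(doc)

    def rollover(self):
        # In practice a lifecycle policy decides when this happens.
        self.backing_indices.append([])

    def search(self, start, end):
        # Reads fan out over every backing index, old and new.
        return [
            doc
            for index in self.backing_indices
            for doc in index
            if start <= doc["@timestamp"] <= end
        ]


stream = DataStream("app-logs")
stream.append({"@timestamp": "2024-06-01T10:00:00Z", "message": "started"})
stream.append({"@timestamp": "2024-06-02T10:00:00Z", "message": "running"})
stream.rollover()
stream.append({"@timestamp": "2024-06-03T10:00:00Z", "message": "stopped"})

hits = stream.search("2024-06-02T00:00:00Z", "2024-06-03T23:59:59Z")
print([doc["message"] for doc in hits])  # ['running', 'stopped']
```

The query result spans both backing indices even though only the newest one accepted the last write.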

image

Reference:

https://www.youtube.com/watch?v=LqXj1oC1FH0
