@ChenLiZhan
Last active February 2, 2017 05:41
Note for Elasticsearch

Terminology

Relational DB -> Server -> Databases -> Schema -> Tables -> Rows -> Columns
Elasticsearch -> Node -> Indices -> Mapping -> Types  -> Documents -> Fields
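  • In Elasticsearch, the act of storing a document is called indexing.
  • Shard: the basis of Elasticsearch's distributed search. A complete index is split into several parts that are stored on the same or different nodes; each of these parts is called a shard.
  • Replica: a copy of a shard, much like replication in other systems. The total number of shards in an index is therefore primary shards × (1 + replicas per primary).
  • Mapping: defines the type of each field, assigning every field a specific data type (string, number, boolean, date, and so on).
  • Analysis: tokenizes full-text content to build the inverted index used for searching.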
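
The shard arithmetic in the Replica bullet can be checked with a few lines of Python. This is only a sketch: the function name total_shards and the example values (5 primaries, 1 replica each) are illustrative assumptions, not settings taken from these notes.

```python
def total_shards(primary_shards: int, replicas_per_primary: int) -> int:
    """Return the total number of shard copies (primaries plus replicas) in one index."""
    if primary_shards < 1 or replicas_per_primary < 0:
        raise ValueError("need at least one primary shard and a non-negative replica count")
    return primary_shards * (1 + replicas_per_primary)


# Hypothetical index with 5 primary shards, each with 1 replica.
print(total_shards(5, 1))  # 10
```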

Elasticsearch API Examples (RESTful)

Suppose we want to store a new employee record (a document) under the type employee in the index megacorp:

PUT /megacorp/employee/1
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}

Looking at the path of the request (/megacorp/employee/1), we can break it down as follows:

Path      Description
megacorp  Index
employee  Type
1         ID of the document
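
As a rough illustration, the same request can be prepared from Python with only the standard library. This sketch assumes a node reachable at http://localhost:9200 (the default HTTP port); the helper names split_document_path and build_index_request are made up for this example. The request is built but not sent.

```python
import json
import urllib.request

# Assumption: a local node listening on the default HTTP port.
BASE_URL = "http://localhost:9200"


def split_document_path(path: str) -> dict:
    """Split '/index/type/id' into its three named parts."""
    index, doc_type, doc_id = path.strip("/").split("/")
    return {"index": index, "type": doc_type, "id": doc_id}


def build_index_request(path: str, document: dict) -> urllib.request.Request:
    """Prepare (but do not send) the PUT request that indexes a document."""
    return urllib.request.Request(
        url=BASE_URL + path,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


employee = {
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
    "interests": ["sports", "music"],
}

request = build_index_request("/megacorp/employee/1", employee)
print(request.get_method(), request.full_url)
print(split_document_path("/megacorp/employee/1"))
# To actually send it against a running node: urllib.request.urlopen(request)
```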

Elasticsearch Search Examples

GET /megacorp/employee/_search

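Searching with _search instead of an employee ID returns data in a format like the one below. The hits section contains the search results (by default, Elasticsearch returns the first ten documents).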

{
   "took":      6,
   "timed_out": false,
   "_shards": { ... },
   "hits": {
      "total":      3,
      "max_score":  1,
      "hits": [
         {
            "_index":         "megacorp",
            "_type":          "employee",
            "_id":            "3",
            "_score":         1,
            "_source": {
               "first_name":  "Douglas",
               "last_name":   "Fir",
               "age":         35,
               "about":       "I like to build cabinets",
               "interests": [ "forestry" ]
            }
         },
         {
            "_index":         "megacorp",
            "_type":          "employee",
            "_id":            "1",
            "_score":         1,
            "_source": {
               "first_name":  "John",
               "last_name":   "Smith",
               "age":         25,
               "about":       "I love to go rock climbing",
               "interests": [ "sports", "music" ]
            }
         },
         {
            "_index":         "megacorp",
            "_type":          "employee",
            "_id":            "2",
            "_score":         1,
            "_source": {
               "first_name":  "Jane",
               "last_name":   "Smith",
               "age":         32,
               "about":       "I like to collect rock albums",
               "interests": [ "music" ]
            }
         }
      ]
   }
}
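
Once such a response has been parsed into a dictionary, the original documents sit under hits.hits[]._source. A minimal sketch of pulling them out in Python; the helper name extract_sources is an assumption for illustration, and sample_response is a trimmed-down copy of the response shape above.

```python
def extract_sources(response: dict) -> list:
    """Return the original documents (_source) from a search response, in hit order."""
    return [hit["_source"] for hit in response["hits"]["hits"]]


# Trimmed-down copy of the response shape shown above.
sample_response = {
    "took": 6,
    "timed_out": False,
    "hits": {
        "total": 3,
        "max_score": 1,
        "hits": [
            {"_id": "3", "_score": 1, "_source": {"first_name": "Douglas", "last_name": "Fir"}},
            {"_id": "1", "_score": 1, "_source": {"first_name": "John", "last_name": "Smith"}},
            {"_id": "2", "_score": 1, "_source": {"first_name": "Jane", "last_name": "Smith"}},
        ],
    },
}

for doc in extract_sources(sample_response):
    print(doc["first_name"], doc["last_name"])
```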

To search by one of a document's fields, we can send a request like this:

GET /megacorp/employee/_search?q=last_name:Smith

We still use the _search endpoint, and pass the query in the URL parameter q.
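
A small sketch of assembling that URL in Python with the standard library. The base URL (a local node on the default port) and the helper name build_search_url are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "http://localhost:9200"  # assumption: local node on the default port


def build_search_url(index: str, doc_type: str, field: str, value: str) -> str:
    """Build a query-string search URL of the form /index/type/_search?q=field:value."""
    query = urlencode({"q": f"{field}:{value}"})
    return f"{BASE_URL}/{index}/{doc_type}/_search?{query}"


url = build_search_url("megacorp", "employee", "last_name", "Smith")
print(url)
# Decoding the query string recovers the original q parameter.
print(parse_qs(urlsplit(url).query))
```

Note that urlencode percent-encodes the colon in the URL; decoding the query string gives back last_name:Smith unchanged.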

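Searching with the DSL

Elasticsearch provides a flexible query language (the DSL) that lets us build more complex queries.

The DSL (Domain Specific Language) is expressed in JSON. The search above can be rewritten with the following DSL: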

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "last_name" : "Smith"
        }
    }
}

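Next, let's use the DSL to define a more complex query: employees whose last name is Smith and who are older than 30. The range filter restricts age, and the match query is the same as before.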

GET /megacorp/employee/_search
{
    "query" : {
        "filtered" : {
            "filter" : {
                "range" : {
                    "age" : { "gt" : 30 }
                }
            },
            "query" : {
                "match" : {
                    "last_name" : "smith" 
                }
            }
        }
    }
}
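
On Elasticsearch 2.0 and later the filtered query is deprecated (it was removed in 5.0), and the same intent is usually written as a bool query with a filter clause. A sketch in Python that builds such a request body; the function name smiths_older_than is an illustrative assumption, and the body is only printed, not sent.

```python
import json


def smiths_older_than(min_age: int) -> dict:
    """Build a bool query: full-text match on last_name, non-scoring filter on age."""
    return {
        "query": {
            "bool": {
                "must": {"match": {"last_name": "smith"}},
                "filter": {"range": {"age": {"gt": min_age}}},
            }
        }
    }


body = smiths_older_than(30)
print(json.dumps(body, indent=4))
```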

Full-Text Search

Suppose we want to find employees whose about field mentions "rock climbing":

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "about" : "rock climbing"
        }
    }
}

We get a response like the following:

{
   ...
   "hits": {
      "total":      2,
      "max_score":  0.16273327,
      "hits": [
         {
            ...
            "_score":         0.16273327,
            "_source": {
               "first_name":  "John",
               "last_name":   "Smith",
               "age":         25,
               "about":       "I love to go rock climbing",
               "interests": [ "sports", "music" ]
            }
         },
         {
            ...
            "_score":         0.016878016,
            "_source": {
               "first_name":  "Jane",
               "last_name":   "Smith",
               "age":         32,
               "about":       "I like to collect rock albums",
               "interests": [ "music" ]
            }
         }
      ]
   }
}

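By default, Elasticsearch scores each result by relevance and sorts the results by that score, so the best match comes first. Here the first hit contains both rock and climbing, while the second contains only rock and therefore scores lower.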
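
To get an intuition for this ordering, here is a deliberately simplified sketch that ranks documents by the fraction of query terms they contain. It is not Elasticsearch's real scoring formula (the actual score is computed by Lucene and also weighs term frequency and rarity); the whitespace tokenizer and the overlap_score function are illustrative assumptions.

```python
def tokenize(text: str) -> list:
    """Lowercase and split on whitespace; a stand-in for a real analyzer."""
    return text.lower().split()


def overlap_score(query: str, text: str) -> float:
    """Fraction of distinct query terms that appear in the text."""
    query_terms = set(tokenize(query))
    if not query_terms:
        return 0.0
    doc_terms = set(tokenize(text))
    return len(query_terms & doc_terms) / len(query_terms)


abouts = {
    "John": "I love to go rock climbing",
    "Jane": "I like to collect rock albums",
    "Douglas": "I like to build cabinets",
}

# Highest score first, mirroring the default sort by relevance.
ranked = sorted(
    ((overlap_score("rock climbing", text), name) for name, text in abouts.items()),
    reverse=True,
)
for score, name in ranked:
    if score > 0:  # documents matching no term are not hits at all
        print(name, score)
```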

Phrase Search

To match an exact sequence of words (a phrase), for example employees whose about field contains rock and climbing next to each other, we only need to change match to match_phrase:

GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    }
}
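
The difference from a plain match can be sketched in Python: a phrase only matches when its terms appear consecutively and in order. This toy check uses a whitespace tokenizer as a stand-in for a real analyzer; the function names are assumptions for illustration, not part of any Elasticsearch client.

```python
def tokenize(text: str) -> list:
    """Lowercase and split on whitespace; a stand-in for a real analyzer."""
    return text.lower().split()


def contains_phrase(text: str, phrase: str) -> bool:
    """True if the phrase's terms appear in the text consecutively and in order."""
    words = tokenize(text)
    target = tokenize(phrase)
    if not target:
        return False
    return any(
        words[i:i + len(target)] == target
        for i in range(len(words) - len(target) + 1)
    )


print(contains_phrase("I love to go rock climbing", "rock climbing"))     # True
print(contains_phrase("I like to collect rock albums", "rock climbing"))  # False
```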

Highlighting Search Results

We only need to add a highlight section to the previous request:

GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    },
    "highlight": {
        "fields" : {
            "about" : {}
        }
    }
}

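The response contains a new section called highlight, which holds the matching text fragment from the about field with the matched words wrapped in <em></em>: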

{
   ...
   "hits": {
      "total":      1,
      "max_score":  0.23013961,
      "hits": [
         {
            ...
            "_score":         0.23013961,
            "_source": {
               "first_name":  "John",
               "last_name":   "Smith",
               "age":         25,
               "about":       "I love to go rock climbing",
               "interests": [ "sports", "music" ]
            },
            "highlight": {
               "about": [
                  "I love to go <em>rock</em> <em>climbing</em>" 
               ]
            }
         }
      ]
   }
}
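
The effect on the text can be imitated locally with a regular expression. This is only a toy sketch of the output format, assuming whole-word, case-insensitive matching; the real highlighter works on the analyzed field, and the function name highlight here is made up.

```python
import re


def highlight(text: str, terms: list) -> str:
    """Wrap every whole-word occurrence of the given terms in <em></em>."""
    if not terms:
        return text
    pattern = r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b"
    return re.sub(pattern, r"<em>\1</em>", text, flags=re.IGNORECASE)


print(highlight("I love to go rock climbing", ["rock", "climbing"]))
# I love to go <em>rock</em> <em>climbing</em>
```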

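Analytics

Elasticsearch also offers functionality similar to SQL's GROUP BY for analysis, called aggregations.

Suppose we want to find which interests the employees have in common, that is, the most popular ones. We can send the following request: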

GET /megacorp/employee/_search
{
  "aggs": {
    "all_interests": {
      "terms": { "field": "interests" }
    }
  }
}

Elasticsearch returns the following result:

{
   ...
   "hits": { ... },
   "aggregations": {
      "all_interests": {
         "buckets": [
            {
               "key":       "music",
               "doc_count": 2
            },
            {
               "key":       "forestry",
               "doc_count": 1
            },
            {
               "key":       "sports",
               "doc_count": 1
            }
         ]
      }
   }
}
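
What the terms aggregation computes can be reproduced on the three sample employees with a Counter. This is a local sketch, not a client call: the function name terms_buckets is an assumption, and breaking ties alphabetically is a choice made here so the order matches the sample output above.

```python
from collections import Counter

employees = [
    {"first_name": "John", "interests": ["sports", "music"]},
    {"first_name": "Jane", "interests": ["music"]},
    {"first_name": "Douglas", "interests": ["forestry"]},
]


def terms_buckets(docs: list, field: str) -> list:
    """Count how many documents contain each distinct value of a list field."""
    counts = Counter()
    for doc in docs:
        # set() so a value repeated inside one document is counted once.
        counts.update(set(doc.get(field, [])))
    # Most frequent first; ties broken alphabetically (assumption).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered]


for bucket in terms_buckets(employees, "interests"):
    print(bucket)
```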

Distributed feature: replicas are not duplicated on the same node

[Image: Elasticsearch example]
[Image: Elasticsearch backup]
