yaasita/memo.md

## memo.md

      
N-gram search

All of the examples below were run in Kibana.
Creating the index

Create the index with the following settings:
PUT users
{
      "settings": {
        "analysis": {
            "analyzer": {
                "default": {
                    "tokenizer": "ngram_tokenizer"
                }
            },
            "tokenizer": {
                "ngram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 1,
                    "max_gram": 2,
                    "token_chars": [
                        "letter",
                        "digit"
                    ]
                }
            }
        }
    }
}

Loading data

Create a user named abc:
PUT users/user/abc
{
  "user_id": "abc"
}

Assume the users index contains only this one document.
How is this data stored? The term vectors show the indexed terms:
GET users/user/abc/_termvectors?fields=user_id

# response
{
  "_index": "users",
  "_type": "user",
  "_id": "abc",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "user_id": {
      "field_statistics": {
        "sum_doc_freq": 5,
        "doc_count": 1,
        "sum_ttf": 5
      },
      "terms": {
        "a": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 1
            }
          ]
        },
        "ab": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 0,
              "end_offset": 2
            }
          ]
        },
        "b": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 2,
              "start_offset": 1,
              "end_offset": 2
            }
          ]
        },
        "bc": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 3,
              "start_offset": 1,
              "end_offset": 3
            }
          ]
        },
        "c": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 4,
              "start_offset": 2,
              "end_offset": 3
            }
          ]
        }
      }
    }
  }
}

As shown, the value was split into n-grams with a minimum length of 1 and a maximum length of 2.

Indexed terms for n-gram = 1

a
b
c


Indexed terms for n-gram = 2

ab
bc
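
The splitting shown in the term vectors can be reproduced with a small sketch. This is not Elasticsearch code: `ngram_tokens` is a hypothetical helper, and it assumes the input contains only letters and digits, so the `token_chars` splitting is not modeled.

```python
def ngram_tokens(text, min_gram, max_gram):
    """Return (token, start_offset, end_offset) tuples in emission order.

    Assumption: for each start offset, grams are emitted from the
    shortest length to the longest, which is the order seen in the
    term vector positions.
    """
    tokens = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            end = start + size
            if end > len(text):
                break  # longer grams would run past the end of the text
            tokens.append((text[start:end], start, end))
    return tokens


# The list index plays the role of "position" in the term vectors.
for position, (token, start, end) in enumerate(ngram_tokens("abc", 1, 2)):
    print(position, token, start, end)
```

For "abc" with min_gram 1 and max_gram 2 this yields a, ab, b, bc, c with the same positions and offsets as the `_termvectors` response.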


Running a search

Now send bb as the search query.
The query string is tokenized as follows:
POST users/_analyze
{
  "text": "bb"
}

# response
{
  "tokens": [
    {
      "token": "b",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "bb",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "b",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 2
    }
  ]
}

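The query string is split into three tokens (b, bb, b), and each is looked up in the index.
Comparing them with the indexed terms above:

Indexed terms for n-gram = 1

a
b ← matches the 1st token (b) and the 3rd token (b)
c


Indexed terms for n-gram = 2

ab
bc


So there are two matches, and the document appears in the search results.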
GET users/user/_search
{
  "query": {
    "match": {
      "user_id": "bb"
    }
  }
}

# response
{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "users",
        "_type": "user",
        "_id": "abc",
        "_score": 0.5753642,
        "_source": {
          "user_id": "abc"
        }
      }
    ]
  }
}
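
The match counting described above can be sketched outside Elasticsearch. The helper names `ngrams` and `matching_tokens` are made up for illustration, and the sketch assumes both the document and the query go through the same 1 to 2 character n-gram splitting, as the `default` analyzer does here.

```python
def ngrams(text, min_gram, max_gram):
    """Split text into n-grams, shortest first at each start offset."""
    return [
        text[i:i + n]
        for i in range(len(text))
        for n in range(min_gram, max_gram + 1)
        if i + n <= len(text)
    ]


def matching_tokens(document, query, min_gram, max_gram):
    """Return the query tokens that also exist as indexed terms."""
    indexed_terms = set(ngrams(document, min_gram, max_gram))
    return [t for t in ngrams(query, min_gram, max_gram) if t in indexed_terms]


print(ngrams("bb", 1, 2))                   # ['b', 'bb', 'b']
print(matching_tokens("abc", "bb", 1, 2))   # ['b', 'b']
```

The token b is counted twice because it occurs twice in the query, even though it is a single indexed term.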

Likewise, try searching for hbc.
POST users/_analyze
{
  "text": "hbc"
}

# response
{
  "tokens": [
    {
      "token": "h",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "hb",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "b",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 2
    },
    {
      "token": "bc",
      "start_offset": 1,
      "end_offset": 3,
      "type": "word",
      "position": 3
    },
    {
      "token": "c",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 4
    }
  ]
}

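This time the query is split into five tokens: h, b, c, hb, bc.
Comparing them with the indexed terms above:

Indexed terms for n-gram = 1

a
b ← matches the 2nd token (b)
c ← matches the 3rd token (c)


Indexed terms for n-gram = 2

ab
bc ← matches the 5th token (bc)


There are three matches, so the score is higher than before.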
GET users/user/_search
{
  "query": {
    "match": {
      "user_id": "hbc"
    }
  }
}

# response
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.8630463,
    "hits": [
      {
        "_index": "users",
        "_type": "user",
        "_id": "abc",
        "_score": 0.8630463,
        "_source": {
          "user_id": "abc"
        }
      }
    ]
  }
}
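
The two scores are consistent with each matching query token contributing the same amount. The sketch below is an assumption, not something the memo states: it uses the classic BM25 formula with k1 = 1.2 and b = 0.75 (believed to be the defaults for this Elasticsearch generation), with doc_count = 1 and a field length of 5 taken from `field_statistics` (`sum_ttf` is 5 for the single document).

```python
import math

K1 = 1.2   # assumed BM25 default
B = 0.75   # assumed BM25 default


def bm25_term_score(freq, doc_freq, doc_count, field_len, avg_field_len):
    """Score contribution of one query term under classic BM25 (assumed formula)."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    tf_norm = (freq * (K1 + 1)) / (
        freq + K1 * (1 - B + B * field_len / avg_field_len)
    )
    return idf * tf_norm


# One document, every term occurs once, field length equals the average.
per_term = bm25_term_score(freq=1, doc_freq=1, doc_count=1,
                           field_len=5, avg_field_len=5)

print(f"{2 * per_term:.4f}")  # bb:  two matching tokens
print(f"{3 * per_term:.4f}")  # hbc: three matching tokens
```

Under these assumptions the results agree with the reported 0.5753642 and 0.8630463 to about six decimal places; the remaining difference is presumably floating-point precision.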

Changing the N-gram settings

This time, set the N-gram minimum to 2 and the maximum to 3.
Delete the index and recreate it.
DELETE users

PUT users
{
      "settings": {
        "analysis": {
            "analyzer": {
                "default": {
                    "tokenizer": "ngram_tokenizer"
                }
            },
            "tokenizer": {
                "ngram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 3,
                    "token_chars": [
                        "letter",
                        "digit"
                    ]
                }
            }
        }
    }
}

Loading data

Create a user named abc:
PUT users/user/abc
{
  "user_id": "abc"
}

Assume the users index contains only this one document.
How is this data stored? The term vectors show the indexed terms:
GET users/user/abc/_termvectors?fields=user_id

# response
{
  "_index": "users",
  "_type": "user",
  "_id": "abc",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "user_id": {
      "field_statistics": {
        "sum_doc_freq": 3,
        "doc_count": 1,
        "sum_ttf": 3
      },
      "terms": {
        "ab": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 2
            }
          ]
        },
        "abc": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 0,
              "end_offset": 3
            }
          ]
        },
        "bc": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 2,
              "start_offset": 1,
              "end_offset": 3
            }
          ]
        }
      }
    }
  }
}

As shown, the value was split into n-grams with a minimum length of 2 and a maximum length of 3.

Indexed terms for n-gram = 2

ab
bc


Indexed terms for n-gram = 3

abc


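Running a search

Now send bb as the search query.
Because the minimum n-gram length is 2, bb is not split further and is compared with the indexed terms as is,
but bb is not in the index, so nothing matches.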
GET users/user/_search
{
  "query": {
    "match": {
      "user_id": "bb"
    }
  }
}

# response
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

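Also, a single-character query never hits, whatever the character, because no one-character terms are indexed
(searching for a, b, or c on its own returns no hits).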
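
A minimal sketch of why these queries miss, using the hypothetical `ngrams` helper (not an Elasticsearch API) with min_gram 2 and max_gram 3. It assumes, as the splitting above suggests, that input shorter than min_gram produces no tokens at all.

```python
def ngrams(text, min_gram, max_gram):
    """Split text into n-grams, shortest first at each start offset."""
    return [
        text[i:i + n]
        for i in range(len(text))
        for n in range(min_gram, max_gram + 1)
        if i + n <= len(text)
    ]


indexed_terms = set(ngrams("abc", 2, 3))  # ab, abc, bc

for query in ["bb", "a", "b", "c"]:
    tokens = ngrams(query, 2, 3)
    hits = [t for t in tokens if t in indexed_terms]
    print(query, tokens, hits)
```

bb produces the single token bb, which is not indexed, while each one-character query produces an empty token list, so there is nothing to look up.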
Likewise, try searching for hbc.
POST users/_analyze
{
  "text": "hbc"
}

# response
{
  "tokens": [
    {
      "token": "hb",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "hbc",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 1
    },
    {
      "token": "bc",
      "start_offset": 1,
      "end_offset": 3,
      "type": "word",
      "position": 2
    }
  ]
}

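This time the query is split into three tokens: hb, bc, hbc.
Comparing them with the indexed terms above:

Indexed terms for n-gram = 2

ab
bc ← matches the 2nd token (bc)


Indexed terms for n-gram = 3

abc


So the search returns a hit.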
GET users/user/_search
{
  "query": {
    "match": {
      "user_id": "hbc"
    }
  }
}

# response
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "users",
        "_type": "user",
        "_id": "abc",
        "_score": 0.2876821,
        "_source": {
          "user_id": "abc"
        }
      }
    ]
  }
}