@spaghetticode
Last active October 18, 2018 08:38
06.updates

This second paragraph in total looks very nice. It even shows how to use the "suggest" feature standalone. However, whether or not it may be embedded and executed in the actual search, we do not know. (Note: we're still in operation "2d) search")

You can combine the suggester with a query. You can use the same term or even use different words for suggestion and query. Let's see an example:

GET /docs/doc/_search
{
  "query": {
    "match": {"title": "liitle neemo"}
  },
  "suggest" : {
    "maybe-you-mean-from-title" : {
      "text" : "neemo",
      "term" : {
        "field" : "title.en"
      }
    },
    "maybe-you-mean-from-body" : {
      "text" : "liitle",
      "term" : {
        "field" : "body"
      }
    }  
  }
}

Here's the output:

{
   "took": 7,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 0,
      "max_score": null,
      "hits": []
   },
   "suggest": {
      "maybe-you-mean-from-title": [
         {
            "text": "neemo",
            "offset": 0,
            "length": 5,
            "options": [
               {
                  "text": "nemo",
                  "score": 0.75,
                  "freq": 69
               }
            ]
         }
      ],
      "maybe-you-mean-from-body": [
         {
            "text": "liitle",
            "offset": 0,
            "length": 6,
            "options": [
               {
                  "text": "little",
                  "score": 0.8333333,
                  "freq": 46
               }
            ]
         }
      ]
   }
}

Running this request returns no hits for the match query "liitle neemo", because a match query performs a full-text search without fuzzy matching unless you ask for it. The suggest part of the request, on the other hand, still returns suggestions. This is useful if you want to build an automatic suggestion system for searches that retrieve no results.
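
As a rough illustration of that fallback, the Ruby sketch below reads a response shaped like the one above and, when there are no hits, assembles a "did you mean" string from the first option of each suggester. The helper name `did_you_mean` is hypothetical, and the code assumes `hits.total` is a plain integer, as in the output shown here.

```ruby
require 'json'

# Hypothetical helper: returns a "did you mean" string built from the
# first option of each suggestion entry, or nil when the search had hits
# or no suggester offered an option.
def did_you_mean(response)
  return nil unless response.dig('hits', 'total').to_i.zero?

  words = response.fetch('suggest', {}).values.flatten(1).map { |entry|
    entry.fetch('options', []).first&.fetch('text')
  }.compact

  words.empty? ? nil : words.join(' ')
end

# Trimmed copy of the response shown above.
response = JSON.parse(<<~JSON)
  {
    "hits": { "total": 0, "max_score": null, "hits": [] },
    "suggest": {
      "maybe-you-mean-from-title": [
        { "text": "neemo", "offset": 0, "length": 5,
          "options": [ { "text": "nemo", "score": 0.75, "freq": 69 } ] }
      ],
      "maybe-you-mean-from-body": [
        { "text": "liitle", "offset": 0, "length": 6,
          "options": [ { "text": "little", "score": 0.8333333, "freq": 46 } ] }
      ]
    }
  }
JSON

puts did_you_mean(response)
```

Note that the words come back in the order of the suggesters in the response, not in the order the user typed them, so a real implementation would probably want to map each suggestion back to its position in the original query.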

We're looking for words which we do not know, but which occurred in the current result set of that search query quite (but not too) often.

This query also extracts the significant terms in the result set:

GET /docs/doc/_search
{
  "query" : {
    "match" : {"title":"nemo" }
  },
  "aggregations" : {
    "frequent-words" : {
      "significant_terms" : { "field" : "body", "size": 10 }
    }
  }
}

Here we query for documents whose title matches "nemo", and we ask for the 10 ("size": 10) most significant terms in the body field ("field": "body") of the result set.

With the currently indexed data you won't get any significant_terms results, because there are too few documents to work with. You can quickly add some semi-random data with this Ruby script:

# Pick `size` random words from the array and join them into one string.
def randomize(array, size)
  array.sample(size).join(' ')
end

# Index documents 4..100 with semi-random titles, bodies and keywords.
4.upto(100) do |i|
  system %(curl -XPUT "http://localhost:9200/docs/doc/#{i}" -d'
    {
      "title":   "#{randomize %w[Finding Nemo Swordfish], 2}",
      "body":    "#{randomize %w[A beautiful cartoon about a little fish named Nemo with lots of fun and the usual happy ending], 9}",
      "keywords": "#{randomize %w[kids cartoons pixar action families], 3}"
    }')
end

After running the query again you should get some significant terms. This is the beginning of the interesting part of the response:

   "aggregations": {
      "frequent-words": {
         "doc_count": 53,
         "buckets": [
            {
               "key": "lots",
               "doc_count": 31,
               "score": 0.12783315533404532,
               "bg_count": 48
            },
            {
               "key": "ending",
               "doc_count": 29,
               "score": 0.08984040659582045,
               "bg_count": 47
            },

Each significant word is returned in a bucket JSON object.

doc_count is the number of documents in the result set that contain the word, while bg_count is the number of documents in the whole index that contain it. If the word is noticeably more frequent in the result set than in the index as a whole, it gets listed among the significant_terms.
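
To make the relationship between doc_count and bg_count concrete, here is a small Ruby sketch of the JLH heuristic, which the Elasticsearch documentation describes as the default scoring for significant_terms. Two things here are assumptions rather than facts from the output above: that JLH was the heuristic in use, and that the index held 100 documents at the time (`INDEX_SIZE`). With those assumptions the formula gives roughly 0.1278 for "lots" and 0.0898 for "ending", in line with the scores in the buckets shown.

```ruby
# JLH score: (foreground% - background%) * (foreground% / background%).
# foreground% = share of result-set documents containing the term,
# background% = share of all indexed documents containing the term.
def jlh_score(doc_count, subset_size, bg_count, superset_size)
  fg = doc_count.to_f / subset_size
  bg = bg_count.to_f / superset_size
  # A term that is not more frequent in the result set is not significant.
  return 0.0 if bg.zero? || fg <= bg

  (fg - bg) * (fg / bg)
end

INDEX_SIZE  = 100 # assumption: total number of documents in the index
SUBSET_SIZE = 53  # top-level doc_count of the aggregation above

puts jlh_score(31, SUBSET_SIZE, 48, INDEX_SIZE) # bucket "lots"
puts jlh_score(29, SUBSET_SIZE, 47, INDEX_SIZE) # bucket "ending"
```

A term that appears in about the same share of documents inside and outside the result set scores near zero, which is why common words do not show up as significant.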

1. One single example query for a SEARCH (just include two MUST and two NOT words, since phrases can't do fuzzy properly)

2. ...which also outputs 'interesting' words in the result set, as specified above (alternatively, you could also use that experimental 'significant terms', whatever suits you).

GET /docs/doc/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title.en": "find"}},
        { "match": { "title": "nemo"}}
      ],
      "must_not": [
        { "match": { "title": "transformers"}},
        { "match": { "title": "robin"}}
      ]
    }
  },
  "aggregations" : {
    "frequent-words" : {
      "significant_terms" : { "field" : "body", "size": 10 }
    }
  }
}

Keep in mind that significant_terms is expensive, so limit the number of buckets returned to what you really need (the example uses "size": 10, which I think is more than enough).
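
If you need to assemble such requests from user input, a builder along these lines may help. This is only a sketch: `build_search` and its parameters are hypothetical names, and for simplicity it matches every word against a single field, whereas the request above uses title.en for one of the must clauses.

```ruby
require 'json'

# Hypothetical builder for a bool query with required and excluded words,
# plus a significant_terms aggregation limited to `size` buckets.
def build_search(must:, must_not:, field: 'title', agg_field: 'body', size: 10)
  to_match = ->(word) { { match: { field => word } } }
  {
    query: {
      bool: {
        must: must.map(&to_match),
        must_not: must_not.map(&to_match)
      }
    },
    aggregations: {
      'frequent-words' => {
        significant_terms: { field: agg_field, size: size }
      }
    }
  }
end

body = build_search(must: %w[find nemo], must_not: %w[transformers robin], size: 5)
puts JSON.pretty_generate(body)
```

The printed JSON can be sent as the body of a GET /docs/doc/_search request; keeping `size` as an explicit argument makes it harder to request more buckets than you need by accident.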
