spaghetticode/06.updates.md

## 06.updates.md

      
Overall, this second paragraph looks very nice. It even shows how to use the "suggest" feature standalone. However, we do not know whether it can be embedded and executed in the actual search. (Note: we're still in operation "2d) search".)

You can combine the suggester with a query. You can use the same term or even use different words for suggestion and query. Let's see an example:
GET /docs/doc/_search
{
  "query": {
    "match": {"title": "liitle neemo"}
  },
  "suggest" : {
    "maybe-you-mean-from-title" : {
      "text" : "neemo",
      "term" : {
        "field" : "title.en"
      }
    },
    "maybe-you-mean-from-body" : {
      "text" : "liitle",
      "term" : {
        "field" : "body"
      }
    }  
  }
}

Here's the output:
{
   "took": 7,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 0,
      "max_score": null,
      "hits": []
   },
   "suggest": {
      "maybe-you-mean-from-title": [
         {
            "text": "neemo",
            "offset": 0,
            "length": 5,
            "options": [
               {
                  "text": "nemo",
                  "score": 0.75,
                  "freq": 69
               }
            ]
         }
      ],
      "maybe-you-mean-from-body": [
         {
            "text": "liitle",
            "offset": 0,
            "length": 6,
            "options": [
               {
                  "text": "little",
                  "score": 0.8333333,
                  "freq": 46
               }
            ]
         }
      ]
   }
}
Running this request returns no results for the match query part ("liitle neemo"), because it is a full-text search without fuzzy matching. The suggest part of the request, on the other hand, still returns suggestions. This is useful if you want to build an automatic suggestion system for searches that retrieve no results.
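The fallback described above could be sketched in Ruby roughly as follows. The helper name `fallback_suggestions` is hypothetical, and the sketch assumes the response has already been fetched and that `hits.total` is a plain integer, as in the output shown above.

```ruby
require 'json'

# Hypothetical helper: when the query matched nothing, return the
# highest-scoring option of each suggester; otherwise return an empty hash.
def fallback_suggestions(response)
  return {} unless response.dig('hits', 'total').to_i.zero?

  response.fetch('suggest', {}).each_with_object({}) do |(name, entries), best|
    options = entries.flat_map { |entry| entry['options'] }
    top = options.max_by { |option| option['score'] }
    best[name] = top['text'] if top
  end
end

# Trimmed copy of the response shown above.
response = JSON.parse(<<~BODY)
  {
    "hits": { "total": 0, "max_score": null, "hits": [] },
    "suggest": {
      "maybe-you-mean-from-title": [
        { "text": "neemo", "offset": 0, "length": 5,
          "options": [ { "text": "nemo", "score": 0.75, "freq": 69 } ] }
      ],
      "maybe-you-mean-from-body": [
        { "text": "liitle", "offset": 0, "length": 6,
          "options": [ { "text": "little", "score": 0.8333333, "freq": 46 } ] }
      ]
    }
  }
BODY

suggestions = fallback_suggestions(response)
suggestions.each { |name, text| puts "#{name}: did you mean \"#{text}\"?" }
```

When the query does return hits, the helper returns an empty hash, so the caller can show the regular results instead.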
We're looking for words which we do not know, but which occurred quite (but not too) often in the current result set of that search query.

This query also extracts the significant terms in the result set:
GET /docs/doc/_search
{
  "query" : {
    "match" : {"title":"nemo" }
  },
  "aggregations" : {
    "frequent-words" : {
      "significant_terms" : { "field" : "body", "size": 10 }
    }
  }
}
Here we query for the documents whose title matches "nemo", and we ask for the 10 ("size": 10) most significant terms in the body fields of the result set ("field": "body").
With the currently indexed data you won't get any significant_terms result, because there is not enough data to work on. You can quickly add some semi-random data with this Ruby script:
# Pick `size` distinct random words from `array` and join them with spaces.
def randomize(array, size)
  array.sample(size).join(' ')
end

# Index documents 4..100 with semi-random title, body and keywords.
4.upto(100) do |i|
  system %(curl -XPUT  "http://localhost:9200/docs/doc/#{i}" -d'
    {
      "title":   "#{randomize %w[Finding Nemo Swordfish], 2}",
      "body":    "#{randomize %w[A beautiful cartoon about a little fish named Nemo with lots of fun and the usual happy ending], 9}",
      "keywords": "#{randomize %w[kids cartoons pixar action families], 3}"
    }')
end
After running the query once again you should get some significant terms. This is the beginning of the interesting part of the resultset:
   "aggregations": {
      "frequent-words": {
         "doc_count": 53,
         "buckets": [
            {
               "key": "lots",
               "doc_count": 31,
               "score": 0.12783315533404532,
               "bg_count": 48
            },
            {
               "key": "ending",
               "doc_count": 29,
               "score": 0.08984040659582045,
               "bg_count": 47
            },
Each significant word is returned in a bucket JSON object.
doc_count is the number of documents in the result set that contain the word, while bg_count is the number of documents in the whole index that contain it. If the word is considerably more frequent in the result set than in the index, it is listed among the significant_terms.
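To make the relation between doc_count and bg_count concrete, here is a minimal sketch of the JLH heuristic, which is the default scoring for significant_terms as far as I know. The index size of 100 documents is an assumption (three existing documents plus the 97 added by the script); the output above does not state it, although this value does reproduce the scores shown.

```ruby
# JLH score: (foreground% - background%) * (foreground% / background%)
# foreground% = share of result-set documents containing the term
# background% = share of all indexed documents containing the term
def jlh_score(doc_count:, subset_size:, bg_count:, superset_size:)
  foreground = doc_count.to_f / subset_size
  background = bg_count.to_f / superset_size
  # Terms that are not more frequent in the result set score zero.
  return 0.0 if background.zero? || foreground <= background

  (foreground - background) * (foreground / background)
end

index_size = 100 # assumption: not stated in the original output
result_set = 53  # top-level doc_count of the aggregation

lots   = jlh_score(doc_count: 31, subset_size: result_set, bg_count: 48, superset_size: index_size)
ending = jlh_score(doc_count: 29, subset_size: result_set, bg_count: 47, superset_size: index_size)

puts format('lots:   %.6f', lots)
puts format('ending: %.6f', ending)
```

A word like "lots" appears in about 58% of the result set but only 48% of the index, which is why it ranks above "ending" (about 55% versus 47%).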
1. One single example query for a SEARCH (just include two MUST and two NOT words, since phrases can't do fuzzy properly)

2. ...which also outputs 'interesting' words in the result set, as specified above (alternatively, you could also use that experimental 'significant terms', whatever suits you).

GET /docs/doc/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title.en": "find"}},
        { "match": { "title": "nemo"}}
      ],
      "must_not": [
        { "match": { "title": "transformers"}},
        { "match": { "title": "robin"}}
      ]
    }
  },
  "aggregations" : {
    "frequent-words" : {
      "significant_terms" : { "field" : "body", "size": 10 }
    }
  }
}
Keep in mind that significant_terms is expensive, so you should limit the number of buckets returned to what you really need (the example uses "size": 10, which I think is more than enough).
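If the MUST and NOT words come from user input, the request body could be assembled along these lines. This is only a sketch: `build_search` is a hypothetical helper, and for simplicity it matches every word against a single field, whereas the request above mixes "title.en" and "title".

```ruby
require 'json'

# Hypothetical helper: builds a bool query with required and excluded words,
# plus a significant_terms aggregation limited to `size` buckets.
def build_search(must:, must_not:, query_field: 'title', terms_field: 'body', size: 10)
  match = ->(word) { { match: { query_field => word } } }
  {
    query: {
      bool: {
        must: must.map(&match),
        must_not: must_not.map(&match)
      }
    },
    aggregations: {
      'frequent-words' => {
        significant_terms: { field: terms_field, size: size }
      }
    }
  }
end

body = build_search(must: %w[find nemo], must_not: %w[transformers robin])
puts JSON.pretty_generate(body)
```

The generated JSON can then be sent as the body of GET /docs/doc/_search, for example with curl as in the Ruby script above.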