@spaghetticode
Last active October 18, 2018 08:40
More queries

-OR (better), if feasible: whole phrases instead of words, for context awareness

Let's see some possible solutions. This one looks for the three words in the body, but requires that at least 2 of them are present:

GET /docs/doc/_search
{
  "query": {
    "match": {
      "body": {
        "query": "small fish cartoon",
        "minimum_should_match": 2
      }
    }
  }
}
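
The same constraint can be written more explicitly as a bool query with should clauses; this sketch is equivalent in spirit to the match query above:

GET /docs/doc/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "body": "small" }},
        { "match": { "body": "fish" }},
        { "match": { "body": "cartoon" }}
      ],
      "minimum_should_match": 2
    }
  }
}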

You can also use percentages with minimum_should_match. Elasticsearch converts the percentage into a number of terms (here 66% of the 3 query terms) and rounds the result down:

GET /docs/doc/_search
{
  "query": {
    "match": {
      "body": {
        "query": "small fish facebook",
        "minimum_should_match": "66%"
      }
    }
  }
}

If you increase the percentage to 67%, the required term count goes up (0.67 × 3 = 2.01, rounded down to 2) and no document will match.
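
You can verify this yourself: this variant of the previous query should return no hits on the sample data:

GET /docs/doc/_search
{
  "query": {
    "match": {
      "body": {
        "query": "small fish facebook",
        "minimum_should_match": "67%"
      }
    }
  }
}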

match_phrase

When you want to find words in an exact order you can use phrase matching. Vanilla phrase matching will only find documents with the exact word order. This example yields no results, as our "nemo" document contains "little fish named Nemo":

GET /docs/doc/_search
{
  "query": {
    "match": {
      "body": {
        "query": "little fish nemo",
        "type":  "phrase"
      }
    }
  }
}
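
To see why, you can inspect the token positions with the _analyze API (the response lists each token together with its position; the exact format varies by version). "named" sits between "fish" and "nemo", so the exact phrase "little fish nemo" cannot line up:

GET /_analyze?analyzer=standard&text=little fish named nemo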

You may not always want 100% identical wording. Adding the slop attribute allows results like "fish little nemo", "little nemo fish" and "little fish xxx nemo" to be matched as well. The slop value represents how far apart terms are allowed to be while still considering the document a match: higher values are more tolerant. Now we get a result:

GET /docs/doc/_search
{
  "query": {
    "match": {
      "body": {
        "query": "little fish nemo",
        "type":  "phrase",
        "slop": 1
      }
    }
  }
}

This query can also be written this way:

GET /docs/doc/_search
{
  "query": {
    "match_phrase": {
      "body": {
        "query": "little fish nemo",
        "slop": 1
      }
    }
  }
}

Multiple fields queries

Use the multi_match query. This is a simple example:

GET /docs/doc/_search
{
  "query": {
    "multi_match": {
      "fields": ["title", "body", "descriptions"],
      "query": "nemo facebook twitter"
    }
  }
}
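
multi_match also accepts per-field boosts with the caret syntax; in this sketch, matches on title count twice as much as the other fields when scoring:

GET /docs/doc/_search
{
  "query": {
    "multi_match": {
      "fields": ["title^2", "body", "descriptions"],
      "query": "nemo facebook twitter"
    }
  }
}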

Query words are ORed by default; if you want to AND them, add the operator attribute. Since the query words are now ANDed, no result will be returned:

GET /docs/doc/_search
{
  "query": {
    "multi_match": {
      "fields": ["title", "body", "descriptions"],
      "query": "nemo facebook twitter",
      "operator": "and"
    }
  }
}
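
If and is too strict, minimum_should_match works here too; this sketch requires any 2 of the 3 words instead of all of them:

GET /docs/doc/_search
{
  "query": {
    "multi_match": {
      "fields": ["title", "body", "descriptions"],
      "query": "nemo facebook twitter",
      "minimum_should_match": 2
    }
  }
}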

You can use wildcards to pick fields. Let's consider the following example, which doesn't use wildcards and yields no result, as the title field is not analyzed with the english analyzer:

GET /docs/doc/_search
{
  "query": {
    "multi_match": {
      "fields": ["title", "body"],
      "query": "find"
    }
  }
}

This query, on the other hand, uses a wildcard to include the title.en field as well, so the query yields the usual "Finding Nemo" result:

GET /docs/doc/_search
{
  "query": {
    "multi_match": {
      "fields": ["title*", "body"],
      "query": "find"
    }
  }
}

Nice to know: the "_all" field

_all is a special field that gets populated at index time for each inserted record. It concatenates all the data contained in the other fields into the single "_all" attribute. By default its text is analyzed with the standard analyzer. It can be used in queries as well, as a quick & dirty substitute for multi-field queries:

GET /docs/doc/_search
{
  "query": {
    "match": {
      "_all": "finding"
    }
  }
}

This field can be disabled to save disk space/RAM, or customized for special needs.
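
Disabling it happens at mapping creation time; a sketch with the 1.x mapping syntax, assuming you are creating the index from scratch:

PUT /docs
{
  "mappings": {
    "doc": {
      "_all": { "enabled": false }
    }
  }
}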

-result limit (start,count), if applicable

GET docs/doc/_search
{
  "from": 1,
  "size": 2
}

size is the count, from is the start offset (zero-based, so from: 1 skips the first result)

Here's a more descriptive query. The must_not filter matches all the currently indexed documents; we then pick 2 of them, leaving out the first:

GET /docs/_search
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must_not": {"term" : {"title": "facebook"}}
        }
      }
    }
  },
  "size": 2,
  "from": 1
}

-location + max. distance

First, let's add some geo data to the existing records. We're going to use partial updates to avoid retyping all documents:

POST /docs/doc/1/_update
{
  "doc": {
    "location": {
      "lat": 45.490946,
      "lon": 9.206543
    }
  }
}

POST /docs/doc/2/_update
{
  "doc": {
    "location": {
      "lat": 46.490946,
      "lon": 10.206543
    }
  }
}

POST /docs/doc/3/_update
{
  "doc": {
    "location": {
      "lat": 76.490946,
      "lon": 30.206543
    }
  }
}
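
Note: for the geo queries below to work, the location field must be mapped as a geo_point before the coordinates are indexed; a sketch with the 1.x mapping API:

PUT /docs/_mapping/doc
{
  "properties": {
    "location": { "type": "geo_point" }
  }
}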

Let's look for things within 100km distance:

GET /docs/doc/_search
{
  "query": {
    "filtered": {
      "filter": {
        "geo_distance": {
          "distance": "100km",
          "location": {
            "lat": 45,
            "lon": 10
          }
        }
      }
    }
  }
}

This is a filter, but unlike most other filters it is not cached by default. Why? Because "location" is very likely to change with each request, making caching worthless. You can still enable this kind of caching if the "location" coordinates are consistent across your queries.
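
In 1.x-era Elasticsearch that opt-in is the _cache flag on the filter; a sketch, only worthwhile if the coordinates really do repeat across requests:

GET /docs/doc/_search
{
  "query": {
    "filtered": {
      "filter": {
        "geo_distance": {
          "distance": "100km",
          "location": {
            "lat": 45,
            "lon": 10
          },
          "_cache": true
        }
      }
    }
  }
}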

You can see result distances from given coordinates using sort:

GET /docs/doc/_search
{
  "query": {
    "filtered": {
      "filter": {
        "geo_distance": {
          "distance": "120km",
          "location": {
            "lat": 46,
            "lon": 10
          }
        }
      }
    }
  },
  "sort": [
    {
      "_geo_distance": {
        "location": {
          "lat":  46,
          "lon": 10
        },
        "order":         "asc",
        "unit":          "km",
        "distance_type": "plane"
      }
    }
  ]
}

You can use one geo point to select results and another one for sorting purposes. order and unit should be self-explanatory. distance_type is the algorithm used for the calculations: plane is the fastest but quite inaccurate (though acceptable for short distances), sloppy_arc is the default, and arc is the slowest but most accurate.

-Can 'slop >= 1' be used without impact on matching speed?

Of course slop (a proximity query) will impact speed.
Some benchmarks show that slop roughly doubles the time of a match_phrase query… but it may still be fast enough for you (doubling 1 ms is no problem; doubling 1 second is likely an unacceptable delay).

Look at this example:

GET /docs/doc/_search
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            { "term": { "body": "little"}},
            { "term": { "body": "fish"}},
            { "term": { "body": "nemo"}}
          ]
        }
      }
    }
  },
  "rescore": {
    "window_size": 10, 
    "query": {         
      "rescore_query": {
        "match_phrase": {
"body": {
            "query": "little nemo fish",
            "slop": 10
          }
        }
      }
    }
  }  
}

The first part of the query filters results that include all the words "nemo", "fish", "little" in the body, without considering their position. term is about 10 times faster than a match_phrase query, so it's very convenient for filtering out results.

The second part of the query recalculates the score (and hence the ordering) of the results, picking only the first 10. They are scored according to how well they match the phrase "little nemo fish", while allowing many similar phrases to stay in, given the high slop value.
Whenever speed/resources are a constraint, try to use filters to restrict the resultset first, and then apply the expensive stuff.


Is it reasonable to filter locations in default mode, but sort in fast 'plane' mode, as shown in the example?

In the example I wanted to show how to change the precision level; it was not meant as an optimization. But it could be one. In my opinion the speed improvement would be unnoticeable when a faster mode is used at sort time, which operates on an already-selected resultset, while it could be more useful at query time, where you still have to consider all the records.
I'd suggest going with the default/higher precision everywhere and then, if queries get slow or you want to improve performance, switching to a lower precision mode.
