@pmeskers
Last active November 7, 2016 22:28

Let's say we have an index filled with publisher documents. A publisher has a collection of books, and each book has a name, a published flag, and a collection of genre_scores. A genre_score represents how well a particular book matches a particular genre, identified here by a genre_id.

First, let's define some mappings (for simplicity, we will only be explicit about the nested types):

curl -XPUT 'localhost:9200/book_index' -d '
  {
    "mappings": {
      "publisher": {
        "properties": {
          "books": {
            "type": "nested",
            "properties": {
              "genre_scores": {
                "type": "nested"
              }
            }
          }
        }
      }
    }
  }'

Here are our two publishers:

curl -XPUT 'localhost:9200/book_index/publisher/1' -d '
  {
    "name": "Best Books Publishing",
    "books": [
      {
        "name": "Published with medium genre_id of 1",
        "published": true,
        "genre_scores": [
          { "genre_id": 1, "score": 50 },
          { "genre_id": 2, "score": 15 }
        ]
      }
    ]
  }'
  
curl -XPUT 'localhost:9200/book_index/publisher/2' -d '
  {
    "name": "Puffin Publishers",
    "books": [
      {
        "name": "Published book with low genre_id of 1",
        "published": true,
        "genre_scores": [
          { "genre_id": 1, "score": 10 },
          { "genre_id": 4, "score": 10 }
        ]
      },
      {
        "name": "Unpublished book with high genre_id of 1",
        "published": false,
        "genre_scores": [
          { "genre_id": 1, "score": 100 },
          { "genre_id": 2, "score": 35 }
        ]
      }
    ]
  }'

And here is the final definition of our index & mappings...

curl -XGET 'localhost:9200/book_index/_mappings?pretty=true'
...
{
  "book_index": {
    "mappings": {
      "publisher": {
        "properties": {
          "books": {
            "type": "nested",
            "properties": {
              "genre_scores": {
                "type": "nested",
                "properties": {
                  "genre_id": {
                    "type": "long"
                  },
                  "score": {
                    "type": "long"
                  }
                }
              },
              "name": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "published": {
                "type": "boolean"
              }
            }
          },
          "name": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}

Now suppose we want to query for a list of publishers and sort them by how well their books perform in a particular genre. In other words, sort the publishers by the books.genre_scores.score of one of their books for the target genre_id.

We might write a search query like this...

curl -XGET 'localhost:9200/book_index/_search?pretty=true' -d '
  {
    "size": 5,
    "from": 0,
    "sort": [
      {
        "books.genre_scores.score": {
          "order": "desc",
          "nested_path": "books.genre_scores",
          "nested_filter": {
            "term": {
              "books.genre_scores.genre_id": 1
            }
          }
        }
      }
    ],
    "_source":false,
    "query": {
      "nested": {
        "path": "books",
        "query": {
          "bool": {
            "must": []
          }
        },
        "inner_hits": {
          "size": 5,
          "sort": []
        }
      }
    }
  }'

This correctly returns Puffin first (with a sort value of [100]) and Best Books second (with a sort value of [50]).

But suppose we only want to consider books for which published is true. We would then expect Best Books first (with a sort value of [50]) and Puffin second (with a sort value of [10]).

Let's update our nested_filter and query to the following...

curl -XGET 'localhost:9200/book_index/_search?pretty=true' -d '
  {
    "size": 5,
    "from": 0,
    "sort": [
      {
        "books.genre_scores.score": {
          "order": "desc",
          "nested_path": "books.genre_scores",
          "nested_filter": {
            "bool": {
              "must": [
                {
                  "term": {
                    "books.genre_scores.genre_id": 1
                  }
                }, {
                  "term": {
                    "books.published": true
                  }
                }
              ]
            }
          }
        }
      }
    ],
    "_source": false,
    "query": {
      "nested": {
        "path": "books",
        "query": {
          "term": {
            "books.published": true
          }
        },
        "inner_hits": {
          "size": 5,
          "sort": []
        }
      }
    }
  }'

Suddenly, the sort values for both publishers have become [-9223372036854775808].
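For reference, -9223372036854775808 is the smallest value a signed 64-bit integer can hold, which is the range of the long type that score is mapped to. The snippet below only confirms that arithmetic, assuming a shell with 64-bit integer arithmetic such as bash. Reading the value as a placeholder for "no nested document matched the filter" is an assumption on my part, not something the response states.

```shell
# Smallest value of a signed 64-bit integer (the same as Java's Long.MIN_VALUE).
# Written as (-MAX - 1) so the literal itself never overflows while being parsed.
min_long=$(( -9223372036854775807 - 1 ))
echo "$min_long"
```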

Why does adding an additional term to our nested_filter in the top-level sort have this impact?
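One way to narrow this down is to test each filter clause on its own. The sketch below is a hypothetical diagnostic, not a fix: it reuses the sort from above but keeps only the books.published clause in nested_filter. It rests on the assumption that nested_filter is evaluated against the nested documents at nested_path (here books.genre_scores), which carry genre_id and score but no published field. The names DIAG_QUERY and run_diag are made up for this example.

```shell
# Hypothetical diagnostic: same sort as above, but nested_filter contains
# ONLY the books.published clause, to see whether that clause matches anything.
DIAG_QUERY='
  {
    "size": 5,
    "_source": false,
    "sort": [
      {
        "books.genre_scores.score": {
          "order": "desc",
          "nested_path": "books.genre_scores",
          "nested_filter": {
            "term": {
              "books.published": true
            }
          }
        }
      }
    ],
    "query": {
      "match_all": {}
    }
  }'

# Sends the diagnostic query to the local node used throughout this gist.
run_diag() {
  curl -XGET 'localhost:9200/book_index/_search?pretty=true' -d "$DIAG_QUERY"
}

# Call run_diag once the index and documents from above are in place.
```

If this also returns [-9223372036854775808] for both publishers, that would suggest the books.published clause matches nothing at that nested level, which would explain why combining it with the genre_id clause under must leaves no candidates to sort on. This is untested.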

@rmboyle

rmboyle commented Nov 7, 2016

Still having an issue getting this up and running, but have you looked into adding a :boost => 0 to each query?

https://www.elastic.co/guide/en/elasticsearch/guide/1.x/query-time-boosting.html

I believe this should resolve the issue.

@pmeskers
Author

pmeskers commented Nov 7, 2016

The documentation does imply that any clause without a boost should default to 1. I have tried the following...

Running it on Elasticsearch 1.7 (which requires the boost be moved to the query and not the sort):

curl -XGET 'localhost:9200/book_index/_search?pretty=true' -d '
  {
    "size": 5,
    "from": 0,
    "sort": [
      {
        "books.genre_scores.score": {
          "order": "desc",
          "nested_path": "books.genre_scores",
          "nested_filter": {
            "bool": {
              "must": [
                {
                  "term": {
                    "books.genre_scores.genre_id": {
                      "value": 1
                    }
                  }
                }, {
                  "term": {
                    "books.published": {
                      "value": true
                    }
                  }
                }
              ]
            }
          }
        }
      }
    ],
    "_source": false,
    "query": {
      "nested": {
        "path": "books",
        "query": {
          "term": {
            "books.published": {
              "value": true,
              "boost": 0
            }
          }
        },
        "inner_hits": {
          "size": 5,
          "sort": []
        }
      }
    }
  }'

And I am still getting sort values of [-9223372036854775808], unfortunately.
