Last active December 22, 2015 15:49
# Elasticsearch grouping solution
# Elasticsearch does not currently provide a group_by equivalent, so here's my attempt to do it manually.
# In the example we have articles written by some authors, and I'd like to get relevant docs, but no more than one per author.
# Assumptions:
# 1) I'm looking for relevant content.
# 2) I've assumed that the first 300 docs are relevant,
#    so I consider only this selection, even though many of them may come from the same few authors.
# 3) For my needs I didn't really need pagination; a "show more" button updated through ajax was enough.
`curl -X DELETE "http://localhost:9200/articles"
curl -X PUT "http://localhost:9200/articles" -d '{
  "settings": {
    "index": {
      "number_of_shards": 1, "number_of_replicas": 0
    }
  }
}'
curl -X POST "http://localhost:9200/articles/article" -d '{ "id": 111, "author_id": "user_1", "title": "One bad doc", "findable": true }'
curl -X POST "http://localhost:9200/articles/article" -d '{ "id": 222, "author_id": "user_2", "title": "Two bad doc", "findable": true }'
curl -X POST "http://localhost:9200/articles/article" -d '{ "id": 333, "author_id": "user_3", "title": "Three good doc", "findable": true }'
curl -X POST "http://localhost:9200/articles/article" -d '{ "id": 444, "author_id": "user_1", "title": "Four good doc", "findable": true }'
curl -X POST "http://localhost:9200/articles/article" -d '{ "id": 555, "author_id": "user_2", "title": "Five good doc", "findable": true }'
curl -X POST "http://localhost:9200/articles/article" -d '{ "id": 666, "author_id": "user_1", "title": "Six good doc", "findable": true }'
curl -XPOST 'http://localhost:9200/articles/_refresh'`
# # Raw test of our query
# curl -X POST "http://localhost:9200/articles/_search?pretty=true" -d '{
# "query": {
# "bool":{
# "must":[{ "query_string":{ "query":"doc", "default_operator":"AND" } }],
# "should":[{ "query_string":{ "query":"user_2", "default_operator":"AND", "boost":2000 } }]
# }
# },
# "filter": { "and": [{ "term": { "findable": "true" } }] },
# "facets": {
# "tags": { "terms": {"field": "owner", "size": 10} }
# }
# }'
params_start_from = 0
per_page = 3
my_query = {
  bool: {
    must: [{ query_string: { query: "doc", default_operator: "AND" } }],
    should: [{ query_string: { query: "user_2", default_operator: "AND", boost: 2000 } }]
  }
}
my_and_filters = [
  { term: { findable: "true" } }
]
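# FIRST QUERY - find all relevant ids
# Tire.search is assumed here, since the thread below refers to the Tire gem.
require 'tire'
all_res = Tire.search 'articles', query: my_query,
  filter: { :and => my_and_filters },
  fields: ['id', 'author_id'],
  size: 300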
# Keep only the best-ranked doc per author (uniq keeps the first occurrence).
docs = all_res.results.to_a.uniq { |el| el['author_id'] }
@total_results_non_unique = all_res.results.to_a.size # assumed: hit count before de-duplication
@total_results = docs.size # <-- instance variable
start_from = params_start_from.to_i # should always be < Settings.research.max_results
docs = docs[ start_from .. start_from + per_page - 1 ]
doc_ids = docs.nil? ? [] : docs.map { |doc| doc['id'] }
# Restrict the next query to the ids of the current page.
my_and_filters << { ids: { values: doc_ids } } # TODO: move :highlight to Part 1 and query only by :id
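# SECOND QUERY - fetch only the selected ids, with highlights (Tire.search assumed)
res = Tire.search 'articles', query: my_query,
  filter: { :and => my_and_filters },
  highlight: {
    fields: { title: {} }
  },
  size: per_page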

Hey @saxxi, if you don't mind, we can continue your problem here?

I know both ruby and tire so I can discuss this gist right away.

saxxi commented Sep 11, 2013

yes we can skype too if you like, I'm "aditonskype"

saxxi commented Sep 11, 2013

As I've commented in my own Stack Overflow question, this solution has some drawbacks.

Let's examine how it works.

I assume that in my case an "author" (it's a fake example) has on average 1.8 documents. I'll also set a physical limit of, say, 10 per author.

  1. first the engine finds 300 docs
    • find only :id and :author_id
    • probably get also the :_highlight field
  2. in Ruby, group by author_id (uniq) and slice out the page we're interested in (pagination)
  3. with this array -> query elasticsearch and have final results
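
Step 2 can be tried out without a running cluster. The following is a minimal sketch in plain Ruby, assuming the hits arrive already sorted by relevance; the sample `hits` and the helper name `page_ids_for` are made up for illustration and are not part of the gist or of any library.

```ruby
# Hypothetical hits, already sorted by relevance (best first), as the
# first query would return them with only id and author_id loaded.
hits = [
  { 'id' => 555, 'author_id' => 'user_2' },
  { 'id' => 222, 'author_id' => 'user_2' },
  { 'id' => 333, 'author_id' => 'user_3' },
  { 'id' => 444, 'author_id' => 'user_1' },
  { 'id' => 666, 'author_id' => 'user_1' },
  { 'id' => 111, 'author_id' => 'user_1' }
]

# Keep the best hit per author, then slice out one page of ids.
# page_ids_for is an illustrative helper, not part of any library.
def page_ids_for(hits, start_from, per_page)
  unique = hits.uniq { |hit| hit['author_id'] } # uniq keeps the first occurrence
  page   = unique[start_from, per_page] || []   # nil when start_from is past the end
  { total: unique.size, ids: page.map { |hit| hit['id'] } }
end

first_page  = page_ids_for(hits, 0, 2)
second_page = page_ids_for(hits, 2, 2)
```

Because `uniq` keeps the first element it sees for each author, the surviving document is the best-ranked one, as long as the input order is the relevance order.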

Probably in step 3 we could even use the standard ActiveRecord find method (I use Mongoid, but still).
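
A minimal sketch of that idea, with assumptions stated up front: `store` is a hypothetical in-memory stand-in for the database, and `fetch_by_ids` stands in for whatever find-by-ids call the ORM offers. Neither is a real Mongoid or ActiveRecord API. The point it illustrates is that a lookup by ids may not return records in the order the ids were given, so the relevance order has to be restored by hand. A database lookup also returns no highlights.

```ruby
# Hypothetical stand-in for the database: records keyed by id.
store = {
  333 => { 'id' => 333, 'title' => 'Three good doc' },
  444 => { 'id' => 444, 'title' => 'Four good doc' },
  555 => { 'id' => 555, 'title' => 'Five good doc' }
}

# Stand-in for a find-by-ids call; returns records in storage order,
# not in the order the ids were given.
def fetch_by_ids(store, ids)
  store.values.select { |rec| ids.include?(rec['id']) }
end

# Re-sort the fetched records to follow the relevance order of doc_ids.
# Ids with no matching record are dropped.
def in_relevance_order(store, doc_ids)
  by_id = fetch_by_ids(store, doc_ids).each_with_object({}) do |rec, acc|
    acc[rec['id']] = rec
  end
  doc_ids.map { |id| by_id[id] }.compact
end

ordered = in_relevance_order(store, [555, 333, 444])
```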


Please consider the following two options:

  1. use the _highlight field in the first query (BUT have 300 _highlight texts travelling over the network)
  2. or use the full query + _highlight field in the second query (BUT process a complex query twice in full)

Which would you consider the better option? I know, benchmarking is our friend, isn't it?! :D

saxxi commented Sep 11, 2013

Now the gist is fully working

I guess the concept seems to work (steps 1, 2, 3)?

Step 1 can be replaced by doing a faceting query instead.

saxxi commented Sep 12, 2013

This is a satisfactory solution for the time being (it's on the staging env), but I would be glad to improve it. Could you please give a more detailed hint on step 1? Thank you

If you use a terms facet, you can get the list of all author_id values sorted by some condition, so you don't have to limit yourself to 300 docs.
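
A minimal sketch of what such a request body could look like, built as a plain Ruby hash and serialized to JSON. The facet name `authors` and the helper `author_facet_body` are arbitrary choices for this example. As far as I know, a terms facet orders its entries by count or by term rather than by relevance score, so it yields the set of matching authors and their document counts, not each author's best document.

```ruby
require 'json'

# Request body for a terms facet over author_id, restricted to the same
# query. "authors" is an arbitrary facet name chosen for this sketch.
def author_facet_body(query, max_authors)
  {
    query:  query,
    size:   0, # hits are not needed here, only the facet counts
    facets: {
      authors: { terms: { field: 'author_id', size: max_authors } }
    }
  }
end

query = { query_string: { query: 'doc', default_operator: 'AND' } }
body  = JSON.generate(author_facet_body(query, 10))
```

The resulting string can be sent as the body of a POST to `http://localhost:9200/articles/_search`, in the same way as the raw curl test in the gist.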
