Skip to content

Instantly share code, notes, and snippets.

@reiddraper
Created May 17, 2012 20:08
Show Gist options
  • Star 3 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save reiddraper/2721295 to your computer and use it in GitHub Desktop.
Save reiddraper/2721295 to your computer and use it in GitHub Desktop.

Riak Users,

I've been working on some new features for Riak's secondary indexes (2i), and I wanted to get some feedback from the community before proceeding much further.

How does 2i work currently?

2i lets you query index values with an equality query, or with an inclusive range query. The results are not sorted in any way.

Basic proposed changes

  1. Results will now be sorted by [IndexValue, PrimaryKey] in ascending-order. I'm unsure if supporting descending-order queries is important-enough to do now, or add in later.

  2. Queries can specify a limit for the number of results to return. As expected, the limit is applied after sorting

  3. Queries can constrain the results returned based on the index-value being eq (equal), lt (less than), lte (less-than-or-equal), gt or gte. Queries can also provide no constraints, which will return the whole index (before limiting that is).

  4. Queries can "continue" where a previous query left-off. This provides a limited form of pagination. Note, there is no snapshotting or isolation between queries using a continuation. The continuations have no server-side state, and last forever.

Examples

Queries are described as JSON data-structures. The query is for a specific Bucket/Index, which is specified either in the path (for HTTP) or in the PB message.

// return the first 100 keys 
// whose index value is
// greater-than-or-equal to "bar" and
// less-than "foo"
{
    "limit": 100, 
    "query": {
        "gte": "bar", 
        "lt": "foo"
    }
}

// this query will return a continuation that will
// let you ask for more results, starting at
// the 101st value
// return all the keys whose index
// value is greater-than 50
{
    "query": {
        "gt": 50, 
    }
}
// return all of the keys whose
// index value is equal to 33
{
    "query": {
        "eq": 33
    }
}
// ask for 100 more results using a continuation
// from a previous query. Note, requests with
// continuations should _not_ include a query
// section, as that is wrapped up in the continuation
// already
{
    "continuation": "a85hYGBgymBKYWBKLcwFshklQRyWnMTiErAUR", 
    "limit": 100
}

Questions

  1. Will it help you do things with 2i that you wanted to do but previously weren't able?

  2. How important is it to be able to specify the sort oder (ASC vs. DESC)?

  3. Any other thoughts are appreciated as well.

@peschkaj
Copy link

This would definitely be very helpful in building rich applications with Riak without having to resort to some middle ware to apply sorting across an intermediate result set.

Regarding ordering - being able to specify sort order would be a huge boon, although I can only begin to imagine how difficult this would be in a distributed system and the latency that might introduce into a system.

Otherwise, ignoring the implementation details, I'd love to see ordering on either the index, the query, or both.

Are strings sorted ASCII-betically? I ask because ASCIIbetical sorting would make it easy to do "begins with" type queries, which is often a common case for querying - e.g. with an index on a properly ISO formatted timestamp, I can easily find and aggregate at various levels.

@batasrki
Copy link

Ordering of any kind will help greatly, in my experience. It'd be nice to have the descending ordering, as well as ascending, but if the latter is easier, I'm OK with punting on the former right now.

Is there any reason why the 2i stuff is represented as a JSON? I guess, coming from Ruby, I can pass in a Hash that'll get serialized into JSON, but is this an efficient method for other languages, especially ones with more ceremony like Java?

Other than that, it looks great. I very much like the continuation option. I need to experiment with the current 2i implementation to see how it behaves with dates before making a comment on that.

@kevburnsjr
Copy link

All the features I ever wanted!

If you really need desc order you could just add another index equal to 0 minus the first index. Donezo.

@reiddraper
Copy link
Author

Regarding ordering - being able to specify sort order would be a huge boon, although I can only begin to imagine how difficult this would be in a distributed system and the latency that might introduce into a system.

The individual vnodes are sorted, so the implementation does a streaming merge-sort, with backpressure. That being said, I haven't really pushed it's limits yet...

Otherwise, ignoring the implementation details, I'd love to see ordering on either the index, the query, or both.

Since the values are sorted on disk by [IndexValue, PrimaryKey], that's by far the most efficient sorting to do.

Are strings sorted ASCII-betically? I ask because ASCIIbetical sorting would make it easy to do "begins with" type queries, which is often a common case for querying - e.g. with an index on a properly ISO formatted timestamp, I can easily find and aggregate at various levels.

Binary (or string) indexes are just treated as blobs, so they sort in "binary order". Effectively this means lexicographically.


Is there any reason why the 2i stuff is represented as a JSON? I guess, coming from Ruby, I can pass in a Hash that'll get serialized into JSON, but is this an efficient method for other languages, especially ones with more ceremony like Java?

I personally prefer queries as data-structures > strings. I'd imagine the Java interface might be something like:

query = new Query().withGt("foo").withLte("zebra").withLimit(100).submit();

but I'm not a Java dev :)

@ericmoritz
Copy link

Is there a dev branch in github? I'd love to try it out with something I'm building.

@jeraymond
Copy link

Something that would help me a lot would be the ability to combine index queries, to be able to get values that match one index AND another index OR another index. That combined with pagination and sorting would be useful. Combined index queries you can't do yourself, you have to do multiple separate queries. Things like sorting and limiting the number of returned results you can do with reduces phases that are fed from an index lookup.

@reiddraper
Copy link
Author

Is there a dev branch in github? I'd love to try it out with something I'm building.

There is, though it currently only does a subset of what I've talked about here. It was mostly a proof-of-concept of the streaming merge-sort implementation. The branch is here.


Something that would help me a lot would be the ability to combine index queries

Agreed. I've tried to not bite off too much with this project, and currently multi-index queries has fallen on the "future" side of the fence. It's worth noting that Riak Search does let you do multi-index queries now.

Combined index queries you can't do yourself,

There are some hacks you can do, by generating your own compound 2i indexes. You can abuse binary indexes by packing your own compound index in there. If people are interested, I could write something up about how to do this.

@ddosia
Copy link

ddosia commented May 18, 2012

And how about result limits? Since all data is located on different vnodes you cant effectively stop searching next results when you are done.

@jeraymond
Copy link

There are some hacks you can do, by generating your own compound 2i indexes. You can abuse binary indexes by packing your own compound index in there. If people are interested, I could write something up about how to do this.

Yea this is what I do now, make a compound index out of two distinct indexes. But then I end up with a third index taking up more resources across the cluster; the clients need to know how to combine separate indexes into these combined ones to do lookups; and every time I need a new combined index for a lookup I need to reindex millions of items.

Copy link

ghost commented May 18, 2012

Rather like jeraymond I would find being able to query on more than one index more useful than sorting - as the hacks to get around multiple index queries are more unpleasant than applying a sort myself - not that being able to sort Riak-side is unwelcome!

@reiddraper
Copy link
Author

And how about result limits? Since all data is located on different vnodes you cant effectively stop searching next results when you are done.

@ddosia The coordinating FSM simply stops the vnode folds when enough results have been accumulated.


@jeraymond, @Bufferine I agree multi-index queries would be useful. They're unfortunately outside the scope of this project, but rest-assured we're thinking about them for the future. In the meantime, Riak Search might fit your use-case.

Copy link

ghost commented May 18, 2012

Ah well can't have everything all at once :)

Yep, using those, as well as some creative secondary indexing where it's appropriate.

@armon
Copy link

armon commented May 18, 2012

It would be incredibly nice to have an option of "materializing" the results and getting the actual objects back instead of just the keys. This would allow us to avoid an additional M/R step.

@reiddraper
Copy link
Author

@armon Yes, that's very much on our radar as well :)

@eliaslevy
Copy link

I am late to this party, but descending sorting would be rather useful. Many online applications make queries that ask for the latest N items. For that to work efficiently with a limit you need a descending sort order. There are ways to get around it. You can issue multiple bracketed queries, but that is inefficient. Or you can rewrite your index value so that your lower values become your highest and vice versa. But it would be great to have native support.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment