@PharkMillups
Created October 6, 2010 17:31
16:08 <grantr> hey guys me again
16:08 <grantr> i've been thinking about how to implement a queue in riak (or any similar kv store)
16:08 <grantr> any advice on that?
16:09 <jdmaturen> afaict there are no great ways to concurrently read/write a given key [queue]. /me waves hands at all the MQs out there
16:09 <grantr> trying to figure out how to structure it so an add or remove can be a single write
16:10 <benblack> with an unmodified store, you can't.
16:10 <grantr> ah
16:11 <benblack> you could do something where you just wrote stuff in, then fixed it up on read
16:12 <benblack> but it would not have very consistent performance
16:13 <grantr> you'd have to do a lot of list_keys or mapred
16:13 <benblack> no
16:13 <benblack> you'd allow_mult, retrieve all siblings on read, and write back an ordered set of linked documents
16:13 <benblack> or similar
16:14 <grantr> oh i never thought of that
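A minimal sketch of the "write blind, fix up on read" approach benblack describes, assuming a KV store with sibling support (like Riak with allow_mult=true). The client object and its put/get_all_siblings calls are hypothetical stand-ins, not a real Riak client API:

```python
# Sketch of a sibling-based queue on a KV store that keeps concurrent
# writes as siblings (e.g. Riak with allow_mult=true). The client API
# here (put / get_all_siblings) is hypothetical, not a real Riak client.

import time
import uuid


class SiblingQueue:
    def __init__(self, client, bucket, key):
        self.client = client          # hypothetical KV client
        self.bucket = bucket
        self.key = key

    def enqueue(self, item):
        # Concurrent enqueues write to the same key without reading first;
        # with allow_mult the store keeps each blind write as a sibling.
        entry = {"id": str(uuid.uuid4()), "ts": time.time(), "item": item}
        self.client.put(self.bucket, self.key, [entry])

    def read_and_fix_up(self):
        # Fetch every sibling, merge into one ordered set, and write the
        # merged value back (resolving the siblings) before returning.
        siblings = self.client.get_all_siblings(self.bucket, self.key)
        merged, seen = [], set()
        for value in siblings:              # each sibling is a list of entries
            for entry in value:
                if entry["id"] not in seen:
                    seen.add(entry["id"])
                    merged.append(entry)
        merged.sort(key=lambda e: (e["ts"], e["id"]))   # best-effort ordering
        self.client.put(self.bucket, self.key, merged)  # resolve siblings
        return [e["item"] for e in merged]
```

As benblack notes, the fix-up cost lands on the reader, so read latency varies with how many unresolved siblings have piled up since the last read.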
16:14 <benblack> you'd be better off writing something for the purpose on riak_core, not on _kv
16:14 <grantr> yeah i had a crazy idea
16:14 <grantr> seems like it would be possible (though maybe not prudent) to wrap rabbit's queue implementation in riak core
16:15 <grantr> you would probably lose strict ordering
16:15 <benblack> been discussed.
16:16 <grantr> any conclusions?
16:16 <benblack> yes, it's hard.
16:16 <grantr> damn ;)
16:16 <grantr> hard to maintain queue semantics?
16:18 <grantr> on the face of it, it seems relatively easy - routing table maintained by the ring, each vnode is assigned a queue or some shard of a queue (number of shards decided at queue creation time)
16:19 <benblack> yeah, so, if you could never say "shard" again in the context of a dynamo system that'd be awesome
16:20 <grantr> isn't that sort of how vnodes work though? segment of the keyspace == shard
16:20 <benblack> no
16:20 <benblack> it is really not
16:20 <benblack> those are partitions
16:20 <benblack> and they are integral to the design
16:20 <benblack> not bolted on later
16:20 <grantr> integral to the design of the vnode implementation?
16:20 <benblack> integral to the design of dynamo
16:21 <benblack> as in, it is even lower level than the storage system
16:21 <grantr> hmmm
16:21 <benblack> that's how riak_core can exist
16:21 <grantr> i should read the paper again
16:21 <benblack> you can shard mysql, right?
16:22 <grantr> i guess i had this idea that it's basically consistent hashing like with memcached, but managed by the cluster instead of the client
16:22 <grantr> right
16:22 <benblack> ok, or memcache consistent hashing. what is the equivalent of riak_core in those sharded setups?
16:22 <grantr> gizzard i suppose. some proxy that sits in between the client and the nodes, and decides which node the request goes to
16:23 <benblack> correct, the answer is "there is no equivalent"
16:23 <benblack> you bolt some stuff on and hope for the best
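A minimal sketch of the client-side consistent hashing grantr describes for memcached-style setups; the node names are made up. Each key maps to the first ring point clockwise from its hash, so no central routing table is needed - but note there is no membership, handoff, or rebalancing here, which is exactly the gap benblack says riak_core fills:

```python
# Minimal client-side consistent hash ring, memcached-style.
# Node names are illustrative; there is no membership protocol --
# the client just hashes keys onto a static ring.

import bisect
import hashlib


def _hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, points_per_node=64):
        # Place several points ("virtual nodes") per physical node so
        # load spreads more evenly around the ring.
        self._ring = sorted(
            (_hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(points_per_node)
        )
        self._hashes = [h for h, _ in self._ring]

    def node_for(self, key: str) -> str:
        # A key belongs to the first ring point clockwise from its hash.
        i = bisect.bisect(self._hashes, _hash(key)) % len(self._ring)
        return self._ring[i][1]


ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("queue/42"))   # deterministic, no coordinator needed
```

Adding or removing a node here means every client must learn the new ring out of band; in a dynamo system the ring, its membership, and its reconfiguration are part of the system itself.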
16:24 <grantr> what is riak core giving you that's more than a consistent hashing proxy then?
16:24 <grantr> hinted handoff?
16:24 <benblack> it's not a proxy
16:24 <benblack> group membership, partitioning, hh, yes
16:24 <benblack> online reconfiguration and redistribution
16:24 <benblack> and, again, it isn't a proxy
16:24 <grantr> shared state
16:25 <benblack> the nodes are first class actors in the system
16:25 <benblack> aware of each other
16:25 <benblack> shards are generally not aware of each other
16:26 <benblack> the mongo replica sets stuff is similarly bolted on. very limited set of nodes in a set, weird constraints because of their simplistic leader election, and, again, sets are not aware of other sets.
16:27 <grantr> i see
16:28 <benblack> witness the 4sq outage report
16:28 <benblack> the glory of autosharding in full flower
16:28 <grantr> i didn't even hear about that, looking
16:29 <benblack> http://blog.foursquare.com/2010/10/05/so-that-was-a-bummer/
16:33 <grantr> oh interesting, they're using mongo
16:35 <grantr> so how could that have been avoided with a dynamo system? seems like a case of unbalanced load, which can happen in dynamo. i guess dynamo systems would be easier to rebalance?
16:36 <benblack> you add more nodes, partitions are moved
16:36 <pharkmillups> grantr: i can only speak for Riak, but the..
16:36 <benblack> there's no special case
16:36 <pharkmillups> yeah, what benblack said
16:37 <benblack> pharkmillups has returned, i leave you in his capable hands
16:37 <grantr> in riak that's true, in cassandra there is a manual step. but essentially the same thing. and you get that because rebalancing is built into the system
16:37 <grantr> thanks benblack
16:38 <grantr> i always learn something in this channel!
16:39 <grantr> i should reread the dynamo paper with queues in mind
16:39 <grantr> maybe there is a difficulty with partitioning that i'm missing
16:41 <benblack> it is the eventual consistency that bites you on this one, generally, rather than partitioning.
16:42 <grantr> got it
16:42 <grantr> two pops at the same time to different nodes could get the same item
16:44 <grantr> and i guess that's a result of all the reads and writes (deletes) being in the same "place"
16:45 <grantr> that's how it's different from the typical usage pattern of a kv store
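A toy illustration of the race grantr identifies: two consumers popping through different, not-yet-converged replicas both receive the head item. The replica model below is a deliberately crude stand-in for real replication; it only exists to show why a queue's read-then-delete pattern fights eventual consistency:

```python
# Toy illustration of the duplicate-pop race under eventual consistency.
# Two replicas of the same queue key briefly diverge; consumers popping
# through different replicas each see item "a".

class Replica:
    def __init__(self, items):
        self.items = list(items)

    def pop(self):
        # read-then-delete against this replica only
        return self.items.pop(0) if self.items else None


replica1 = Replica(["a", "b", "c"])
replica2 = Replica(["a", "b", "c"])   # converged... for now

# Two consumers pop "at the same time" through different replicas,
# before either delete has replicated to the other side:
print(replica1.pop())   # -> 'a'
print(replica2.pop())   # -> 'a'  (duplicate delivery)
```

Once the deletes do replicate, the replicas converge again, but both consumers have already processed "a" - at best you get at-least-once delivery and have to deduplicate downstream.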
16:46 <grantr> hibari's chain replication could be a better system than dynamo then for replicated queues
16:46 <grantr> anything with strong consistency
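A minimal sketch of the chain replication idea behind Hibari that grantr points at, with illustrative class and method names (this is the general scheme, not Hibari's actual code): writes enter at the head and are acknowledged only once they reach the tail, and reads are served by the tail, so a read never observes a write the other replicas haven't applied:

```python
# Minimal chain replication sketch (the scheme Hibari builds on).
# Writes flow head -> middle -> tail and are acked at the tail;
# reads go to the tail only, so every read reflects a write that
# all replicas in the chain have already applied.

class ChainNode:
    def __init__(self, next_node=None):
        self.store = {}
        self.next = next_node

    def write(self, key, value):
        self.store[key] = value
        if self.next is not None:
            return self.next.write(key, value)  # forward down the chain
        return "ack"                            # tail acknowledges

    def read(self, key):
        return self.store.get(key)


tail = ChainNode()
middle = ChainNode(next_node=tail)
head = ChainNode(next_node=middle)

assert head.write("q/0", "item-a") == "ack"
print(tail.read("q/0"))   # reads only hit the tail -> 'item-a'
```

That strong consistency is what makes the "two pops, one item" race avoidable - at the cost of the full chain's write latency and more involved failure handling when a link in the chain dies.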