@PharkMillups
Created August 4, 2010 20:00
12:57 <siculars> hey gang. so i was looking over the bitcask-intro.pdf
file again,
http://downloads.basho.com/papers/bitcask-intro.pdf.
if you look at the bitcask file layout on pgs 2-3 and
read the following on pg 3: "After the append completes,
an in-memory structure called a 'keydir' is updated." you
get an idea of what's going on under the hood. My question is,
in light of the issue of scanning all keys in a cluster for
a map/reduce, why don't you just store the bucket info in the row header
along with the key? that way your in-memory 'keydir' could be filtered by
bucket or constructed differently.
12:59 <seancribbs> siculars: it is stored there, just not segregated from the key.
12:59 <seancribbs> you're essentially asking for a hash of hashes
13:00 <siculars> right, or some other mechanism. so the key column is basically
bucket/key munged together.
13:02 <seancribbs> yes
13:02 <siculars> if it's only there for uniqueness, it's not really that helpful,
is it?
13:03 <seancribbs> actually it's more for lookups. i.e. "where is this piece of data
in bitcask's file structure"
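[The keydir described above can be sketched in Python. This is an illustrative model, not Bitcask's actual Erlang code: the munging scheme and the entry fields (file_id, value_size, value_pos, timestamp, as listed in bitcask-intro.pdf) are assumptions for the sketch.]

```python
# Illustrative sketch of the keydir: one flat hash from the munged
# bucket/key binary to the record's location in Bitcask's file structure.
keydir = {}  # munged key -> (file_id, value_size, value_pos, timestamp)

def munge(bucket: bytes, key: bytes) -> bytes:
    # Riak's KV backend hands Bitcask one opaque binary; the exact
    # encoding here is an assumption for illustration only.
    return bucket + b"/" + key

def keydir_put(bucket, key, file_id, value_size, value_pos, timestamp):
    keydir[munge(bucket, key)] = (file_id, value_size, value_pos, timestamp)

def keydir_get(bucket, key):
    # A single hash lookup answers "where is this piece of data on disk?"
    return keydir.get(munge(bucket, key))
```

[Because the bucket is folded into one opaque binary, there is no cheap way to enumerate the keys of a single bucket without scanning every entry, which is the problem siculars raises.]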
13:30 <justinsheehy> siculars: the only thing missing from that analysis is
that bitcask doesn't know anything at all about buckets. it's just a
binary-key/binary-value store.
13:31 <justinsheehy> hence the bucket awareness being in the bitcask
'kv' backend, which is basically the riak-to-bitcask bridge
13:33 <siculars> but you could also bin-hash the bucket and store
it in a separate field, no? just add two fields in bitcask for
bucket and bucket size? something like bsz | ksz | vsz | b | k | v
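[siculars' proposed record layout can be sketched with Python's struct module. The field widths (big-endian 32-bit size prefixes) are an assumption; the chat doesn't specify them, and the real Bitcask header also carries a CRC and timestamp that are omitted here for brevity.]

```python
import struct

# Sketch of the proposed on-disk record: bsz | ksz | vsz | b | k | v,
# i.e. a separate bucket field instead of one munged bucket/key binary.
HEADER = ">III"  # bsz, ksz, vsz as big-endian uint32 (assumed widths)

def pack_record(bucket: bytes, key: bytes, value: bytes) -> bytes:
    header = struct.pack(HEADER, len(bucket), len(key), len(value))
    return header + bucket + key + value

def unpack_record(data: bytes):
    bsz, ksz, vsz = struct.unpack(HEADER, data[:12])
    b = data[12:12 + bsz]
    k = data[12 + bsz:12 + bsz + ksz]
    v = data[12 + bsz + ksz:12 + bsz + ksz + vsz]
    return b, k, v
```

[With the bucket stored as its own field, the keydir could be filtered or partitioned by bucket without parsing keys, which is the payoff siculars is after.]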
13:33 <siculars> i dunno, just trying to think of ways out of
the scan-all-keys-in-the-cluster problem...
13:35 <justinsheehy> I am also thinking about that problem. :-)
13:35 <drev1> siculars: seems like that option would add Riak-dependent
functionality to Bitcask
13:36 <justinsheehy> and yes, changing bitcask to be less general-purpose
would be one path if we wanted to go that way
13:36 <justinsheehy> like drev1 said: right now, bitcask knows nothing at all
about Riak. it's just a local k/v store.
13:37 <siculars> true true... the bitcask/bucket thing is gonna be a headache, fd-wise.
13:38 <justinsheehy> yep
13:39 <siculars> if you are currently using hash(bucket/key) to create
your keys, isn't there some way to branch your mem index by the bucket?
13:39 <justinsheehy> that way is pretty easy to do (and thus easy to commit
to having soon if there's nothing else) but has its own kind of pain. hence
still looking to see if there's a better way.
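[The "branch your mem index by the bucket" idea, what seancribbs earlier called a hash of hashes, can be sketched as follows. Purely illustrative Python, not Bitcask's implementation; the location value stands in for the (file_id, value_size, value_pos, timestamp) tuple.]

```python
from collections import defaultdict

# Illustrative hash-of-hashes keydir: a first-level hash on bucket,
# a second-level hash on key. Per-bucket key listing becomes cheap,
# at the cost of a second hash level per lookup.
keydir_by_bucket = defaultdict(dict)  # bucket -> {key -> location}

def put(bucket: bytes, key: bytes, location):
    keydir_by_bucket[bucket][key] = location

def list_keys(bucket: bytes):
    # Enumerates one bucket without scanning every key in the store.
    return list(keydir_by_bucket[bucket].keys())
```

[This is the "pretty easy to do" path justinsheehy mentions; the "pain" is that it bakes a Riak concept (buckets) into an otherwise general-purpose binary-key/binary-value store.]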
13:40 <siculars> data architectures of one bucket per user/date frequency,
etc. are gonna get burned.
13:42 <justinsheehy> if they have to do it that way, yes. but I am not yet
resigned to that. we'll see.
13:47 <justinsheehy> am hoping to not have to make bitcask too riak-aware,
which is where the tension comes from. can definitely do it that way if we need
to, but will be more work (as right now bitcask is
entirely unaware of the contents of the key, etc.) and would also require
either making an ongoing fork or else making bitcask less general-purpose useful.
13:48 <justinsheehy> I am confident that we'll find a good answer, especially
since we already know some "okay" answers. it'll just take a bit of work.
13:49 <drev1> another possibility is adding an arbitrary flag field
(like memcache's) to Bitcask, which could be used by Riak for bucket-aware
keys, but that would increase the size of the in-memory keydir
13:52 <justinsheehy> yeah... a separate problem we're also looking to solve
is reducing/removing some of the RAM constraints imposed by bitcask. heh.
13:53 <justinsheehy> but it's an interesting idea. hm.
14:00 <siculars> true. redis has been doing a bunch of work to decrease
their mem footprint...
http://blog.zawodny.com/2010/07/25/1250000000-keyvalue-pairs-in-redis-2-0-0-rc3-on-a-32gb-machine/