@PharkMillups
Created November 29, 2010 22:40
14:11 <quentusrex> Anyone know if there is a simple way to pull
documents (about 3 million documents) from a couchdb server into a
new riak cluster?
14:12 <benblack> the only way of which i am aware is reading the
documents and writing them to riak.
14:14 <quentusrex> thanks benblack I was afraid of that.
14:14 <benblack> why?
14:14 <benblack> seems like a pretty short shell script
14:14 <benblack> and 3 million is a pretty small number of documents
14:14 <quentusrex> yes, short. but time consuming based on the number of
documents
14:14 <quentusrex> oops,
14:14 <quentusrex> billion,
14:15 <benblack> 3 billion is rather more, yes
14:15 <quentusrex> not million
14:15 <benblack> surprised you have that in couchdb
14:15 <quentusrex> bigcouch
14:15 <benblack> that makes much more sense
14:16 <quentusrex> We fully moved the data from a sharded postgresql cluster,
to a bigcouch cluster
14:16 <quentusrex> But after disappointments, and some data loss, we are trying to test riak
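The "short shell script" approach benblack mentions can be sketched in a few lines. The sketch below pages through CouchDB's `_all_docs` view and PUTs each document into Riak over its old `/riak/<bucket>/<key>` HTTP interface; the host names, database, bucket, and batch size are placeholders, not anything from the log:

```python
import json
from urllib.parse import quote

def rows_to_objects(all_docs_page):
    """Turn a CouchDB _all_docs?include_docs=true page into (key, doc)
    pairs, dropping CouchDB-internal fields (_id, _rev) that Riak
    has no use for."""
    out = []
    for row in all_docs_page["rows"]:
        doc = {k: v for k, v in row["doc"].items() if not k.startswith("_")}
        out.append((row["id"], doc))
    return out

def transfer(couch_url, riak_url, db, bucket, batch=1000):
    """Copy every document from one CouchDB database into one Riak
    bucket. Requires live servers; shown for shape only."""
    from urllib.request import urlopen, Request
    startkey = None
    while True:
        q = f"{couch_url}/{db}/_all_docs?include_docs=true&limit={batch}"
        if startkey is not None:
            # resume after the last id seen; skip=1 avoids re-copying it
            q += f"&startkey={quote(json.dumps(startkey))}&skip=1"
        page = json.load(urlopen(q))
        if not page["rows"]:
            break
        for key, doc in rows_to_objects(page):
            req = Request(f"{riak_url}/riak/{bucket}/{key}",
                          data=json.dumps(doc).encode(),
                          headers={"Content-Type": "application/json"},
                          method="PUT")
            urlopen(req).close()
        startkey = page["rows"][-1]["id"]
```

At 3 billion documents the loop itself is trivial; the wall-clock time is all network and disk, which is what the rest of the conversation is about.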
14:21 <timanglade> benblack: Andy & I had talked about writing some sort
of binary format translator for use-cases like the one outlined by quentusrex.
Any obvious reasons why that wouldn't be possible?
14:22 <benblack> you mean converting the couch btrees to bitcask files?
14:23 <timanglade> obviously, you'd have to unmarshall & re-marshall the documents
14:23 <timanglade> but yeah
14:23 <benblack> no reason you couldn't do that in erlang
14:23 <timanglade> any other projects that could work with?
14:23 <benblack> the main problem being distribution around the ring
14:23 <timanglade> hm right this is where my limited knowledge of Riak stops.
But couldn't we initially load it all on one node and let the ring re-distribute it all?
14:24 <benblack> at EF SF i suggested having a relational db converter and
replication slave
14:24 <benblack> no
14:24 <quentusrex> I would be fine with the ability to tell each node in the
riak cluster to import documents from X to Y range
14:24 <benblack> you'd have to have a single node ring, load the data,
then add new nodes
14:24 <timanglade> benblack: yes
14:24 <timanglade> that was my idea
14:24 <benblack> quentusrex: doesn't work that way
14:24 <timanglade> quentusrex you'd be starting from scratch, right?
14:25 <benblack> quentusrex: placement in a given partition is based on
the sha1 hash, not the key
14:25 <quentusrex> At this point in time, just having the data into
riak safely would make my weekend.
14:25 <timanglade> still wondering if a binary format translation +
redistribution would be faster than a script export/import
14:26 <timanglade> depends how “parallel” the Riak re-distribution can
be, I guess
14:26 <quentusrex> are you talking about the binary files? or binary
interface to couchdb?
14:26 <timanglade> binary files
14:26 <quentusrex> aah, in that case I could just install a
riak node on the couchdb cluster,
14:27 <quentusrex> but my question would be how riak would handle one
node(the one installed on the couchdb cluster) constantly filling with data.
14:27 <quentusrex> At what point would the node start pushing the data off
of the local node to the other nodes in the cluster.
14:27 <timanglade> not well I'd think. As benblack said, you'd need to start with a single-node
riak. Then add more Riak nodes once the import is done.
14:28 <timanglade> again, not a Riak expert so I'll let somebody else confirm this
14:28 <quentusrex> that would not work well in my case.
No one node could handle all of the data.
14:28 <timanglade> but may I suggest giving Cloudant a try? ^^
14:28 <quentusrex> I am using Cloudant's bigcouch right now
14:28 <benblack> "aah, in that case I could just install a riak node on the couchdb cluster," again, no
14:29 <benblack> you cannot have mixed bigcouch/riak clusters
14:29 <benblack> converting as tim is proposing would be just the files
14:29 <acts_as> the one downside to using riak as my web crawler, instead
of hadoop, is that with no master -- I can't call it my crawl plantation.
14:29 <benblack> and would require bootstrapping a cluster in a specific way
14:29 <benblack> acts_as: anarcho-syndicalist commune
14:29 <timanglade> quentusrex: did you share your issues with the Cloudant
guys already? I'm sure they’d love to know what your problems are, even
if you've already set your mind on moving to Riak.
14:30 <acts_as> that's getting there
14:30 <quentusrex> I am not the one who is working on the bigcouch issues.
I'm looking for a replacement. But I will make sure the issues get pushed
to the developers.
14:30 <timanglade> benblack: I think he was just trying to have the binary
file translation done locally on a single server, not do a mixed cluster
14:31 <timanglade> quentusrex: hit me up on tim@cloudant.com or anybody they’re used to dealing with, I'll make sure the appropriate people hear about it
14:31 <quentusrex> will do
14:31 <benblack> quentusrex: were you thinking above that you could
have a riak node participate in a couch cluster?
14:32 <quentusrex> no,
14:32 <quentusrex> Just to remove the network latency for the file translation.
14:32 <benblack> so, not in the cluster
14:32 <quentusrex> Not in the cluster, but on one of the nodes in the cluster.
14:32 <timanglade> quentusrex: so your data does not fit on a single
machine because of what… Storage limitations?
14:33 <benblack> given the data is not on that one node i doubt that buys you much
14:33 <timanglade> ie, disk space?
14:33 <quentusrex> timanglade, correct. It is possible to build a
server with drive space to store it,
14:33 <quentusrex> but our current hardware is limited to 750GB per server.
14:34 <benblack> how large are the documents?
14:34 <timanglade> 3B HTTP lookups is a lot to start with but if you also
add the constraint of moving petabytes of data…
14:34 <benblack> each
14:35 <quentusrex> I would estimate about half are roughly 25kb,
a quarter are between 25kb and 250kb, and all but the largest 1% are
between 250kb and 1MB.
14:35 <quentusrex> With the largest at about 20MB.
14:35 <timanglade> that's not bad for an HTTP transfer
14:36 <benblack> how much total data?
14:36 <timanglade> benblack: how would riak react to all nodes but
one going down temporarily. And him adding more local binary files on
the one node that’s still up. Then reconnecting the other nodes?
14:36 <benblack> i would expect badly
14:36 <quentusrex> I can go find out, but those are the numbers I was given.
14:37 <benblack> instead, this would best be handled by dividing up
the key range on the couch side and having a bunch of writers go in parallel
14:37 <benblack> from couch to riak
14:37 <quentusrex> benblack, that was my hope. To split the range,
and then have specific riak nodes pull their range into the riak cluster.
14:38 <timanglade> … if the network arch is not the bottleneck, yeah.
14:38 <benblack> whether it is push or pull
14:38 <benblack> that division of range is up to you
14:38 <benblack> and you could not do it in such a way as to make a riak
node only pull the documents for which it would be responsible
14:38 <quentusrex> timanglade, the plan was to run gigE from the
riak+couchdb nodes to the riak cluster. Since each server has dual nic.
14:39 <benblack> even if you max it out, that is a lot of data to shove through a gig link
14:39 <timanglade> wait, why would you still have shared nodes?
14:39 <quentusrex> At the moment I'm not aware if the bigcouch nodes are
sharded or not.
14:40 <timanglade> no, shared riak+couchdb
14:40 <quentusrex> We have physically separate racks for production and sandbox
14:40 <timanglade> although I guess localhost transfers could be faster?
14:40 <benblack> shared and sharded are not the same ;)
14:40 <quentusrex> and the couchdb cluster is in production atm.
14:41 <benblack> split your key range yourself
14:41 <benblack> run transfer processes in parallel
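benblack's "split your key range yourself, run transfer processes in parallel" can be sketched as carving the document-id space into prefix-delimited sub-ranges, one per worker. Hex prefixes are an assumption (a natural fit if the doc ids are UUID-like); any known id distribution works:

```python
def hex_ranges(width=1):
    """(startkey, endkey) pairs covering ids beginning with each hex
    prefix. Widening the prefix multiplies the worker count by 16."""
    prefixes = [format(i, f"0{width}x") for i in range(16 ** width)]
    # CouchDB convention: endkey = prefix + a high sentinel bounds the range
    return [(p, p + "\ufff0") for p in prefixes]

for start, end in hex_ranges():
    # each pair becomes one worker's query, e.g.:
    #   _all_docs?startkey="<start>"&endkey="<end>"&include_docs=true
    pass

print(len(hex_ranges()))         # 16 ranges -> 16 parallel workers
print(len(hex_ranges(width=2)))  # 256 ranges
```

As benblack notes just after, this division is purely on the CouchDB side; it cannot be made to align with which Riak node will ultimately own each document.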
14:41 <quentusrex> from couchdb to riak would be localhost(both on the same server),
and from riak to riak would be gigE.
14:42 <benblack> the couchdb to riak being localhost helps you little, imo.
14:42 <benblack> because, again, can you guarantee the key range is entirely
on the local machine?
14:42 <timanglade> benblack: don't all network implementations bypass the
physical network interface on localhost?
14:42 <quentusrex> would riak be able to tell if it is saturating the
riak to riak link?
14:42 <benblack> because, again, can you guarantee the key range
is entirely on the local machine?
14:42 <benblack> no
14:43 <quentusrex> benblack, at the moment no.
14:43 <benblack> right
14:43 <benblack> so the advantage of localhost transfer is pretty small, imo
14:43 <timanglade> maybe I'm out in left field but wouldn't dedicated network links help?
14:44 <timanglade> I'm sure internal couchdb and riak cluster communications are bad enough for your network to start with
14:44 <timanglade> but running two in parallel on the same network…
14:44 <quentusrex> timanglade, yes. that would help having a dedicated
network path from the couchdb network to the riak network.
14:45 <timanglade> but maybe it's negligible. Not sure what the network
usage patterns & resiliences are
14:45 <timanglade> compared to a load-balanced double gigE connection
that you just shove everything on in parallel
14:48 <timanglade> quentusrex: let me check with Cloudant to see if
(a) you could somehow force the BigCouch partitioning to match
the expected Riak partitioning. Then maybe you could either (b)
translate the binary files locally on each machine. Or at least (c)
only do localhost transfers on each node separately. benblack, would
that work, assuming condition (a) is met?
14:48 <benblack> extraordinarily unlikely
14:49 <benblack> but have at it :)
14:49 <timanglade> you mean (a) is unlikely?
14:50 <benblack> yes
14:51 <benblack> if it could be done, it would require configuring
riak in a specific way that may or may not be what is desired and
would be really hard to change (ring_creation_size)
14:51 <timanglade> I concur but I don't know for sure that it's
impossible so I'm checking right now with the core team. I'm
just the marketing guy after all…
14:52 <timanglade> mostly, I'm just angry that NOSQL interop is as
bad as it is. I’m sure we could all work towards making it a bit easier…
14:53 davidc_ joined
14:53 <timanglade> quentusrex: at this point, if time is really
your issue, your best bet would be to start that export/import
script right now and hope it completes by Sunday, imo. If you
figure out an optimal network conf, it might work out in time.
14:54 <quentusrex> timanglade, it's been running since yesterday.
14:54 <quentusrex> I was looking for a faster way to finish off the rest.
14:54 <timanglade> quentusrex: ah. What's the progress rate?
14:54 <quentusrex> We forgot to put a progress indicator in the script.
14:55 <timanglade> can't you figure it out roughly with the disk usage?
14:55 <quentusrex> but I think we'll finish sometime soon.
14:55 <quentusrex> within the next day or so
14:55 <timanglade> so what, about 72 hours for how much? 1PB?
14:55 <quentusrex> It wasn't my script, and I'm not on that screen.
14:56 <timanglade> ok, I was just interested to know how practical
those migrations really are
14:56 <quentusrex> Not that much data I don't think
14:56 <benblack> are you running in parallel?
14:56 <quentusrex> at least not when compressed
14:56 <quentusrex> benblack, 5 parallel instances of the script
14:56 <benblack> are you saturating the network?
14:56 <quentusrex> not the cross link
14:57 <benblack> then run more
14:57 <timanglade> unless he’s overloading the CPU of the CouchDB instances ;)
14:57 <benblack> MOAR PARALLELZ
14:59 <timanglade> benblack: you got a fever?
14:59 <benblack> and the only prescription
14:59 <benblack> is more bandwidth
14:59 <timanglade> is moar parallelz
14:59 <timanglade> ;)
15:00 <timanglade> quentusrex: at any rate, make sure to let me
know how that migration worked out. And if you can have your other
guys point out the issues you had, we want to know.
15:01 <timanglade> benblack: NOSQL has an answer to SQL’s “buy a bigger box”. It's “buy more bandwidth”.
15:02 <timanglade> cheaper in theory but I'm not sure how practical it is…
15:04 <benblack> sql requires a ton of bandwidth and low latency to do this, too
15:04 <benblack> see clustrix
15:04 <timanglade> true. It's just that I see internal bandwidth
requirements as a constant improvement factor in NOSQL systems,
on a bunch of different use-cases.
15:05 <timanglade> which makes total sense considering the design
15:05 <quentusrex> timanglade, benblack I misspoke earlier. Each
server has 7TB of drive space: 8 1TB drives in RAID 5.
15:05 <timanglade> it's just another part of standard datacenter
infrastructure that could evolve with NOSQL rising.
15:06 <benblack> raid5? yikes
15:06 <benblack> and that is a _lot_ of data per node
15:06 <benblack> can i suggest your hardware is optimized for
something other than a dynamo system?
15:07 <quentusrex> Rough back of the napkin data calculations:
15:07 <quentusrex> 1.5 billion documents at 25kb = 35.7 TB
15:07 <quentusrex> 750 million documents at 100kb = 71.5 TB
15:07 <quentusrex> 720 million documents at 600kb = 412 TB
15:07 <quentusrex> 30 million documents at 20MB = 586 TB
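Redoing the napkin math in decimal units (1 TB = 10^12 bytes) gives per-tier figures slightly different from those in the log, which presumably used binary units, but the same shape and the same rough total of a petabyte that came up earlier:

```python
# Back-of-the-napkin totals in decimal units; counts and sizes are
# taken from the figures quentusrex gives in the log.
KB, MB, TB = 10**3, 10**6, 10**12

tiers = [
    (1_500_000_000, 25 * KB),   # ~half the documents at ~25kb
    (750_000_000, 100 * KB),    # a quarter between 25kb and 250kb
    (720_000_000, 600 * KB),    # most of the rest, 250kb-1MB
    (30_000_000, 20 * MB),      # the largest ~1% at up to 20MB
]

sizes_tb = [count * size / TB for count, size in tiers]
print(sizes_tb)       # [37.5, 75.0, 432.0, 600.0]
print(sum(sizes_tb))  # 1144.5 TB, i.e. roughly 1.1 PB
```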
15:08 <benblack> your hardware selection is not a great choice, imo
15:09 <quentusrex> Agreed.
15:10 <benblack> if you want fat nodes, you should have 10G ethernet
15:10 <benblack> on 1G, you need skinny nodes
15:12 <quentusrex> Know of any good 10G switches?
15:13 <benblack> depends on your budget
15:27 <timanglade> quentusrex, benblack: I was told that it would be
technically possible to hot-remap a BigCouch cluster according to an SHA-1 map
15:27 <timanglade> involves going deep in the code though
15:28 <timanglade> but could be applied to an existing cluster. Hijinks
would probably ensue…
15:32 <quentusrex> The effort would probably be better spent working on
import/export for riak
15:33 <timanglade> yes
15:33 <benblack> timanglade: and would have to inspect the riak ring and
make it match and would require the same partition count in riak
15:34 <timanglade> at that point, I was just interested to know if it was possible
15:34 <timanglade> doesn't just having to calculate 3B object SHA-1s
and map them make this approach not worth it?
15:36 <timanglade> you'd have to write so much code to distribute
that part of the job… Then plug that into BigCouch somehow. Change the
code. Wait for BigCouch to rebalance. Write a binary translator in
Erlang / transfer over localhost. Configure Riak.
15:36 <timanglade> All the while praying to Saint Jude
15:41 <quentusrex> is there a riak ui similar to futon?
15:45 <benblack> not to my knowledge
16:35 <quentusrex> I'm getting a 400 bad request from the
following curl request: curl -v -X PUT -d '{"bar":"baz"}' -H "Content-Type: application/json" http://riak1:8098/riak/test
16:37 <quentusrex> trying to add an object to the 'test' bucket, and have
riak generate the key
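The 400 here is most likely the method, not the payload: in the Riak HTTP API of that era, PUT requires the full `/riak/<bucket>/<key>` path, while a server-generated key comes from POST-ing to the bucket URL instead. A small sketch of the distinction (the host and bucket names echo the log; the helper itself is hypothetical):

```python
def riak_store_request(base, bucket, key=None):
    """Return (method, url) for storing an object in Riak's HTTP API.

    With no key, Riak generates one: POST to the bucket URL and read
    the new key back from the 201 response's Location header."""
    if key is None:
        return ("POST", f"{base}/riak/{bucket}")
    return ("PUT", f"{base}/riak/{bucket}/{key}")

print(riak_store_request("http://riak1:8098", "test"))
# ('POST', 'http://riak1:8098/riak/test')
print(riak_store_request("http://riak1:8098", "test", "k1"))
# ('PUT', 'http://riak1:8098/riak/test/k1')
```

So the original curl line should work with `-X POST` in place of `-X PUT`.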
16:42 <quentusrex> timanglade, it sounds like the biggest issue
with bigcouch was the lack of 'per bucket NRW properties'
16:43 <timanglade> quentusrex: what about the data loss you referred to?
16:44 <quentusrex> timanglade, we lost a couple servers, and the data was lost with them.
16:44 <quentusrex> bigcouch did not lose the data,
16:44 <timanglade> oh, it's your NRW that was not appropriately set then
16:44 <quentusrex> right.
16:45 <quentusrex> We have a ton of data that can be recalculated,
and a 'decent amount' that must be kept no matter how bad the cluster's health.
16:46 <timanglade> hm
16:46 <quentusrex> not the only reason, but a big one.
16:46 <timanglade> you can set nrw for each database though
16:46 <timanglade> which would be the BigCouch equivalent of a “bucket”