11:29 <copious> hi all, I was working up some numbers for capacity planning and
came up with these estimates, does this look right to folks?
11:29 <copious> http://spreadsheets.google.com/ccc?key=0Amb23lPTCzBkdGlPYnBRaUotWlUxOGtFdHJGTEMzcWc&hl=en&authkey=CMSRiAs#gid=0
11:31 <seancribbs> yes, as long as you're considering those minimums
11:34 <seancribbs> what's the RAM/Disk in your current setup like?
11:38 <copious> currently we're at 2 machines, 8GB of ram each w/ 2TB of disk each
11:38 <copious> moving to 2 machines w/ 32GB of ram each w/ 4TB disk each
11:38 <seancribbs> oh, i meant the system you're replacing, but that's cool
11:38 <copious> this is the stopgap measure before moving to a more comprehensive and
linearly scalable solution
11:39 <copious> which is where riak is a candidate
11:39 <seancribbs> keep in mind 2 machines at N=3 is going to be degenerate
in a lot of ways
11:39 <copious> I'm assuming that with W=3 we would have at least 3 machines
11:40 <seancribbs> yeah
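
For concreteness, here is a minimal sketch of how the N/W values above map onto client
calls, assuming the riak Python client; the bucket name, key, and localhost address are
placeholders, not anything from the conversation:

    import riak

    # Connect to any node in the cluster (address/port are placeholders).
    client = riak.RiakClient(host='127.0.0.1', pb_port=8087, protocol='pbc')

    bucket = client.bucket('docs')
    bucket.set_property('n_val', 3)       # N=3: three replicas per key

    # W=3: the write is only acknowledged once all three replicas respond,
    # which is why you want at least three physical machines behind it.
    obj = bucket.new('doc-1', data={'body': 'example'})
    obj.store(w=3)

    # Reads can use a smaller quorum, e.g. R=2.
    fetched = bucket.get('doc-1', r=2)
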
11:41 <copious> the part I don't like is that with these estimates, to store our current #
of documents (2+ billion) I would need at least 2x that right column
11:41 <copious> which seems a little excessive
11:46 <seancribbs> copious: fwiw, 6 nodes for 2 Bn documents seems small to me
11:47 <seancribbs> i would be concerned about I/O throughput at that size
11:51 <copious> well those 96 and 128 GB lines are mostly pipe dreams,
more realistically, I would think you would want a good chunk of ram to be
available for disk cache, so that is more in the 32/48 ram line
11:51 <copious> so let's call it 12 machines for 2 billion docs.
11:52 <copious> to me that seems large, probably because we have
worked with just 2 machines that satisfy the same load
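
As a sanity check on the node counts being thrown around, here is a rough storage-only
estimate in Python. The average object size and the headroom factor are assumptions for
illustration, not figures from the spreadsheet:

    # Back-of-the-envelope: how many nodes does the replicated data alone need?
    docs             = 2_000_000_000   # ~2 billion documents
    avg_doc_bytes    = 5 * 1024        # assumed average stored size, incl. overhead
    n_val            = 3               # replicas per key
    disk_per_node    = 4e12            # 4 TB machines, as mentioned above
    usable_fraction  = 0.5             # headroom for compaction, handoff, growth

    replicated_bytes = docs * avg_doc_bytes * n_val
    nodes_needed     = replicated_bytes / (disk_per_node * usable_fraction)
    print("%.1f TB replicated -> at least %.1f nodes for storage alone"
          % (replicated_bytes / 1e12, nodes_needed))

Disk is only one axis; the I/O-throughput and disk-cache concerns above can push the
count higher.
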
11:52 <seancribbs> So, what I'm concerned about is the partition size
11:52 <* copious> listens
11:52 <seancribbs> copious: but with the same amount of replication?
11:53 <seancribbs> anyway
11:53 <copious> W=2 effectively
11:53 <seancribbs> ok. regarding partition/ring size. you're going to
get better throughput if you have more partitions but you have to balance
that against the physical size of your cluster. too many partitions/node
gets problematic
11:57 <copious> what is a good balance of partitions / node ?
11:57 <seancribbs> between 10 and 50
11:59 <copious> and is there a size limitation per partition? or how does that all fit
together
12:00 <seancribbs> no, just think of partitions/vnodes as bottlenecks.
when you have an adequate number, your throughput is good
12:00 <copious> ok
12:01 <seancribbs> if you have too few, you'll have low throughput,
if you have too many, you could overload a node
12:01 <seancribbs> anyway, that's a real simplified version of it
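
A tiny sketch of the ring-size arithmetic behind the 10-50 partitions-per-node rule of
thumb above; ring sizes are powers of two, and the target of ~32 vnodes per node here is
just an assumption for the example:

    # Ring size is fixed when the cluster is created, so size it for the
    # node count you expect to grow into, keeping vnodes-per-node in range.
    def suggest_ring_size(nodes, target_vnodes_per_node=32):
        ring_size = 64
        while ring_size < nodes * target_vnodes_per_node:
            ring_size *= 2
        return ring_size

    for nodes in (4, 12, 24):
        ring = suggest_ring_size(nodes)
        print("%2d nodes -> ring_size %4d -> %.1f vnodes/node"
              % (nodes, ring, ring / nodes))
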
12:01 <copious> sounds good. another question. this one came up when
I was explaining riak to a coworker. What is the cost of rebalancing?
say I have the 12 machines, and the capacity is getting up there and
it's time to add new machines to the cluster
12:02 <seancribbs> time to transfer the partitions off disk and
across the network, plus a little. you can transfer up to 4
partitions at a time by default. so at 12 machines, say you have 256 ring size
12:03 <copious> how much does the transfer of the partitions
negatively affect the normal operation ?
12:04 <copious> 256 ring size, add one new machine...
12:04 <seancribbs> so that's a transfer of 19 or 20 partitions.
from among the existing 12 nodes. so… handoff doesn't really hurt normal ops.
but in small clusters you can sometimes get "not found" responses during
handoff
12:08 <copious> would the "not found" happen randomly or does it
happen more with recent or older docs or some other factor?
12:08 <seancribbs> it happens mostly when basic quorum collides with
partitions in-transit
12:08 <copious> ok
12:08 <seancribbs> but i've only really seen it in small clusters, i.e. < 4 nodes
12:17 <copious> so back of the envelope, 20 partitions to transfer,
each @ ~54 GiB (1TB per node w/ 21 partitions per node) when 1 machine is added
12:18 <copious> at about 1GByte/sec transfer we're looking at ~20 hours to rebalance,
does that sound about right?
12:18 <seancribbs> ick, yeah. so, the flip side of this is that you
could go with more, smaller machines. increase the ring size.
54GB per partition sounds like a lot
12:19 <holoway> exactly - if re-balancing time is the concern,
you need to have more data storage nodes
12:23 <copious> I wouldn't say that re-balancing is the concern,
just running some numbers
12:24 <copious> so what is the right size for a partition
12:31 <seancribbs> copious: i would hope much smaller than 54GB…
but I don't know for sure
12:31 <seancribbs> i.e. no real-world data points
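
The handoff back-of-envelope above is easy to parameterize; here is the same arithmetic
as a small script. It only counts raw bytes over the wire and ignores handoff concurrency
limits, disk contention, and live traffic, so treat the output as a lower bound rather
than a prediction:

    ring_size      = 256
    old_nodes      = 12
    data_per_node  = 1e12                       # ~1 TB on each existing node

    vnodes_per_node  = ring_size / old_nodes    # ~21
    partition_bytes  = data_per_node / vnodes_per_node
    partitions_moved = ring_size / (old_nodes + 1)   # ~20 claimed by the new node
    bytes_moved      = partitions_moved * partition_bytes

    for label, rate in (("1 GByte/s", 1e9), ("1 Gbit/s", 0.125e9)):
        hours = bytes_moved / rate / 3600.0
        print("%.2f TB at %s: ~%.1f hours of raw transfer"
              % (bytes_moved / 1e12, label, hours))
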
15:00 <seancribbs> copious: I was thinking, it would be good to
use basho_bench to preload a bunch of keys to simulate your production
data size, then benchmark something resembling your production load
(in terms of throughput and key-hotness)
15:01 <seancribbs> but maybe start small, then increase the data size
until you start to see drop off
15:02 <seancribbs> the challenge with it all is that there are so many variables
15:45 <copious> seancribbs: our data is pretty LRU, that is, the data that was
inserted into the system most recently is what will be utilized the heaviest
for probably the next day or so.
15:45 <copious> after that, there is a huge drop off, which is why putting
varnish in front of it works out really well for us.
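
basho_bench itself is driven by an Erlang config file, but the preload-then-benchmark
idea is easy to mock up in any client. Here is a rough Python sketch, again assuming the
riak Python client, with the key count and the "hot recent keys" skew invented to mirror
the access pattern described above:

    import random
    import riak

    client = riak.RiakClient(host='127.0.0.1', pb_port=8087, protocol='pbc')
    bucket = client.bucket('bench')            # placeholder bucket name

    PRELOAD = 100_000                          # scale toward production key counts

    # 1. Preload keys to simulate the existing data size.
    for i in range(PRELOAD):
        bucket.new('key-%d' % i, data={'n': i}).store()

    # 2. Replay a read-heavy load skewed toward recently written keys,
    #    mimicking "most recent data is hottest for about a day".
    def hot_key():
        if random.random() < 0.9:              # assumed skew: 90% of reads...
            lo = int(PRELOAD * 0.9)            # ...hit the newest 10% of keys
            return 'key-%d' % random.randint(lo, PRELOAD - 1)
        return 'key-%d' % random.randint(0, PRELOAD - 1)

    for _ in range(10_000):
        bucket.get(hot_key())
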