11:29 <copious> hi all, I was working up some numbers for capacity planning and
came up with these estimates, does this look right to folks?
11:29 <copious> http://spreadsheets.google.com/ccc?key=0Amb23lPTCzBkdGlPYnBRaUotWlUxOGtFdHJGTEMzcWc&hl=en&authkey=CMSRiAs#gid=0
11:31 <seancribbs> yes, as long as you're considering those minimums
11:34 <seancribbs> what's the RAM/Disk in your current setup like?
11:38 <copious> currently we're at 2 machines, 8GB of ram each w/ 2TB of disk each
11:38 <copious> moving to 2 machines w/ 32GB of ram each w/ 4TB disk each
11:38 <seancribbs> oh, i meant the system you're replacing, but that's cool
11:38 <copious> this is the stopgap measure before moving to a more comprehensive and
linearly scalable solution
11:39 <copious> which is where riak is a candidate
11:39 <seancribbs> keep in mind 2 machines at N=3 is going to be degenerate
in a lot of ways
11:39 <copious> I'm assuming that with W=3 we would have at least 3 machines
11:40 <seancribbs> yeah
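
For concreteness, here is a minimal sketch of how the N/W values above map onto client
calls, assuming the riak Python client; the bucket name, key, and localhost address are
placeholders, not anything from the conversation:

    import riak

    # Connect to any node in the cluster (address/port are placeholders).
    client = riak.RiakClient(host='127.0.0.1', pb_port=8087, protocol='pbc')

    bucket = client.bucket('docs')
    bucket.set_property('n_val', 3)       # N=3: three replicas per key

    # W=3: the write is only acknowledged once all three replicas respond,
    # which is why you want at least three physical machines behind it.
    obj = bucket.new('doc-1', data={'body': 'example'})
    obj.store(w=3)

    # Reads can use a smaller quorum, e.g. R=2.
    fetched = bucket.get('doc-1', r=2)
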
11:41 <copious> the part I don't like is that with these estimates, to store our current #
of documents (2+ billion) I would need at least 2x that right column
11:41 <copious> which seems a little excessive
11:46 <seancribbs> copious: fwiw, 6 nodes for 2 Bn documents seems small to me
11:47 <seancribbs> i would be concerned about I/O throughput at that size
11:51 <copious> well those 96 and 128 GB lines are mostly pipe dreams,
more realistically, I would think you would want a good chunk of ram to be
available for disk cache, so that is more in the 32/48 ram line
11:51 <copious> so let's call it 12 machines for 2 billion docs.
11:52 <copious> to me that seems large, probably because we have
worked with just 2 machines that satisfy the same load
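
As a sanity check on the node counts being thrown around, here is a rough storage-only
estimate in Python. The average object size and the headroom factor are assumptions for
illustration, not figures from the spreadsheet:

    # Back-of-the-envelope: how many nodes does the replicated data alone need?
    docs             = 2_000_000_000   # ~2 billion documents
    avg_doc_bytes    = 5 * 1024        # assumed average stored size, incl. overhead
    n_val            = 3               # replicas per key
    disk_per_node    = 4e12            # 4 TB machines, as mentioned above
    usable_fraction  = 0.5             # headroom for compaction, handoff, growth

    replicated_bytes = docs * avg_doc_bytes * n_val
    nodes_needed     = replicated_bytes / (disk_per_node * usable_fraction)
    print("%.1f TB replicated -> at least %.1f nodes for storage alone"
          % (replicated_bytes / 1e12, nodes_needed))

Disk is only one axis; the I/O-throughput and disk-cache concerns above can push the
count higher.
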
11:52 <seancribbs> So, what I'm concerned about is the partition size
11:52 <* copious> listens
11:52 <seancribbs> copious: but with the same amount of replication?
11:53 <seancribbs> anyway
11:53 <copious> W=2 effectively
11:53 <seancribbs> ok. regarding partition/ring size. you're going to
get better throughput if you have more partitions but you have to balance
that against the physical size of your cluster. too many partitions/node
gets problematic
11:57 <copious> what is a good balance of partitions / node ?
11:57 <seancribbs> between 10 and 50
11:59 <copious> and is there a size limitation per partition? or how does that all fit
together
12:00 <seancribbs> no, just think of partitions/vnodes as bottlenecks.
when you have an adequate number, your throughput is good
12:00 <copious> ok
12:01 <seancribbs> if you have too few, you'll have low throughput,
if you have too many, you could overload a node
12:01 <seancribbs> anyway, that's a real simplified version of it
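
A tiny sketch of the ring-size arithmetic behind the 10-50 partitions-per-node rule of
thumb above; ring sizes are powers of two, and the target of ~32 vnodes per node here is
just an assumption for the example:

    # Ring size is fixed when the cluster is created, so size it for the
    # node count you expect to grow into, keeping vnodes-per-node in range.
    def suggest_ring_size(nodes, target_vnodes_per_node=32):
        ring_size = 64
        while ring_size < nodes * target_vnodes_per_node:
            ring_size *= 2
        return ring_size

    for nodes in (4, 12, 24):
        ring = suggest_ring_size(nodes)
        print("%2d nodes -> ring_size %4d -> %.1f vnodes/node"
              % (nodes, ring, ring / nodes))
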
12:01 <copious> sounds good. another question. this one came up when
I was explaining riak to a coworker. What is the cost of rebalancing?
say I have the 12 machines, and the capacity is getting up there and
it's time to add new machines to the cluster
12:02 <seancribbs> time to transfer the partitions off disk and
across the network, plus a little. you can transfer up to 4
partitions at a time by default. so at 12 machines, say you have 256 ring size
12:03 <copious> how much does the transfer of the partitions
negatively affect the normal operation ?
12:04 <copious> 256 ring size, add one new machine...
12:04 <seancribbs> so that's a transfer of 19 or 20 partitions.
from among the existing 12 nodes. so… handoff doesn't really hurt normal ops.
but in small clusters you can sometimes get "not found" responses during
handoff
12:08 <copious> would the "not found" happen randomly or does it
happen more with recent or older docs or some other factor?
12:08 <seancribbs> it happens mostly when basic quorum collides with
partitions in-transit
12:08 <copious> ok
12:08 <seancribbs> but i've only really seen it in small clusters, i.e. < 4 nodes
12:17 <copious> so back of the envelope, 20 partitions to transfer,
each @ ~54 GiB (1TB per node w/ 21 partitions per node) when 1 machine is added
12:18 <copious> at about 1GByte/sec transfer we're looking at ~20 hours to rebalance,
does that sound about right?
12:18 <seancribbs> ick, yeah. so, the flip side of this is that you
could go with more, smaller machines. increase the ring size.
54GB per partition sounds like a lot
12:19 <holoway> exactly - if re-balancing time is the concern,
you need to have more data storage nodes
12:23 <copious> I wouldn't say that re-balancing is the concern,
just running some numbers
12:24 <copious> so what is the right size for a partition
12:31 <seancribbs> copious: i would hope much smaller than 54GB…
but I don't know for sure
12:31 <seancribbs> i.e. no real-world data points
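
The handoff back-of-envelope above is easy to parameterize; here is the same arithmetic
as a small script. It only counts raw bytes over the wire and ignores handoff concurrency
limits, disk contention, and live traffic, so treat the output as a lower bound rather
than a prediction:

    ring_size      = 256
    old_nodes      = 12
    data_per_node  = 1e12                       # ~1 TB on each existing node

    vnodes_per_node  = ring_size / old_nodes    # ~21
    partition_bytes  = data_per_node / vnodes_per_node
    partitions_moved = ring_size / (old_nodes + 1)   # ~20 claimed by the new node
    bytes_moved      = partitions_moved * partition_bytes

    for label, rate in (("1 GByte/s", 1e9), ("1 Gbit/s", 0.125e9)):
        hours = bytes_moved / rate / 3600.0
        print("%.2f TB at %s: ~%.1f hours of raw transfer"
              % (bytes_moved / 1e12, label, hours))
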
15:00 <seancribbs> copious: I was thinking, it would be good to
use basho_bench to preload a bunch of keys to simulate your production
data size, then benchmark something resembling your production load
(in terms of throughput and key-hotness)
15:01 <seancribbs> but maybe start small, then increase the data size
until you start to see drop off
15:02 <seancribbs> the challenge with it all is that there are so many variables
15:45 <copious> seancribbs: our data is pretty LRU, that is, the data that was
inserted into the system most recently is what will be utilized the heaviest
for probably the next day or so.
15:45 <copious> after that, there is a huge drop off, which is why putting
varnish in front of it works out really well for us.
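
basho_bench itself is driven by an Erlang config file, but the preload-then-benchmark
idea is easy to mock up in any client. Here is a rough Python sketch, again assuming the
riak Python client, with the key count and the "hot recent keys" skew invented to mirror
the access pattern described above:

    import random
    import riak

    client = riak.RiakClient(host='127.0.0.1', pb_port=8087, protocol='pbc')
    bucket = client.bucket('bench')            # placeholder bucket name

    PRELOAD = 100_000                          # scale toward production key counts

    # 1. Preload keys to simulate the existing data size.
    for i in range(PRELOAD):
        bucket.new('key-%d' % i, data={'n': i}).store()

    # 2. Replay a read-heavy load skewed toward recently written keys,
    #    mimicking "most recent data is hottest for about a day".
    def hot_key():
        if random.random() < 0.9:              # assumed skew: 90% of reads...
            lo = int(PRELOAD * 0.9)            # ...hit the newest 10% of keys
            return 'key-%d' % random.randint(lo, PRELOAD - 1)
        return 'key-%d' % random.randint(0, PRELOAD - 1)

    for _ in range(10_000):
        bucket.get(hot_key())
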