11:29 <copious> hi all, I was working up some numbers for capacity planning and came up with these estimates, does this look right to folks?
11:29 <copious> http://spreadsheets.google.com/ccc?key=0Amb23lPTCzBkdGlPYnBRaUotWlUxOGtFdHJGTEMzcWc&hl=en&authkey=CMSRiAs#gid=0
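The spreadsheet itself isn't reproduced here, but the shape of the math being sanity-checked is simple. A back-of-envelope sketch in Python, with the document count taken from later in the conversation and the object size, replication factor, and disk figures as illustrative assumptions:

```python
# Rough Riak capacity estimate; all inputs are illustrative guesses.
TOTAL_DOCS = 2_000_000_000    # ~2 billion documents, per the discussion below
AVG_DOC_BYTES = 2 * 1024      # assumed average object size (2 KiB)
N_VAL = 3                     # replicas per object (Riak's default n_val)
DISK_PER_NODE_TIB = 4         # usable disk per node, assumed
FILL_TARGET = 0.5             # keep disks at most half full for headroom

raw_bytes = TOTAL_DOCS * AVG_DOC_BYTES * N_VAL
usable_per_node = int(DISK_PER_NODE_TIB * 1024**4 * FILL_TARGET)
nodes_for_disk = -(-raw_bytes // usable_per_node)  # ceiling division

print(f"replicated data: {raw_bytes / 1024**4:.1f} TiB")
print(f"nodes needed for disk alone: {nodes_for_disk}")
```

With these guesses it comes out at 6 nodes, the same order of magnitude as the "6 nodes for 2 Bn documents" figure below; RAM for disk cache and I/O throughput push the count up from there.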
11:31 <seancribbs> yes, as long as you're considering those minimums
11:34 <seancribbs> what's the RAM/disk in your current setup like?
11:38 <copious> currently we're at 2 machines with 8GB of RAM each w/ 2TB of disk each
11:38 <copious> moving to 2 machines w/ 32GB of RAM each w/ 4TB of disk each
11:38 <seancribbs> oh, i meant the system you're replacing, but that's cool
11:38 <copious> this is the stopgap measure before moving to a more comprehensive and linearly scalable solution
11:39 <copious> which is where riak is a candidate
11:39 <seancribbs> keep in mind 2 machines at N=3 is going to be degenerate in a lot of ways
11:39 <copious> I'm assuming that with W=3 we would have at least 3 machines
11:40 <seancribbs> yeah
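Why 2 machines at N=3 is degenerate: with fewer physical nodes than replicas, every preference list has to repeat a machine. A toy illustration (round-robin ownership is a simplification of Riak's actual claim algorithm):

```python
# Toy illustration: with 2 physical nodes and N=3, every preference list
# (3 consecutive vnodes on the ring) must repeat a physical node.
RING_SIZE = 8                    # toy ring; real rings are 64+ partitions
NODES = ["node1", "node2"]
N_VAL = 3

# simplification: assume partitions are claimed round-robin across nodes
owners = [NODES[i % len(NODES)] for i in range(RING_SIZE)]

for p in range(RING_SIZE):
    preflist = [owners[(p + i) % RING_SIZE] for i in range(N_VAL)]
    print(p, preflist, "->", len(set(preflist)), "distinct node(s)")
# Every preference list covers at most 2 distinct nodes, so W=3 can never be
# satisfied by 3 independent machines, and one machine failure takes out
# 1-2 replicas of every object.
```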
11:41 <copious> the part I don't like is that with these estimates, to store our current # of documents (2+ billion) I would need at least 2x that right column
11:41 <copious> which seems a little excessive
11:46 <seancribbs> copious: fwiw, 6 nodes for 2 Bn documents seems small to me
11:47 <seancribbs> i would be concerned about I/O throughput at that size
11:51 <copious> well, those 96 and 128 GB lines are mostly pipe dreams. more realistically, I would think you would want a good chunk of RAM to be available for disk cache, so that is more in the 32/48 RAM line
11:51 <copious> so let's call it 12 machines for 2 billion docs.
11:52 <copious> to me that seems large, probably because we have worked with just 2 machines that satisfy the same load
11:52 <seancribbs> So, what I'm concerned about is the partition size
11:52 * copious listens
11:52 <seancribbs> copious: but with the same amount of replication?
11:53 <seancribbs> anyway
11:53 <copious> W=2 effectively
11:53 <seancribbs> ok. regarding partition/ring size: you're going to get better throughput if you have more partitions, but you have to balance that against the physical size of your cluster. too many partitions/node gets problematic
11:57 <copious> what is a good balance of partitions / node ?
11:57 <seancribbs> between 10 and 50
11:59 <copious> and is there a size limitation per partition ? or how does that all fit together
12:00 <seancribbs> no, just think of partitions/vnodes as bottlenecks. when you have an adequate number, your throughput is good
12:00 <copious> ok
12:01 <seancribbs> if you have too few, you'll have low throughput; if you have too many, you could overload a node
12:01 <seancribbs> anyway, that's a real simplified version of it
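Putting the 10-50 partitions/node guidance together with the fact that Riak ring sizes are powers of 2 makes ring sizing a small search. A sketch (ring_creation_size is the real config knob; the 4096 cap is just to bound the loop):

```python
# Ring sizes are powers of 2; aim for roughly 10-50 partitions per node.
def candidate_ring_sizes(node_count, lo=10, hi=50):
    ring = 64                      # common starting point (Riak's default)
    found = []
    while ring <= 4096:            # arbitrary cap to bound the search
        per_node = ring / node_count
        if lo <= per_node <= hi:
            found.append((ring, per_node))
        ring *= 2
    return found

for ring, per_node in candidate_ring_sizes(12):
    print(f"ring_creation_size {ring}: {per_node:.1f} partitions/node")
# for 12 nodes: 128 (~10.7), 256 (~21.3), and 512 (~42.7) all fit the range
```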
12:01 <copious> sounds good. another question. this one came up when I was explaining riak to a coworker. What is the cost of rebalancing? say I have the 12 machines, and the capacity is getting up there, and it's time to add new machines to the cluster
12:02 <seancribbs> time to transfer the partitions off disk and across the network, plus a little. you can transfer up to 4 partitions at a time by default. so at 12 machines, say you have a 256 ring size
12:03 <copious> how much does the transfer of the partitions negatively affect normal operation ?
12:04 <copious> 256 ring size, add one new machine...
12:04 <seancribbs> so that's a transfer of 19 or 20 partitions, from among the existing 12 nodes. so… handoff doesn't really hurt normal ops, but in small clusters you can sometimes get "not found" responses during handoff
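The 19-or-20 figure is just the new node's fair share of the ring, a sketch:

```python
# The new node's fair share of a 256-partition ring when it joins 12 nodes.
RING_SIZE = 256
EXISTING_NODES = 12

share = RING_SIZE / (EXISTING_NODES + 1)
print(f"partitions handed to the new node: ~{share:.1f}")  # ~19.7, i.e. 19-20
```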
12:08 <copious> would the "not found" happen randomly, or does it happen more with recent or older docs or some other factor ?
12:08 <seancribbs> it happens mostly when basic quorum collides with partitions in-transit
12:08 <copious> ok
12:08 <seancribbs> but i've only really seen it in small clusters, i.e. < 4 nodes
12:17 <copious> so back of the envelope: 20 partitions to transfer, each @ ~54 GiB (1TB per node w/ 21 partitions per node) when 1 machine is added
12:18 <copious> at about 1 GByte/sec transfer we're looking at ~20 hours to rebalance, does that sound about right ?
12:18 <seancribbs> ick, yeah. so, the flip side of this is that you could go with more, smaller machines. increase the ring size. 54GB per partition sounds like a lot
12:19 <holoway> exactly - if re-balancing time is the concern, you need to have more data storage nodes
12:23 <copious> I wouldn't say that re-balancing is the concern, just running some numbers
12:24 <copious> so what is the right size for a partition
12:31 <seancribbs> copious: i would hope much smaller than 54GB… but I don't know for sure
12:31 <seancribbs> i.e. no real-world data points
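A parameterized version of that back-of-envelope. The sustained handoff rate is the big unknown (wire speed, disk I/O, and the 4-at-a-time handoff concurrency all bound it), so it's an input here:

```python
# Rebalance time = partitions moved x partition size / sustained handoff rate.
def rebalance_hours(partitions, partition_gib, rate_mib_per_s):
    return partitions * partition_gib * 1024 / rate_mib_per_s / 3600

for rate in (15, 125, 1024):       # ~120 Mbit/s, ~1 Gbit/s, ~1 GByte/s
    print(f"{rate:>4} MiB/s sustained -> {rebalance_hours(20, 54, rate):.1f} h")
# 20 x 54 GiB is ~1.08 TiB: a full 1 GByte/s finishes in ~0.3 h (18 minutes),
# while the ~20 hour figure corresponds to a sustained ~15 MiB/s.
```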
15:00 <seancribbs> copious: I was thinking, it would be good to use basho_bench to preload a bunch of keys to simulate your production data size, then benchmark something resembling your production load (in terms of throughput and key-hotness)
15:01 <seancribbs> but maybe start small, then increase the data size until you start to see drop-off
15:02 <seancribbs> the challenge with it all is that there are so many variables
15:45 <copious> seancribbs: our data is pretty LRU, that is, the data that was inserted into the system most recently is what will be utilized the heaviest for probably the next day or so.
15:45 <copious> after that, there is a huge drop-off, which is why putting varnish in front of it works out really well for us.
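To make a basho_bench-style run reflect that recency-skewed pattern, the key generator has to favor newly written keys. A minimal sketch of such a chooser (hot_key and its parameters are hypothetical, not a basho_bench API):

```python
import random

# Recency-biased key chooser: keys are numbered in insert order and reads
# land mostly on the newest ones (hot for ~a day, then a steep drop-off).
def hot_key(total_keys, hot_fraction=0.05, hot_weight=0.9):
    hot_span = max(1, int(total_keys * hot_fraction))
    if random.random() < hot_weight:
        # 90% of reads hit the newest 5% of keys
        return random.randrange(total_keys - hot_span, total_keys)
    # the rest spread over everything older
    return random.randrange(0, max(1, total_keys - hot_span))

# e.g. with 1,000,000 preloaded keys most sampled reads fall near the top
print([hot_key(1_000_000) for _ in range(10)])
```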