Created
January 6, 2011 00:52
-
-
Save PharkMillups/767327 to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| 07:10 <svjson> Hey guys | |
| 07:10 <svjson> Time for a couple of questions? | |
| 07:21 <roder> svjson: what's up? | |
| 07:22 <svjson> I have some general questions & thoughts. I'm currently | |
| building a prototype for a new application for a customer, | |
| which is using riak for data storage | |
| 07:23 <svjson> Or riak-search, rather | |
| 07:24 <rustyk> svjson: sure thing, fire away | |
| 07:24 <svjson> I did some fiddling last week and got things | |
| to work the way I want | |
| 07:24 <svjson> But now I'm wondering a bit about indexing and data size | |
| 07:24 <svjson> The application is going to be, more or less, | |
| constantly bombarded with data and will need to store this data | |
| historically for several years | |
| 07:26 <svjson> So it's kind of write heavy - all data is | |
| guaranteed to be read at least once, as part of it's life-cycle, | |
| and after that it _may_ be requested again, but will mostly just... lie around | |
| 07:27 <svjson> Are there any known limit, or increase in | |
| response time, when hitting an index proportional to the | |
| amount of data than can be predicted? | |
| 07:28 <rustyk> svjson: the response time is going to grow proportional | |
| to two things 1) how many entries are in the index under the searched | |
| term and 2) how many objects are in the final result set | |
| 07:29 <svjson> Right | |
| 07:29 <rustyk> svjson: in other words, roughly how many objects will match each clause of the query, and how many objects will match the entire query | |
| 07:31 <svjson> In the case of 1), the first part of a query will | |
| be constantly growing, but the matches for the entire query will | |
| vary within a predictable range | |
| 07:32 <svjson> One case, for example is events tied to a specific entity | |
| 07:32 <svjson> An entity will usually have 50-60 events, sometimes more, | |
| but never as many as.. say.. a thousand. | |
| 07:34 <svjson> Grabbing these 50-60 events with the help of an | |
| index on a reference to such an entity should, in other words not | |
| noticably slow down with a huge data set, unless the query | |
| contains a search term that matches entries outside of these? | |
| 07:36 <rustyk> yes, that's correct | |
| 07:37 <rustyk> I'd get worried if you started to have clauses | |
| that matched hundreds of thousands to millions entries | |
| (depending on machine size), but 50 − 60 is no problem | |
| 07:38 <svjson> well, it might in "extreme" cases go up | |
| to a couple of hundred, but these will not be the norm | |
| 07:38 <rustyk> excellent… sounds like really good use case | |
| for Riak Search | |
| 07:39 <svjson> Any idea about how large data sets I would need | |
| to test with in order to make sure that I get acceptable | |
| performance? | |
| 07:39 <svjson> Or is that like asking how many trees will fit in my garden? | |
| 07:40 <svjson> (without more detail) | |
| 07:40 <rustyk> yeah… there are many factors… RAM, disk speed, # of processors, network speed, how large your hot set of data is, etc. | |
| 07:41 <rustyk> if you haven't played with Basho Bench before, | |
| you might want to check it out… it helps you create repeatable | |
| performance tests for things like this | |
| 07:41 <rustyk> plus it will give you an excuse to dive back into Erlang | |
| 07:41 <rustyk> ;-) | |
| 07:44 <svjson> Right. Will do. | |
| 07:45 <svjson> How is Riak aware of and dependent on what | |
| my "hot" data set is? | |
| 07:45 <svjson> For caching? | |
| 07:46 <svjson> Sounds like a good opportunity, yeah. | |
| 07:46 <rustyk> Riak doesn't actually keep track of it, but the | |
| underlying OS will naturally cache that data | |
| 07:46 <svjson> Right | |
| 07:47 <svjson> Well, during the life-cycle of one of these entities | |
| in this example, events constantly go down into riak, and when the | |
| life-cycle has expired, the events will be audited, and after that | |
| access is arbitrary | |
| 07:47 <svjson> Ie, administrators or staff has some sort of reason to go back and look at it. Which is rare | |
| 07:48 <svjson> So there won't be a lot of random access, no | |
| 07:48 <svjson> However, that brings me to another thing I wanted to bring up. About erlang. | |
| 07:49 <svjson> Some more background on this is that, this | |
| project at my customer has been in the pipe for a while and is | |
| informally referred to as "The Cassandra-project". | |
| 07:49 <svjson> I'm doing this prototype with Riak because I have a bad gut-feeling about Cassandra. | |
| 07:50 <svjson> Someone told me that the fact that I'd written | |
| some erlang to make this work as I wanted, was a potential nail | |
| in the coffin because "This is a java shop" | |
| 07:50 <svjson> So I'll be needing some good pro-Riak arguments | |
| to go. I'm sure you're shock-full of those :) | |
| 07:51 <svjson> (Yeah, I know the reasoning behind that is so | |
| narrow-mindedly dumb that it's a threat to the fabric of space-time, | |
| but that's the way it is...) | |
| 07:52 <rustyk> my only argument is "use the best tool for the job" ;-) | |
| 07:53 <rustyk> that said, if they are getting hung up on the custom | |
| extractor you wrote, you could potentially do some more processing | |
| *before* you store the data in Riak that would eliminate the need | |
| for the extractor | |
| 07:53 <seancribbs> svjson: point to erjang - even if they are | |
| a Java shop, non-Java languages are bursting on the JVM | |
| 07:57 <svjson> rustyk: Yeah, that's true... | |
| 07:58 <svjson> sean: If only that was an argument that would bite. | |
| 07:58 <seancribbs> svjson: yeah, it was a shot in the dark | |
| 07:58 <DeadZen> erjang is pretty cool | |
| 07:58 <svjson> I responded with say that if I could learn | |
| enough erlang to get that together in just a day, so should | |
| anyone else be able to | |
| 07:58 <seancribbs> you'd be better off arguing the | |
| differences between the data models… Riak's is much | |
| simpler to understand | |
| 07:59 <svjson> But I was thinking more about the merits | |
| about the data stores | |
| 07:59 <svjson> sean, exactly | |
| 07:59 <seancribbs> Cassandra's is more rich | |
| 07:59 <seancribbs> IMO those should be the primary concerns | |
| 07:59 <svjson> "rich" translates to "weird & complex" in my | |
| book, when it comes to Cassandra | |
| 07:59 <seancribbs> otherwise, it's just a piece of infrastructure | |
| 07:59 <seancribbs> i mean, do they complain that Oracle | |
| isn't written in Java? I think not | |
| 08:00 <DeadZen> the lines between oracle and java will fade ;) | |
| 08:00 <peschkaj> svjson: you can also point to apparent | |
| complexity - riak is fairly simple, especially when compared | |
| to the types of tuning you need for Cassandra. | |
| 08:00 <DeadZen> is riak HA? | |
| 08:00 <seancribbs> DeadZen: depends on what you mean by HA | |
| 08:01 <DeadZen> that's true | |
| 08:01 <DeadZen> i'm asking because for instance while im | |
| running a map reduce test | |
| 08:01 <svjson> peschkaj: Thanks - exactly. Thanks to the | |
| esoteric tuning that seems to be necessary with Cassandra, | |
| I kind of have a gut-feeling that if we do this project with it, | |
| people's phones will start ringing in the middle of night | |
| 08:01 <DeadZen> if i introduce a node during the test, things break ;) | |
| 08:01 <seancribbs> if you mean, survives single-node failure, then yes | |
| 08:01 <DeadZen> i guess that depends on what you mean by survive | |
| 08:01 <svjson> But I don't have enough Cassandra experience | |
| to verify that | |
| 08:01 <seancribbs> DeadZen: touche | |
| 08:02 <DeadZen> heh | |
| 08:02 <seancribbs> however, if you introduce a node, you're | |
| causing data to shift around the cluster | |
| 08:02 <seancribbs> so… the plan a MapReduce query was | |
| running will become moot | |
| 08:02 <seancribbs> I use the word "plan" lightly of course | |
| 08:02 <DeadZen> nods i get what you mean | |
| 08:03 <DeadZen> so say im using riak as a queue, so until | |
| that's complete (the data migration), the queue will be | |
| unrecognizable/inaccessible ? | |
| 08:03 <seancribbs> first, i wouldn't use it as a queue - use a queue. | |
| 08:04 <seancribbs> second, it will be accessible, it'll just | |
| be on the other 2 replicas | |
| 08:04 <peschkaj> DeadZen: you would still be able to access | |
| your data, it might be slow, depending on your networking gear | |
| 08:04 <DeadZen> nods I guess I tried to make a queue to better understand its personality | |
| 08:04 <peschkaj> svjson: you could take a look at the Cassandra | |
| documentation or the tutorials from Riptano | |
| 08:05 <DeadZen> seancribbs: do you know of a distributed queue | |
| for erlang? | |
| 08:05 <seancribbs> rabbitmq-cluster | |
| 08:05 <peschkaj> svjson: the other argument I use is that | |
| with riak I get a rock solid set of core functionality and | |
| I can build any other pieces that I need | |
| 08:05 <svjson> peschkaj: Build any other pieces - in what respect? | |
| 08:05 <DeadZen> seancribbs: yah thats true... do riak and | |
| rabbit play well together? ;) | |
| 08:06 <seancribbs> DeadZen: they don't run in the same | |
| erlang nodes, but yes | |
| 08:06 <peschkaj> svjson: well, if there's something that I don't | |
| need, it isn't | |
| there. But if I need an index to act a certain way, I can either | |
| use riak-search or else implement my own indexing mechanism | |
| 08:07 <peschkaj> svjson: theoretically, I could do the | |
| same thing with Cassandra, but it already has a lot more | |
| complexity that needs to be considered when I'm extending it | |
| 08:12 <svjson> peschkaj: Right, but then implementing our own stuff on top of it would back-fire as an argument. | |
| 08:12 <svjson> (again - erlang) | |
| 08:12 <DeadZen> mapred_verify is cool | |
| 08:13 <peschkaj> svjson: true, but ideally you'll | |
| never have to extend. I guess I | |
| just use the simplicity of Riak as an argument in | |
| favor of using it. | |
| 08:13 <svjson> There was a recent post... | |
| http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis - | |
| that concludes Riak like so: | |
| 08:13 <svjson> Best used: If you want something Cassandra-like | |
| (Dynamo-like), but no way you're gonna deal with the bloat and | |
| complexity. If you need very good single-site scalability, | |
| availability and fault-tolerance, but you're ready to pay for | |
| multi-site replication. | |
| 08:14 <peschkaj> yup | |
| 08:15 <seancribbs> svjson: it was very well put, yes | |
| 08:15 <svjson> Although that is the opinion of one individual, | |
| I'd like to know more about what he is actually referring to | |
| 08:15 <DeadZen> that sounds like a good synopsis | |
| 08:16 <svjson> With bloat I guess he means what you just | |
| mentioned - a lot of features we're not interested in | |
| 08:16 <seancribbs> svjson: also SLOC | |
| 08:16 <seancribbs> Riak is pretty small, all things considered | |
| 08:16 <peschkaj> yup, and riak is also readable. I find | |
| enterprise-y java to be infinitely illegible | |
| 08:16 <svjson> Complexity - The BigTable data model, the tuning stuff, etc? | |
| 08:17 <peschkaj> yeah, the complexity comes in when you're | |
| dealing with column families, super columns, data modeling | |
| and partitioning | |
| 08:17 <seancribbs> svjson: rumor has it that every major | |
| installation of Cassandra has their own customizations too | |
| 08:17 <seancribbs> which is a point of complexity | |
| 08:17 <peschkaj> there are a lot of tuning parameters when compared to riak | |
| 08:18 <DeadZen> and a giant xml config ;) | |
| 08:18 <peschkaj> from personal experience - i never got | |
| Cassandra to behave properly when I experimented with it | |
| 08:18 <svjson> Yeah, so we're effectively storing chunks of XML, | |
| caring only about a couple of fields in it for indexing, which | |
| Riak-search now handles perfectly fine | |
| 08:19 <svjson> DeadZen: Remember, this is a Java-land. Configuring | |
| stuff in a large XML file is considered a good thing over here ;) | |
| 08:19 <peschkaj> Java is a DSL for generating XML | |
| 08:19 <* seancribbs> likes erlang terms — short and sweet | |
| 08:19 <peschkaj> svjson: show them how much it would cost to | |
| implement with Oracle ;) | |
| 08:19 <DeadZen> file:consult ;) | |
| 08:19 <seancribbs> DeadZen: exactly | |
| 08:19 <svjson> peschkaj: Hah - that one goes in my quote-book | |
| 08:20 <svjson> So what about that last sentence - what is he referring to? | |
| 08:20 <svjson> "but you're ready to pay for multi-site replication" ? | |
| 08:20 <DeadZen> riak enterprise | |
| 08:20 <peschkaj> Riak can replicate across data centers, but you need | |
| to pay for the enterprise licensing | |
| 08:21 <peschkaj> so you could have three data centers and each | |
| one has a complete copy of your data | |
| 08:21 <seancribbs> svjson: that's another advantage of riak if you | |
| need long-haul repl… nodes in other DCs aren't part of the same cluster | |
| 08:21 <seancribbs> rack-awareness is… sketchy | |
| 08:21 <kreynolds> I did not realize you were required to pay for | |
| multi-dc riak setups | |
| 08:22 <seancribbs> kreynolds: you also get the 877 number that wakes | |
| me in the middle of the night ;) | |
| 08:22 <svjson> Right, there is talk about doing exactly that | |
| 08:22 <svjson> Or at least, having a separate "backup-rack" or | |
| something like that. | |
| 08:22 <svjson> I haven't been briefed about the exact idea | |
| 08:23 <kreynolds> seancribbs: I can definitely see paying to wake | |
| somebody up, I just didn't know that was a requirement | |
| 08:23 <seancribbs> kreynolds: we think that if you're ready | |
| for multiple data centers, you're probably ready for top-tier | |
| support too | |
| 08:24 <DeadZen> it makes sense that at that level why risk over-looking something | |
| 08:24 <kreynolds> I don't know that it's a bad thing, precisely | |
| 08:25 <svjson> Yeah, well, I don't think my customer is | |
| going to put neither Riak nor Cassandra into production | |
| without paying for support, anyways | |
| 08:25 <seancribbs> svjson: our VP BizDev does a great pitch for our services, if you'd like to be put in touch | |
| 08:26 <peschkaj> Paid support is an awesome thing - there's someone you can call to fix things :) | |
| 08:26 <svjson> sean: Might come a time for that. At this point, | |
| it's still in a pre-study phase. | |
| 08:26 <seancribbs> svjson: gotcha. just let us know | |
| 08:27 <svjson> sean: Cool. First step is to convince devs & | |
| architects about the technical merits | |
| 08:27 <svjson> About that, another question about the | |
| technical side of things... | |
| 08:27 <svjson> I mentioned before that data will be stored for a | |
| number of years, and after that it should be thrown out | |
| 08:28 <svjson> Are there any special features to cater for | |
| this, or should I index a timestamp and periodically query | |
| for documents that are older than a certain time limit? | |
| 08:28 <seancribbs> svjson: bitcask has expiration built-in | |
| 08:28 <seancribbs> but for decidedly smaller timeframes… it's measured in seconds | |
| 08:30 <kreynolds> seancribbs: My gut reaction was that paid | |
| support shouldn't be coupled to deployment. As frequently | |
| people want support even in simpler deployments, but in | |
| practice for a really-real multi-dc deployment, there are | |
| probably only a fraction of instances where somebody | |
| *wouldn't* want paid support | |
| 08:30 <svjson> How does that work under the hood? | |
| 08:31 <seancribbs> svjson: it's two mechanisms, first, | |
| if the expiry_secs configuration is set, it will be | |
| checked against on read. second, when bitcask merges, | |
| expired items will not be written to the new file | |
| 08:32 <peschkaj> would there be any harm setting | |
| expiry_secs to something like 31536000? | |
| 08:32 <svjson> Umm. When does bitcask merge, and why does it do it? :) | |
| 08:32 <seancribbs> peschkaj: i don't see why not | |
| 08:32 <seancribbs> svjson: there are various triggers | |
| 08:32 <peschkaj> svjson: there's a nice PDF that gives | |
| an overview: http://downloads.basho.com/papers/bitcask-intro.pdf | |
| 08:33 <svjson> sean: In my case that would be multiples of | |
| 31536000, still not a problem? | |
| 08:33 <DeadZen> is there any reason to use innostore over bitcask? | |
| 08:33 <seancribbs> Erlang has no problem with large numbers ;) | |
| 08:33 <peschkaj> Erlang - it can count! | |
| 08:33 <seancribbs> DeadZen: yes, to strictly control memory usage | |
| 08:33 <peschkaj> svjson: the merges happen to save | |
| space because of updates or deletions | |
| 08:33 <svjson> Amazing language. | |
| 08:33 <DeadZen> seancribbs: ahhh | |
| 08:33 <DeadZen> seancribbs: at the cost of? request speed? backup time? | |
| 08:34 <seancribbs> DeadZen: latency on inno is much less | |
| predictable | |
| 08:34 <DeadZen> got it | |
| 08:34 <seancribbs> also, it has costlier recovery | |
| 08:34 <seancribbs> svjson: https://github.com/basho/bitcask/blob/master/ebin/bitcask.app#L47 | |
| 08:34 <seancribbs> those are the config settings that specify | |
| when bitcask merges | |
| 08:35 <seancribbs> there are triggers and thresholds, any of which | |
| can cause merging | |
| 08:36 <seancribbs> sorry, got that a little mixed up. thresholds | |
| determine whether a file should be merged, triggers cause merges | |
| 08:36 <svjson> Right. But no updates will ever be issued on this data, and deletes would only happen because the data is expired. So nothing would actually trigger a bitcask merge, unless I start reading expired docs | |
| 08:36 <seancribbs> svjson: you misunderstand… merges will happen because those expired items become "dead bytes" | |
| 08:37 <svjson> sean: So somehow the buckets are scanned for expired items, still? | |
| 08:38 <seancribbs> there's an erlang process called | |
| bitcask_merge_worker which periodically checks whether it | |
| should merge files | |
| 08:38 <seancribbs> eventually, the space will be reclaimed | |
| 08:39 <seancribbs> keep in mind, too, that bitcask stores keys in | |
| RAM. so if you're storing lots of data, plan for lots of RAM | |
| or many machines | |
| 08:39 <seancribbs> https://spreadsheets.google.com/a/basho.com/ccc?key=0Ar810jMxoZMDdEdMMnA5Rjl0R1puRlVhR09wajUxWlE&authkey=CJ_ajeUB&hl=en#gid=0 | |
| 08:40 <svjson> sean: Many "cheap" machines, is the plan | |
| 08:40 <seancribbs> sounds good | |
| 08:42 <svjson> How would it handle a situation where there's | |
| not enough RAM? Swapping at the cost of performance degradation ? | |
| 08:42 <seancribbs> svjson: yes. at that point, you're in a world of hurt | |
| 08:43 <peschkaj> svjson: another thing to keep in mind is that | |
| you can't change the number of virtual nodes once you create a | |
| bucket so you need to plan accordingly | |
| 08:43 <seancribbs> peschkaj: you mean cluster, but yes | |
| 08:43 <peschkaj> derp, yes | |
| 08:44 <DeadZen> once you create a bucket you can't change | |
| the number of virtual nodes? | |
| 08:44 <seancribbs> cluster | |
| 08:44 <seancribbs> not bucket | |
| 08:44 <DeadZen> ah | |
| 08:44 <seancribbs> generally you plan that AOT | |
| 08:44 <svjson> Um, now I'm confused | |
| 08:45 <svjson> Once I set up a cluster, I cannot add nodes? | |
| 08:45 <seancribbs> the consistent hashing ring is divided into | |
| a number of partitions | |
| 08:45 <seancribbs> you can add nodes, but not vnodes | |
| 08:45 <seancribbs> 1 vnode belongs to 1 partition | |
| 08:45 <svjson> Right, so if I set it up with 64, I could never | |
| outgrow 64 physical nodes | |
| 08:46 <seancribbs> right. and that would be bad anyway because you'd have only one vnode per node | |
| 08:46 <peschkaj> seancribbs: isn't the recommendation ~10 vnodes per pnode? | |
| 08:46 <seancribbs> 10 or more | |
| 08:46 <svjson> What are the implications of using a large | |
| number of vnode on a small number of physical nodes? | |
| 08:46 <DeadZen> how do you control that? | |
| 08:47 <seancribbs> DeadZen: http://wiki.basho.com/Configuration-Files.html#ring_creation_size | |
| 08:47 <seancribbs> svjson: more resource utilization | |
| 08:48 <seancribbs> keep in mind that within the "good" range | |
| you can usually grow your cluster by a factor of 5 before | |
| needing to consider repartitioning | |
| 08:48 <seancribbs> and your needs at 4 nodes will be very different than at 20 | |
| 08:49 <svjson> Right. | |
| 08:50 <svjson> But we know roughly how much data we generate | |
| per day, and for how long we'll want to store it. It should | |
| be possible to come up with a reasonable estimate and add | |
| some margin to that. | |
| 08:50 <seancribbs> svjson: yep | |
| 08:50 <svjson> It's not a case of "if data comes", but rather "when data comes". :) | |
| 08:50 <seancribbs> you're already a step ahead of most people if you're doing capacity estimation | |
| 08:52 <svjson> Yeah, but it's not foolproof either. | |
| Organization and amount of and usage patterns of load-bringing | |
| customers might change over time. | |
| 08:52 <svjson> What about if we need to grow the amount of partitions in the end - how do we approach this? | |
| 08:53 <svjson> Down-time is out of the question | |
| 08:53 <peschkaj> svjson: as I understand it, you'd built a | |
| new cluster with more vnodes and then start migrating data to | |
| the new cluster. As you migrate data, you can then decommission | |
| servers from the original cluster and move them into the new cluster | |
| 08:54 <seancribbs> what peschkaj said | |
| 08:56 <svjson> Right. That adds a couple of things to consider in the application layer | |
| 08:56 <svjson> Cheap lunch, but not free :) | |
| 08:57 <peschkaj> Yup. If you know you're getting close | |
| to your size limits, you can start writing to the new | |
| cluster and use some kind of shard key on dates that you | |
| keep moving backwards in time as you push data into the new cluster. | |
| 08:57 <peschkaj> Too much customer data is usually a good problem to have. | |
| 08:58 <DeadZen> so an initial ring creation size of say | |
| 512 is overkill? | |
| 09:00 <svjson> Hmm. Btw, why is it important to have several vnodes per physical machine? | |
| 09:00 <seancribbs> DeadZen: with 512 partitions, you'd reach a | |
| practical limit of 50 nodes | |
| 09:00 <seancribbs> svjson: IO concurrency | |
| 09:00 <svjson> sean: one concurrent read or write per vnode? | |
| 09:00 <DeadZen> interesting aspect to think about | |
| 09:00 <seancribbs> svjson: yes, requests are serialized at the vnode level | |
| 09:01 <seancribbs> that is, a vnode can handle only | |
| one at a time | |
| 09:01 <svjson> Aha. Got it. | |
| 09:01 <seancribbs> also, it's important for spread | |
| 09:01 <seancribbs> more partitions = more even spread | |
| 09:04 <DeadZen> btw i learned riak_core the other day | |
| 09:04 <DeadZen> wanted to say its really cool ;) | |
| 09:05 <seancribbs> :) | |
| 09:07 <DeadZen> i think it helps to learn riak_core | |
| before trying to understand riak | |
| 09:07 <DeadZen> but thats just my opinion ;) | |
| 09:09 <svjson> So, what about Java client support for riak search? | |
| I've found two java clients, but can't see that any of them mention | |
| riak-search support | |
| 09:12 <seancribbs> svjson: there's no specific support, | |
| although we are working on that. but in some circumstances | |
| you can use a Solr client | |
| 09:12 <svjson> "some" sounds vague? :) | |
| 09:14 <seancribbs> well… the interface is nominally compatible with Solr | |
| 09:16 <svjson> Right. And the Solr interface is the way I | |
| query it from my application (Erlang CLI or search-cmd | |
| doesn't seem to be what I'm looking for) | |
| 09:46 <svjson> in any case, I think that concludes my Q&A session for the day. Thanks, everyone! |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment