07:10 <svjson> Hey guys
07:10 <svjson> Time for a couple of questions?
07:21 <roder> svjson: what's up?
07:22 <svjson> I have some general questions & thoughts. I'm currently
building a prototype for a new application for a customer,
which is using riak for data storage
07:23 <svjson> Or riak-search, rather
07:24 <rustyk> svjson: sure thing, fire away
07:24 <svjson> I did some fiddling last week and got things
to work the way I want
07:24 <svjson> But now I'm wondering a bit about indexing and data size
07:24 <svjson> The application is going to be, more or less,
constantly bombarded with data and will need to store this data
historically for several years
07:26 <svjson> So it's kind of write heavy - all data is
guaranteed to be read at least once, as part of its life-cycle,
and after that it _may_ be requested again, but will mostly just... lie around
07:27 <svjson> Are there any known limits, or predictable increases
in response time when hitting an index, proportional to the
amount of data?
07:28 <rustyk> svjson: the response time is going to grow proportional
to two things: 1) how many entries are in the index under the searched
term, and 2) how many objects are in the final result set
07:29 <svjson> Right
07:29 <rustyk> svjson: in other words, roughly how many objects will match each clause of the query, and how many objects will match the entire query
07:31 <svjson> In the case of 1), the first part of a query will
be constantly growing, but the matches for the entire query will
vary within a predictable range
07:32 <svjson> One case, for example, is events tied to a specific entity
07:32 <svjson> An entity will usually have 50-60 events, sometimes more,
but never as many as.. say.. a thousand.
07:34 <svjson> Grabbing these 50-60 events with the help of an
index on a reference to such an entity should, in other words, not
noticeably slow down with a huge data set, unless the query
contains a search term that matches entries outside of these?
07:36 <rustyk> yes, that's correct
07:37 <rustyk> I'd get worried if you started to have clauses
that matched hundreds of thousands to millions of entries
(depending on machine size), but 50-60 is no problem
07:38 <svjson> well, it might in "extreme" cases go up
to a couple of hundred, but these will not be the norm
07:38 <rustyk> excellent… sounds like a really good use case
for Riak Search
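
[As a hedged illustration, a scoped query of the kind discussed above might look like the following against the Solr-style interface; the field names are invented for this example. The first clause bounds the candidate set to the ~50-60 events of one entity, so response time shouldn't grow with total data volume.]

    entity_id:e42 AND event_type:statuschange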
07:39 <svjson> Any idea how large a data set I would need
to test with in order to make sure that I get acceptable
performance?
07:39 <svjson> Or is that like asking how many trees will fit in my garden?
07:40 <svjson> (without more detail)
07:40 <rustyk> yeah… there are many factors… RAM, disk speed, # of processors, network speed, how large your hot set of data is, etc.
07:41 <rustyk> if you haven't played with Basho Bench before,
you might want to check it out… it helps you create repeatable
performance tests for things like this
07:41 <rustyk> plus it will give you an excuse to dive back into Erlang
07:41 <rustyk> ;-)
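
[For reference, a minimal sketch of a Basho Bench configuration, which is a file of Erlang terms. The driver and generator names follow the examples shipped with Basho Bench but should be verified against its examples/ directory.]

    %% write-heavy workload sketch against a local Riak node
    {mode, max}.                             %% run as fast as possible
    {duration, 10}.                          %% minutes
    {concurrent, 5}.                         %% worker processes
    {driver, basho_bench_driver_riakc_pb}.   %% protocol buffers client driver
    {key_generator, {int_to_bin, {uniform_int, 1000000}}}.
    {value_generator, {fixed_bin, 4096}}.    %% 4 KB values
    {riakc_pb_ips, [{127,0,0,1}]}.
    {operations, [{put, 4}, {get, 1}]}.      %% four writes for every read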
07:44 <svjson> Right. Will do.
07:45 <svjson> How is Riak aware of and dependent on what
my "hot" data set is?
07:45 <svjson> For caching?
07:46 <svjson> Sounds like a good opportunity, yeah.
07:46 <rustyk> Riak doesn't actually keep track of it, but the
underlying OS will naturally cache that data
07:46 <svjson> Right
07:47 <svjson> Well, during the life-cycle of one of these entities
in this example, events constantly go down into riak, and when the
life-cycle has expired, the events will be audited, and after that
access is arbitrary
07:47 <svjson> I.e., administrators or staff have some sort of reason to go back and look at it. Which is rare
07:48 <svjson> So there won't be a lot of random access, no
07:48 <svjson> However, that brings me to another thing I wanted to bring up. About erlang.
07:48 <svjson> Some more background on this: the
project at my customer has been in the pipeline for a while and is
informally referred to as "The Cassandra-project".
07:49 <svjson> I'm doing this prototype with Riak because I have a bad gut-feeling about Cassandra.
07:50 <svjson> Someone told me that the fact that I'd written
some erlang to make this work as I wanted, was a potential nail
in the coffin because "This is a java shop"
07:50 <svjson> So I'll be needing some good pro-Riak arguments
to go on. I'm sure you're chock-full of those :)
07:51 <svjson> (Yeah, I know the reasoning behind that is so
narrow-mindedly dumb that it's a threat to the fabric of space-time,
but that's the way it is...)
07:52 <rustyk> my only argument is "use the best tool for the job" ;-)
07:53 <rustyk> that said, if they are getting hung up on the custom
extractor you wrote, you could potentially do some more processing
*before* you store the data in Riak that would eliminate the need
for the extractor
07:53 <seancribbs> svjson: point to erjang - even if they are
a Java shop, non-Java languages are bursting onto the JVM
07:57 <svjson> rustyk: Yeah, that's true...
07:58 <svjson> sean: If only that was an argument that would bite.
07:58 <seancribbs> svjson: yeah, it was a shot in the dark
07:58 <DeadZen> erjang is pretty cool
07:58 <svjson> I responded by saying that if I could learn
enough erlang to get that together in just a day, then
anyone else should be able to as well
07:58 <seancribbs> you'd be better off arguing the
differences between the data models… Riak's is much
simpler to understand
07:59 <svjson> But I was thinking more about the merits
of the data stores
07:59 <svjson> sean, exactly
07:59 <seancribbs> Cassandra's is richer
07:59 <seancribbs> IMO those should be the primary concerns
07:59 <svjson> "rich" translates to "weird & complex" in my
book, when it comes to Cassandra
07:59 <seancribbs> otherwise, it's just a piece of infrastructure
07:59 <seancribbs> i mean, do they complain that Oracle
isn't written in Java? I think not
08:00 <DeadZen> the lines between oracle and java will fade ;)
08:00 <peschkaj> svjson: you can also point to apparent
complexity - riak is fairly simple, especially when compared
to the types of tuning you need for Cassandra.
08:00 <DeadZen> is riak HA?
08:00 <seancribbs> DeadZen: depends on what you mean by HA
08:01 <DeadZen> that's true
08:01 <DeadZen> i'm asking because, for instance, while I'm
running a map reduce test
08:01 <svjson> peschkaj: Thanks - exactly. Thanks to the
esoteric tuning that seems to be necessary with Cassandra,
I kind of have a gut-feeling that if we do this project with it,
people's phones will start ringing in the middle of night
08:01 <DeadZen> if i introduce a node during the test, things break ;)
08:01 <seancribbs> if you mean, survives single-node failure, then yes
08:01 <DeadZen> i guess that depends on what you mean by survive
08:01 <svjson> But I don't have enough Cassandra experience
to verify that
08:01 <seancribbs> DeadZen: touche
08:02 <DeadZen> heh
08:02 <seancribbs> however, if you introduce a node, you're
causing data to shift around the cluster
08:02 <seancribbs> so… the plan a MapReduce query was
running will become moot
08:02 <seancribbs> I use the word "plan" lightly of course
08:02 <DeadZen> *nods* I get what you mean
08:03 <DeadZen> so say I'm using riak as a queue; until
that's complete (the data migration), the queue will be
unrecognizable/inaccessible?
08:03 <seancribbs> first, i wouldn't use it as a queue - use a queue.
08:04 <seancribbs> second, it will be accessible, it'll just
be on the other 2 replicas
08:04 <peschkaj> DeadZen: you would still be able to access
your data, it might be slow, depending on your networking gear
08:04 <DeadZen> *nods* I guess I tried to make a queue to better understand its personality
08:04 <peschkaj> svjson: you could take a look at the Cassandra
documentation or the tutorials from Riptano
08:05 <DeadZen> seancribbs: do you know of a distributed queue
for erlang?
08:05 <seancribbs> rabbitmq-cluster
08:05 <peschkaj> svjson: the other argument I use is that
with riak I get a rock solid set of core functionality and
I can build any other pieces that I need
08:05 <svjson> peschkaj: Build any other pieces - in what respect?
08:05 <DeadZen> seancribbs: yeah, that's true... do riak and
rabbit play well together? ;)
08:06 <seancribbs> DeadZen: they don't run in the same
erlang nodes, but yes
08:06 <peschkaj> svjson: well, if there's something that I don't
need, it isn't there. But if I need an index to act a certain way,
I can either use riak-search or else implement my own indexing
mechanism
08:07 <peschkaj> svjson: theoretically, I could do the
same thing with Cassandra, but it already has a lot more
complexity that needs to be considered when I'm extending it
08:12 <svjson> peschkaj: Right, but then implementing our own stuff on top of it would backfire as an argument.
08:12 <svjson> (again - erlang)
08:12 <DeadZen> mapred_verify is cool
08:13 <peschkaj> svjson: true, but ideally you'll
never have to extend. I guess I
just use the simplicity of Riak as an argument in
favor of using it.
08:13 <svjson> There was a recent post...
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis -
that sums up Riak like so:
08:13 <svjson> Best used: If you want something Cassandra-like
(Dynamo-like), but no way you're gonna deal with the bloat and
complexity. If you need very good single-site scalability,
availability and fault-tolerance, but you're ready to pay for
multi-site replication.
08:14 <peschkaj> yup
08:15 <seancribbs> svjson: it was very well put, yes
08:15 <svjson> Although that is the opinion of one individual,
I'd like to know more about what he is actually referring to
08:15 <DeadZen> that sounds like a good synopsis
08:16 <svjson> With bloat I guess he means what you just
mentioned - a lot of features we're not interested in
08:16 <seancribbs> svjson: also SLOC
08:16 <seancribbs> Riak is pretty small, all things considered
08:16 <peschkaj> yup, and riak is also readable. I find
enterprise-y java to be infinitely illegible
08:16 <svjson> Complexity - The BigTable data model, the tuning stuff, etc?
08:17 <peschkaj> yeah, the complexity comes in when you're
dealing with column families, super columns, data modeling
and partitioning
08:17 <seancribbs> svjson: rumor has it that every major
installation of Cassandra has their own customizations too
08:17 <seancribbs> which is a point of complexity
08:17 <peschkaj> there are a lot of tuning parameters when compared to riak
08:18 <DeadZen> and a giant xml config ;)
08:18 <peschkaj> from personal experience - I never got
Cassandra to behave properly when I experimented with it
08:18 <svjson> Yeah, so we're effectively storing chunks of XML,
caring only about a couple of fields in it for indexing, which
Riak-search now handles perfectly fine
08:19 <svjson> DeadZen: Remember, this is Java-land. Configuring
stuff in a large XML file is considered a good thing over here ;)
08:19 <peschkaj> Java is a DSL for generating XML
08:19 * seancribbs likes erlang terms - short and sweet
08:19 <peschkaj> svjson: show them how much it would cost to
implement with Oracle ;)
08:19 <DeadZen> file:consult ;)
08:19 <seancribbs> DeadZen: exactly
08:19 <svjson> peschkaj: Hah - that one goes in my quote-book
08:20 <svjson> So what about that last sentence - what is he referring to?
08:20 <svjson> "but you're ready to pay for multi-site replication" ?
08:20 <DeadZen> riak enterprise
08:20 <peschkaj> Riak can replicate across data centers, but you need
to pay for the enterprise licensing
08:21 <peschkaj> so you could have three data centers and each
one has a complete copy of your data
08:21 <seancribbs> svjson: that's another advantage of riak if you
need long-haul repl… nodes in other DCs aren't part of the same cluster
08:21 <seancribbs> rack-awareness is… sketchy
08:21 <kreynolds> I did not realize you were required to pay for
multi-dc riak setups
08:22 <seancribbs> kreynolds: you also get the 877 number that wakes
me in the middle of the night ;)
08:22 <svjson> Right, there is talk about doing exactly that
08:22 <svjson> Or at least, having a separate "backup-rack" or
something like that.
08:22 <svjson> I haven't been briefed about the exact idea
08:23 <kreynolds> seancribbs: I can definitely see paying to wake
somebody up, I just didn't know that was a requirement
08:23 <seancribbs> kreynolds: we think that if you're ready
for multiple data centers, you're probably ready for top-tier
support too
08:24 <DeadZen> it makes sense; at that level, why risk overlooking something?
08:24 <kreynolds> I don't know that it's a bad thing, precisely
08:25 <svjson> Yeah, well, I don't think my customer is
going to put either Riak or Cassandra into production
without paying for support, anyway
08:25 <seancribbs> svjson: our VP BizDev does a great pitch for our services, if you'd like to be put in touch
08:26 <peschkaj> Paid support is an awesome thing - there's someone you can call to fix things :)
08:26 <svjson> sean: Might come a time for that. At this point,
it's still in a pre-study phase.
08:26 <seancribbs> svjson: gotcha. just let us know
08:27 <svjson> sean: Cool. First step is to convince devs &
architects about the technical merits
08:27 <svjson> About that, another question about the
technical side of things...
08:27 <svjson> I mentioned before that data will be stored for a
number of years, and after that it should be thrown out
08:28 <svjson> Are there any special features to cater for
this, or should I index a timestamp and periodically query
for documents that are older than a certain time limit?
08:28 <seancribbs> svjson: bitcask has expiration built-in
08:28 <seancribbs> but for decidedly smaller timeframes… it's measured in seconds
08:30 <kreynolds> seancribbs: My gut reaction was that paid
support shouldn't be coupled to deployment, as people
frequently want support even in simpler deployments. But in
practice, for a really-real multi-dc deployment, there are
probably only a fraction of instances where somebody
*wouldn't* want paid support
08:30 <svjson> How does that work under the hood?
08:31 <seancribbs> svjson: it's two mechanisms, first,
if the expiry_secs configuration is set, it will be
checked against on read. second, when bitcask merges,
expired items will not be written to the new file
08:32 <peschkaj> would there be any harm setting
expiry_secs to something like 31536000?
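
[As a concrete sketch, enabling this is one setting in the bitcask section of app.config; the data_root path below is illustrative, and expiry_secs defaults to -1, meaning no expiration.]

    {bitcask, [
        {data_root, "/var/lib/riak/bitcask"},
        {expiry_secs, 31536000}   %% expire entries roughly one year after they are written
    ]},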
08:32 <svjson> Umm. When does bitcask merge, and why does it do it? :)
08:32 <seancribbs> peschkaj: i don't see why not
08:32 <seancribbs> svjson: there are various triggers
08:32 <peschkaj> svjson: there's a nice PDF that gives
an overview: http://downloads.basho.com/papers/bitcask-intro.pdf
08:33 <svjson> sean: In my case that would be multiples of
31536000, still not a problem?
08:33 <DeadZen> is there any reason to use innostore over bitcask?
08:33 <seancribbs> Erlang has no problem with large numbers ;)
08:33 <peschkaj> Erlang - it can count!
08:33 <seancribbs> DeadZen: yes, to strictly control memory usage
08:33 <peschkaj> svjson: the merges happen to save
space because of updates or deletions
08:33 <svjson> Amazing language.
08:33 <DeadZen> seancribbs: ahhh
08:33 <DeadZen> seancribbs: at the cost of? request speed? backup time?
08:34 <seancribbs> DeadZen: latency on inno is much less
predictable
08:34 <DeadZen> got it
08:34 <seancribbs> also, it has costlier recovery
08:34 <seancribbs> svjson: https://github.com/basho/bitcask/blob/master/ebin/bitcask.app#L47
08:34 <seancribbs> those are the config settings that specify
when bitcask merges
08:35 <seancribbs> there are triggers and thresholds, any of which
can cause merging
08:36 <seancribbs> sorry, got that a little mixed up. thresholds
determine whether a file should be merged, triggers cause merges
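
[For reference, a hedged sketch of those knobs with what are believed to be the defaults from the bitcask.app file linked above; check the file for the authoritative values.]

    {frag_merge_trigger, 60},               %% trigger: a file is >= 60% fragmented
    {dead_bytes_merge_trigger, 536870912},  %% trigger: a file holds >= 512 MB of dead bytes
    {frag_threshold, 40},                   %% threshold: merge files >= 40% fragmented
    {dead_bytes_threshold, 134217728},      %% threshold: merge files with >= 128 MB of dead bytes
    {small_file_threshold, 10485760},       %% threshold: merge files smaller than 10 MB
    {max_file_size, 2147483648}             %% roll to a new data file at 2 GB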
08:36 <svjson> Right. But no updates will ever be issued on this data, and deletes would only happen because the data is expired. So nothing would actually trigger a bitcask merge, unless I start reading expired docs
08:36 <seancribbs> svjson: you misunderstand… merges will happen because those expired items become "dead bytes"
08:37 <svjson> sean: So somehow the buckets are scanned for expired items, still?
08:38 <seancribbs> there's an erlang process called
bitcask_merge_worker which periodically checks whether it
should merge files
08:38 <seancribbs> eventually, the space will be reclaimed
08:39 <seancribbs> keep in mind, too, that bitcask stores keys in
RAM. so if you're storing lots of data, plan for lots of RAM
or many machines
08:39 <seancribbs> https://spreadsheets.google.com/a/basho.com/ccc?key=0Ar810jMxoZMDdEdMMnA5Rjl0R1puRlVhR09wajUxWlE&authkey=CJ_ajeUB&hl=en#gid=0
08:40 <svjson> sean: Many "cheap" machines, is the plan
08:40 <seancribbs> sounds good
08:42 <svjson> How would it handle a situation where there's
not enough RAM? Swapping, at the cost of performance degradation?
08:42 <seancribbs> svjson: yes. at that point, you're in a world of hurt
08:43 <peschkaj> svjson: another thing to keep in mind is that
you can't change the number of virtual nodes once you create a
bucket so you need to plan accordingly
08:43 <seancribbs> peschkaj: you mean cluster, but yes
08:43 <peschkaj> derp, yes
08:44 <DeadZen> once you create a bucket you can't change
the number of virtual nodes?
08:44 <seancribbs> cluster
08:44 <seancribbs> not bucket
08:44 <DeadZen> ah
08:44 <seancribbs> generally you plan that AOT
08:44 <svjson> Um, now I'm confused
08:45 <svjson> Once I set up a cluster, I cannot add nodes?
08:45 <seancribbs> the consistent hashing ring is divided into
a number of partitions
08:45 <seancribbs> you can add nodes, but not vnodes
08:45 <seancribbs> 1 vnode belongs to 1 partition
08:45 <svjson> Right, so if I set it up with 64, I could never
outgrow 64 physical nodes
08:46 <seancribbs> right. and that would be bad anyway because you'd have only one vnode per node
08:46 <peschkaj> seancribbs: isn't the recommendation ~10 vnodes per pnode?
08:46 <seancribbs> 10 or more
08:46 <svjson> What are the implications of using a large
number of vnodes on a small number of physical nodes?
08:46 <DeadZen> how do you control that?
08:47 <seancribbs> DeadZen: http://wiki.basho.com/Configuration-Files.html#ring_creation_size
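
[A minimal app.config sketch of that setting; note that it must be chosen before the cluster is first started and cannot be changed afterwards.]

    {riak_core, [
        {ring_creation_size, 256}   %% number of partitions; a power of 2
    ]},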
08:47 <seancribbs> svjson: more resource utilization
08:48 <seancribbs> keep in mind that within the "good" range
you can usually grow your cluster by a factor of 5 before
needing to consider repartitioning
08:48 <seancribbs> and your needs at 4 nodes will be very different than at 20
08:49 <svjson> Right.
08:50 <svjson> But we know roughly how much data we generate
per day, and for how long we'll want to store it. It should
be possible to come up with a reasonable estimate and add
some margin to that.
08:50 <seancribbs> svjson: yep
08:50 <svjson> It's not a case of "if data comes", but rather "when data comes". :)
08:50 <seancribbs> you're already a step ahead of most people if you're doing capacity estimation
08:52 <svjson> Yeah, but it's not foolproof either.
The organization, volume, and usage patterns of load-bringing
customers might change over time.
08:52 <svjson> What if we need to grow the number of partitions in the end - how do we approach this?
08:53 <svjson> Down-time is out of the question
08:53 <peschkaj> svjson: as I understand it, you'd build a
new cluster with more vnodes and then start migrating data to
the new cluster. As you migrate data, you can then decommission
servers from the original cluster and move them into the new cluster
08:54 <seancribbs> what peschkaj said
08:56 <svjson> Right. That adds a couple of things to consider in the application layer
08:56 <svjson> Cheap lunch, but not free :)
08:57 <peschkaj> Yup. If you know you're getting close
to your size limits, you can start writing to the new
cluster and use some kind of shard key on dates that you
keep moving backwards in time as you push data into the new cluster.
08:57 <peschkaj> Too much customer data is usually a good problem to have.
08:58 <DeadZen> so an initial ring creation size of say
512 is overkill?
09:00 <svjson> Hmm. Btw, why is it important to have several vnodes per physical machine?
09:00 <seancribbs> DeadZen: with 512 partitions, you'd reach a
practical limit of 50 nodes
09:00 <seancribbs> svjson: IO concurrency
09:00 <svjson> sean: one concurrent read or write per vnode?
09:00 <DeadZen> interesting aspect to think about
09:00 <seancribbs> svjson: yes, requests are serialized at the vnode level
09:01 <seancribbs> that is, a vnode can handle only
one at a time
09:01 <svjson> Aha. Got it.
09:01 <seancribbs> also, it's important for spread
09:01 <seancribbs> more partitions = more even spread
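
[Rough arithmetic combining the two rules of thumb above; figures are approximate.]

    ring_creation_size    practical ceiling at ~10 vnodes per node
    64                    ~6 nodes
    256                   ~25 nodes
    512                   ~50 nodes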
09:04 <DeadZen> btw I learned riak_core the other day
09:04 <DeadZen> wanted to say it's really cool ;)
09:05 <seancribbs> :)
09:07 <DeadZen> I think it helps to learn riak_core
before trying to understand riak
09:07 <DeadZen> but that's just my opinion ;)
09:09 <svjson> So, what about Java client support for riak search?
I've found two java clients, but can't see that any of them mention
riak-search support
09:12 <seancribbs> svjson: there's no specific support,
although we are working on that. but in some circumstances
you can use a Solr client
09:12 <svjson> "some" sounds vague? :)
09:14 <seancribbs> well… the interface is nominally compatible with Solr
09:16 <svjson> Right. And the Solr interface is the way I
query it from my application (Erlang CLI or search-cmd
doesn't seem to be what I'm looking for)
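
[A hedged sketch of querying that Solr-style endpoint from Erlang, assuming a local node on the default HTTP port and an index named "events"; both are illustrative, and Riak Search may not support every standard Solr parameter.]

    %% minimal HTTP GET against Riak Search's /solr/<index>/select interface
    inets:start(),
    Url = "http://localhost:8098/solr/events/select?q=entity_id:e42",
    {ok, {{_Version, 200, _Reason}, _Headers, Body}} = httpc:request(Url),
    io:format("~s~n", [Body]).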
09:46 <svjson> in any case, I think that concludes my Q&A session for the day. Thanks, everyone!