Created
January 12, 2011 21:42
| 05:53 <RJ2> can anyone elaborate on how efficient doing a "last N results ordered by <datefield> desc" | |
| for stuff in a given bucket is with riak search? | |
| 06:00 <seancribbs> RJ2: depends on how many results you have total, i believe | |
| 06:02 <RJ2> let's say each bucket commonly has 5-10k entries, maybe as many as 100k sometimes | |
| 06:03 <RJ2> and i want to do range queries on a date field, returning no more than 1000 entries at a time | |
| 06:03 <RJ2> and often just the most recent 1000 entries | |
| 06:03 <RJ2> seancribbs: does that seem sensible with riaksearch? | |
| 06:03 <seancribbs> sure | |
| 06:04 <RJ2> each bucket would be a logfile, so i want riaksearch for range queries, | |
| and also text-search on the contents of the log-line | |
| 06:05 <RJ2> and there are lots of logfiles, so maybe a million+ buckets | |
| 06:05 <RJ2> altho most logs are small (<100 lines) | |
| 06:07 <RJ2> another thing i was wondering: what is the jvm/java used for exactly? | |
| if i'm using a basic tokenizer written in erlang do i even need the java part for riak search? | |
| 06:24 <seancribbs> RJ2: sorry, trying to get started on Monday morning. | |
| 06:24 <seancribbs> One thing you'll want to do is to set the default schema… | |
| make it use the whitespace analyzer | |
| 06:25 <RJ2> (np, monday afternoon here.. :) | |
| 06:25 <seancribbs> that way, if you have small codes/non-word stuff in your | |
| log info, it'll still be searchable | |
| 06:25 <seancribbs> so… the JVM. | |
| 06:25 <seancribbs> in 0.14 that's going to be optional, but it was only used so you | |
| could choose existing Lucene analyzers | |
| 06:26 <seancribbs> esp if you had an existing analyzer for your use-case | |
| 06:27 <RJ2> ok good, that's what i was hoping - i could start with whitespace analyzer, | |
| would probably write my own simple one in erlang tho, since i want to index domains from full urls etc too | |
| 06:27 <RJ2> example.com/foo -> "example.com", "foo", "example.com/foo" | |
| 06:27 <seancribbs> sounds like a good plan | |
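Editor's sketch: the URL tokenization RJ2 describes (`example.com/foo` -> "example.com", "foo", "example.com/foo") could look roughly like the following. It is written in Python for illustration only; a real Riak Search analyzer would be written in Erlang, and the function name `tokenize_url` is hypothetical.

```python
def tokenize_url(url: str) -> list[str]:
    """Emit the domain, each path segment, and the whole URL as tokens."""
    # Drop an optional scheme such as "http://" and any trailing slash.
    stripped = url.split("://", 1)[-1].rstrip("/")
    domain, _, path = stripped.partition("/")

    tokens = [domain]
    tokens.extend(segment for segment in path.split("/") if segment)
    # Add the full URL unless it is identical to the bare domain.
    if stripped not in tokens:
        tokens.append(stripped)
    return tokens


print(tokenize_url("example.com/foo"))
```

Query strings, ports, and fragments are deliberately not handled here.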
| 06:28 <RJ2> also it would be useful to be able to treat entries (log-lines) as a doubly | |
| linked list i could walk in either direction | |
| 06:29 <RJ2> could be done using links, but it's a little fiddly since i have to either buffer | |
| 1 line before writing, or update the previous value to set the fwd link to the value i just entered | |
| 06:29 <RJ2> since log lines are being added sequentially in realtime | |
| 06:29 <RJ2> is there any built-in or other optimisation to walk the keys in insertion order | |
| without using links? | |
| 06:30 <RJ2> (i could however just add a backwards link to each new entry, | |
| pointing to the previous. that wouldn't require any trickery, nice to have fwd links too though) | |
| 06:31 <RJ2> one reason is if i do a search and find a specific key, i would like context - | |
| 5 lines above and below | |
| 06:35 <RJ2> i could also do the context by querying something like "5 values greater/less | |
| than <date field from match> order by date desc/asc", but that's much more heavyweight | |
| 06:36 <seancribbs> RJ2: there's no built-in mechanism to walk that list, other than to follow | |
| links, but you need to know how many to follow | |
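Editor's sketch: the doubly linked approach discussed above, modelled with a plain dict standing in for one bucket. This is not the Riak client API; `LinkedLog` and its field names are hypothetical. It shows the extra write RJ2 mentions (updating the previous entry's forward link) and a bounded walk in each direction for "5 lines above and below".

```python
class LinkedLog:
    """In-memory model of log lines joined by prev/next links."""

    def __init__(self):
        self.entries = {}   # key -> {"text", "prev", "next"}
        self.tail = None    # key of the most recently appended line

    def append(self, key, text):
        self.entries[key] = {"text": text, "prev": self.tail, "next": None}
        if self.tail is not None:
            # The second write: set the forward link on the previous entry.
            self.entries[self.tail]["next"] = key
        self.tail = key

    def _walk(self, key, direction, n):
        """Follow up to n links from key in the given direction."""
        found = []
        cursor = self.entries[key][direction]
        while cursor is not None and len(found) < n:
            found.append(cursor)
            cursor = self.entries[cursor][direction]
        return found

    def context(self, key, n=5):
        """Return up to n keys before and after key, in insertion order."""
        before = self._walk(key, "prev", n)[::-1]
        after = self._walk(key, "next", n)
        return before + [key] + after


log = LinkedLog()
for i in range(1, 11):
    log.append(f"l{i}", f"line {i}")
print(log.context("l5", n=2))
```

With only backward links, `append` needs a single write per line, but `context` could then return only the lines before the match.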
| 06:37 <seancribbs> have you also looked at sequential databases like RRD etc? Riak is | |
| really optimized for random access | |
| 06:38 <RJ2> yes, kv riak isn't really suitable for what i want, but i think | |
| riaksearch might be a candidate. i need text search, and i don't want the | |
| hassle of managing sharding myself if possible | |
| 06:40 <RJ2> by far the most common query i'd need is "last N lines from a bucket", which i can have a redis | |
| list caching if necessary; trying to get a feel for how well suited riak is for the rest (cache misses, | |
| loading older data, text search) | |
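Editor's sketch: the "last N lines" cache RJ2 mentions, modelled in memory. In Redis this is commonly done by pushing onto a list and trimming it (the LPUSH and LTRIM commands); here a bounded `deque` stands in so the example runs without a server. The class name `RecentLines` and the cap of 1000 are assumptions for illustration.

```python
from collections import deque


class RecentLines:
    """Keep only the most recent max_lines entries for one log."""

    def __init__(self, max_lines=1000):
        # A deque with maxlen drops the oldest entry automatically.
        self._lines = deque(maxlen=max_lines)

    def push(self, line):
        self._lines.append(line)

    def last(self, n):
        """Return up to n most recent lines, oldest first."""
        if n <= 0:
            return []
        return list(self._lines)[-n:]


cache = RecentLines(max_lines=3)
for line in ["a", "b", "c", "d"]:
    cache.push(line)
print(cache.last(2))
```

Requests for more lines than the cache holds would fall through to the backing store, which is the cache-miss case RJ2 is asking about.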
| 06:42 <RJ2> fwiw, it's for irccloud.com - multi-user hosted irc bouncer with web front end | |
| 06:54 <seancribbs> RJ2: sounds like an interesting problem. let us know how it goes, and how else we can help | |
| 06:55 <RJ2> seancribbs: thanks, will do | |
| 07:22 <peschkaj> RJ2: you might want to talk to siculars about using Redis and Riak together. | |
| he was planning a blog post about using Redis as an index on top of Riak, but I don't know how far along he is. | |
| 07:23 <RJ2> peschkaj: thanks | |
| 07:23 <siculars> peschkaj: ha . i'm working on a post now about pagination in riak | |
| 07:24 <siculars> the conclusion is that yes it can be done but i would probably use redis | |
| for the nimble lifting instead | |
| 07:24 <siculars> it being pagination | |
| 07:24 <peschkaj> I look forward to reading it | |
| 07:26 <peschkaj> RJ2: I don't know what your use case is, but you might want to consider | |
| tokenizing the top part of the domain (sub.domain.TLD) and reversing the parts (TLD.domain.sub). | |
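Editor's sketch: the reversal peschkaj suggests, as a one-line Python function (the name `reverse_domain` is hypothetical). Reversing the labels puts the most general part first, so hosts under the same domain share a common prefix.

```python
def reverse_domain(host: str) -> str:
    """Turn 'sub.domain.TLD' into 'TLD.domain.sub'."""
    return ".".join(reversed(host.split(".")))


print(reverse_domain("sub.domain.TLD"))
```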
| 07:27 <RJ2> peschkaj: mostly so people can search the log for mentions of an url | |
| without knowing the full exact url, eg "there was a mention of some blog post on example.com yesterday.." | |
| 07:28 <peschkaj> ah, yeah, then you can ignore me | |
| 07:28 <RJ2> i might also emit some unique token whenever an url is encountered too, | |
| so i can easily find all lines that contain an url | |
| 07:29 <RJ2> not sure if that's better than a dedicated field for it | |
| 07:31 <peschkaj> you could create a bucket that's inverted index... the keys would be the domain | |
| names and the value would be an array of line keys. if you have a lot of data that could be better than using links | |
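Editor's sketch: the inverted-index bucket peschkaj describes, with a dict standing in for the bucket. Keys are domain names and each value is a list of line keys. `DomainIndex` and its methods are hypothetical names, and concurrent updates to the same key are not addressed.

```python
from collections import defaultdict


class DomainIndex:
    """Map each domain name to the keys of the log lines that mention it."""

    def __init__(self):
        self.postings = defaultdict(list)  # domain -> [line keys]

    def add(self, line_key, domains):
        # De-duplicate so a line mentioning a domain twice is indexed once.
        for domain in set(domains):
            self.postings[domain].append(line_key)

    def lines_for(self, domain):
        """Return a copy of the line keys indexed under this domain."""
        return list(self.postings.get(domain, []))


index = DomainIndex()
index.add("line1", ["example.com"])
index.add("line2", ["example.com", "example.org", "example.com"])
print(index.lines_for("example.com"))
```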