@PharkMillups
Created January 12, 2011 21:42
05:53 <RJ2> can anyone elaborate on how efficient doing a "last N results ordered by <datefield> desc"
for stuff in a given bucket is with riak search?
06:00 <seancribbs> RJ2: depends on how many results you have total, i believe
06:02 <RJ2> let's say each bucket commonly has 5-10k entries, maybe as many as 100k sometimes
06:03 <RJ2> and i want to do range queries on a date field, returning no more than 1000 entries at a time
06:03 <RJ2> and often just the most recent 1000 entries
06:03 <RJ2> seancribbs: does that seem sensible with riaksearch?
06:03 <seancribbs> sure
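
The query shape RJ2 describes (a date-range filter, newest first, capped at N results) can be illustrated in plain Python. This is only an in-memory sketch of the semantics, not how Riak Search executes such a query; the function name `last_n_in_range` and the `(timestamp, line)` tuple layout are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right

def last_n_in_range(entries, start, end, n=1000):
    """Return up to n entries with start <= timestamp <= end, newest first.

    entries is assumed to be a list of (timestamp, line) tuples already
    sorted ascending by timestamp (hypothetical layout for this sketch).
    """
    if n <= 0:
        return []
    stamps = [ts for ts, _ in entries]
    lo = bisect_left(stamps, start)   # first index with timestamp >= start
    hi = bisect_right(stamps, end)    # one past the last index with timestamp <= end
    window = entries[lo:hi]
    # Keep only the n most recent entries and reverse them into descending order.
    return window[-n:][::-1]
```

With 5-10k entries per bucket the cost here is dominated by building the timestamp list; a real index would avoid that scan.
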
06:04 <RJ2> each bucket would be a logfile, so i want riak search for range queries,
and also text-search on the contents of the log-line
06:05 <RJ2> and there are lots of logfiles, so maybe a million+ buckets
06:05 <RJ2> altho most logs are small (<100 lines)
06:07 <RJ2> another thing i was wondering: what is the jvm/java used for exactly?
if i'm using a basic tokenizer written in erlang do i even need the java part for riak search?
06:24 <seancribbs> RJ2: sorry, trying to get started on Monday morning.
06:24 <seancribbs> One thing you'll want to do is to set the default schema…
make it use the whitespace analyzer
06:25 <RJ2> (np, monday afternoon here.. :)
06:25 <seancribbs> that way, if you have small codes/non-word stuff in your
log info, it'll still be searchable
06:25 <seancribbs> so… the JVM.
06:25 <seancribbs> in 0.14 that's going to be optional, but it was only used so you
could choose existing Lucene analyzers
06:26 <seancribbs> esp if you had an existing analyzer for your use-case
06:27 <RJ2> ok good, that's what i was hoping - i could start with whitespace analyzer,
would probably write my own simple one in erlang tho, since i want to index domains from full urls etc too
06:27 <RJ2> example.com/foo -> "example.com", "foo", "example.com/foo"
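
The URL tokenization RJ2 describes could look roughly like the following. RJ2 planned to write the analyzer in Erlang; this Python version is only a sketch of the token rule, and the function name `url_tokens` and its handling of schemes, query strings, and trailing slashes are assumptions beyond what the log states.

```python
def url_tokens(url):
    """Return index tokens for a URL: the host, each path segment, and host/path."""
    # Drop a scheme such as "http://" if one is present.
    rest = url.split("://", 1)[-1]
    # Ignore any query string or fragment in this sketch.
    for sep in ("?", "#"):
        rest = rest.split(sep, 1)[0]
    rest = rest.rstrip("/")
    host, _, path = rest.partition("/")
    host = host.lower()
    tokens = [host]
    # One token per non-empty path segment.
    tokens.extend(seg for seg in path.split("/") if seg)
    if path:
        # The full host/path is also indexed so exact URLs still match.
        tokens.append(host + "/" + path)
    return tokens

print(url_tokens("example.com/foo"))  # ['example.com', 'foo', 'example.com/foo']
```
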
06:27 <seancribbs> sounds like a good plan
06:28 <RJ2> also it would be useful to be able to treat entries (log-lines) as a doubly
linked list i could walk in either direction
06:29 <RJ2> could be done using links, but it's a little fiddly since i have to either buffer
1 line before writing, or update the previous value to set the fwd link to the value i just entered
06:29 <RJ2> since log lines are being added sequentially in realtime
06:29 <RJ2> is there any built-in or other optimisation to walk the keys in insertion order
without using links?
06:30 <RJ2> (i could however just add a backwards link to each new entry,
pointing to the previous. that wouldn't require any trickery, nice to have fwd links too though)
06:31 <RJ2> one reason is if i do a search and find a specific key, i would like context -
5 lines above and below
06:35 <RJ2> i could also do the context by querying something like "5 values greater/less
than <date field from match> order by date desc/asc", but that's much more heavyweight
06:36 <seancribbs> RJ2: there's no built-in mechanism to walk that list, other than to follow
links, but you need to know how many to follow
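
The link-based approach discussed above can be sketched with an in-memory stand-in. Each new line stores a backward link, and the previous entry is updated with a forward link (the second write RJ2 calls fiddly). The `LogBucket` class and its methods are hypothetical names for illustration and do not correspond to any Riak client API.

```python
class LogBucket:
    """In-memory stand-in for one bucket of log lines (not a Riak API)."""

    def __init__(self):
        # key -> {"line": text, "prev": key or None, "next": key or None}
        self.entries = {}
        self.last_key = None

    def append(self, key, line):
        # The backward link is known at write time, so it costs nothing extra.
        self.entries[key] = {"line": line, "prev": self.last_key, "next": None}
        if self.last_key is not None:
            # The forward link needs a second write to the previous entry.
            self.entries[self.last_key]["next"] = key
        self.last_key = key

    def _walk(self, key, direction, n):
        # Follow at most n links in one direction, starting beside key.
        found = []
        cur = self.entries[key][direction]
        while cur is not None and len(found) < n:
            found.append(cur)
            cur = self.entries[cur][direction]
        return found

    def context(self, key, n=5):
        """Return up to n keys before key, key itself, and up to n keys after."""
        before = self._walk(key, "prev", n)
        before.reverse()
        return before + [key] + self._walk(key, "next", n)
```

As seancribbs notes, the walker has to know how many links to follow; here that is the `n` argument.
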
06:37 <seancribbs> have you also looked at sequential databases like RRD etc? Riak is
really optimized for random access
06:38 <RJ2> yes, kv riak isn't really suitable for what i want, but i think
riak search might be a candidate. i need text search, and i don't want the
hassle of managing sharding myself if possible
06:40 <RJ2> by far the most common query i'd need is "last N lines from a bucket", which i can have a redis
list caching if necessary; trying to get a feel for how well suited riak is for the rest (cache misses,
loading older data, text search)
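
The "last N lines" cache could behave roughly like this. The sketch uses a bounded `collections.deque` per bucket to mimic what a Redis list gives you with LPUSH followed by LTRIM, with reads playing the role of LRANGE; it does not talk to Redis. The `RecentLines` class, its method names, and the 1000-line cap are assumptions for illustration.

```python
from collections import deque

class RecentLines:
    """Per-bucket cache of the most recent lines, newest first."""

    def __init__(self, max_lines=1000):
        self.max_lines = max_lines
        self.buckets = {}  # bucket name -> deque of lines, newest at the left

    def push(self, bucket, line):
        lines = self.buckets.setdefault(bucket, deque(maxlen=self.max_lines))
        # With maxlen set, appendleft discards the oldest line from the right
        # once the deque is full (similar in effect to LPUSH then LTRIM).
        lines.appendleft(line)

    def last(self, bucket, n):
        """Return up to n newest lines, or None on a cache miss."""
        lines = self.buckets.get(bucket)
        if lines is None:
            return None  # caller would fall back to the slower store
        return list(lines)[:n]
```

A `None` result stands for the cache-miss path RJ2 mentions, where older data would be loaded from Riak instead.
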
06:42 <RJ2> fwiw, it's for irccloud.com - multi-user hosted irc bouncer with web front end
06:54 <seancribbs> RJ2: sounds like an interesting problem. let us know how it goes, and how else we can help
06:55 <RJ2> seancribbs: thanks, will do
07:22 <peschkaj> RJ2: you might want to talk to siculars about using Redis and Riak together.
he was planning a blog post about using Redis as an index on top of Riak, but I don't know how far along he is.
07:23 <RJ2> peschkaj: thanks
07:23 <siculars> peschkaj: ha . i'm working on a post now about pagination in riak
07:24 <siculars> the conclusion is that yes it can be done but i would probably use redis
for the nimble lifting instead
07:24 <siculars> it being pagination
07:24 <peschkaj> I look forward to reading it
07:26 <peschkaj> RJ2: I don't know what your use case is, but you might want to consider
tokenizing the top part of the domain (sub.domain.TLD) and reversing the parts (TLD.domain.sub).
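
peschkaj's suggestion amounts to reversing the dot-separated labels of the host name. A minimal sketch follows; the function name `reversed_domain` is hypothetical. The note on sorting is an assumption about why this is commonly done, not something stated in the log.

```python
def reversed_domain(host):
    """Reverse the dot-separated labels of a host name, lowercased."""
    return ".".join(reversed(host.lower().split(".")))

print(reversed_domain("sub.domain.TLD"))  # tld.domain.sub

# Assumption: reversed names that share a parent domain sort next to each
# other, which can make prefix or range lookups by domain easier.
hosts = ["blog.example.com", "example.org", "www.example.com"]
print(sorted(reversed_domain(h) for h in hosts))
```
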
07:27 <RJ2> peschkaj: mostly so people can search the log for mentions of an url
without knowing the full exact url, eg "there was a mention of some blog post on example.com yesterday.."
07:28 <peschkaj> ah, yeah, then you can ignore me
07:28 <RJ2> i might also emit some unique token whenever an url is encountered too,
so i can easily find all lines that contain an url
07:29 <RJ2> not sure if that's better than a dedicated field for it
07:31 <peschkaj> you could create a bucket that's an inverted index... the keys would be the domain
names and the value would be an array of line keys. if you have a lot of data that could be better than using links
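
The inverted-index bucket peschkaj describes maps each domain name to the list of line keys that mention it. Below is a sketch using a plain dict in place of a Riak bucket; `index_line`, `DOMAIN_RE`, and the deliberately naive domain regex are all assumptions for illustration.

```python
import re
from collections import defaultdict

# Naive pattern: one or more "label." groups followed by an alphabetic TLD.
# It will also match things like file names ("notes.txt"); fine for a sketch.
DOMAIN_RE = re.compile(r"\b((?:[a-z0-9-]+\.)+[a-z]{2,})\b", re.IGNORECASE)

def index_line(index, line_key, text):
    """Record line_key under every domain name found in text."""
    for domain in DOMAIN_RE.findall(text):
        keys = index[domain.lower()]
        if line_key not in keys:  # avoid duplicate keys for repeated mentions
            keys.append(line_key)

index = defaultdict(list)  # domain name -> list of line keys
index_line(index, "k1", "there was a post on http://example.com/foo")
index_line(index, "k2", "Example.com again, and blog.example.org")
print(index["example.com"])  # ['k1', 'k2']
```

In a real store each update would be a read-modify-write on the domain's value, so concurrent writers to the same domain would need some form of conflict handling.
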