PharkMillups/gist:776953

## gistfile1.txt
05:53 <RJ2> can anyone elaborate on how efficient doing a "last N results ordered by <datefield> desc"
for stuff in a given bucket is with riak search?

06:00 <seancribbs> RJ2: depends on how many results you have total, i believe

06:02 <RJ2> let's say each bucket commonly has 5-10k entries, maybe as many as 100k sometimes

06:03 <RJ2> and i want to do range queries on a date field, returning no more than 1000 entries at a time

06:03 <RJ2> and often just the most recent 1000 entries

06:03 <RJ2> seancribbs: does that seem sensible with riaksearch?

06:03 <seancribbs> sure

06:04 <RJ2> each bucket would be a logfile, so i want riaksearch for range queries,
and also text-search on the contents of the log-line

06:05 <RJ2> and there are lots of logfiles, so maybe a million+ buckets

06:05 <RJ2> altho most logs are small (<100 lines)

06:07 <RJ2> another thing i was wondering: what is the jvm/java used for exactly?
if i'm using a basic tokenizer written in erlang do i even need the java part for riak search?

06:24 <seancribbs> RJ2: sorry, trying to get started on Monday morning.

06:24 <seancribbs> One thing you'll want to do is to set the default schema…
make it use the whitespace analyzer

06:25 <RJ2> (np, monday afternoon here.. :)

06:25 <seancribbs> that way, if you have small codes/non-word stuff in your
log info, it'll still be searchable

06:25 <seancribbs> so… the JVM.

06:25 <seancribbs> in 0.14 that's going to be optional, but it was only used so you
could choose existing Lucene analyzers

06:26 <seancribbs> esp if you had an existing analyzer for your use-case

06:27 <RJ2> ok good, that's what i was hoping - i could start with whitespace analyzer,
would probably write my own simple one in erlang tho, since i want to index domains from full urls etc too

06:27 <RJ2> example.com/foo -> "example.com", "foo", "example.com/foo"

06:27 <seancribbs> sounds like a good plan
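
A minimal sketch of the tokenization RJ2 describes, written in Python for illustration rather than as an Erlang analyzer. The function name `tokenize` and the rule for spotting URL-like words are assumptions, not part of Riak Search.

```python
import re

def tokenize(line):
    """Whitespace-split a log line; for URL-like words also emit host and path segments."""
    tokens = []
    for word in line.split():
        tokens.append(word)
        # Strip an optional scheme such as http:// before looking at the shape.
        bare = re.sub(r"^[a-z][a-z0-9+.-]*://", "", word, flags=re.IGNORECASE)
        # Assumed heuristic: a word is URL-like if it has a dotted host and a path.
        if "/" in bare and "." in bare.split("/", 1)[0]:
            host, path = bare.split("/", 1)
            tokens.append(host)
            tokens.extend(seg for seg in path.split("/") if seg)
    return tokens

print(tokenize("see example.com/foo"))
```

For `example.com/foo` this emits the full URL, the host, and the path segment, so a search for any of the three would match the line.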

06:28 <RJ2> also it would be useful to be able to treat entries (log-lines) as a doubly
linked list i could walk in either direction

06:29 <RJ2> could be done using links, but it's a little fiddly since i have to either buffer
1 line before writing, or update the previous value to set the fwd link to the value i just entered

06:29 <RJ2> since log lines are being added sequentially in realtime

06:29 <RJ2> is there any built-in or other optimisation to walk the keys in insertion order
without using links?

06:30 <RJ2> (i could however just add a backwards link to each new entry,
pointing to the previous. that wouldn't require any trickery, nice to have fwd links too though)

06:31 <RJ2> one reason is if i do a search and find a specific key, i would like context -
5 lines above and below


06:35 <RJ2> i could also do the context by querying something like "5 values greater/less
than <date field from match> order by date desc/asc", but that's much more heavyweight

06:36 <seancribbs> RJ2: there's no built-in mechanism to walk that list, other than to follow
links, but you need to know how many to follow
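
The linked-list idea discussed above might look roughly like this. It is a sketch only: a plain dict stands in for a bucket, `prev`/`next` fields stand in for Riak links, and all function names are hypothetical. It uses the "update the previous value" option RJ2 mentions, which costs a second write per appended line.

```python
store = {}  # stand-in for a bucket: key -> {"text", "prev", "next"}

def append_line(key, text, last_key):
    """Store a new line with a back link, then set the forward link on the previous line."""
    store[key] = {"text": text, "prev": last_key, "next": None}
    if last_key is not None:
        store[last_key]["next"] = key  # the extra write to the previous entry
    return key

def walk(key, direction, n):
    """Follow up to n links from key in the given direction ("prev" or "next")."""
    out = []
    cur = store[key][direction]
    while cur is not None and len(out) < n:
        out.append(cur)
        cur = store[cur][direction]
    return out

def context(key, n=5):
    """Return the keys of up to n lines before and after key, in log order."""
    return walk(key, "prev", n)[::-1] + [key] + walk(key, "next", n)

last = None
for i in range(10):
    last = append_line(f"line{i}", f"text {i}", last)
print(context("line5", n=2))
```

As seancribbs notes, the caller has to say how many links to follow; here that is the `n` argument.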

06:37 <seancribbs> have you also looked at sequential databases like RRD etc? Riak is
really optimized for random access

06:38 <RJ2> yes, kv riak isn't really suitable for what i want, but i think
riaksearch might be a candidate. i need text search, and i don't want the
hassle of managing sharding myself if possible

06:40 <RJ2> by far the most common query i'd need is "last N lines from a bucket", which i can have a redis
list caching if necessary; trying to get a feel for how well suited riak is for the rest (cache misses,
loading older data, text search)
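
A rough sketch of the "last N lines" cache RJ2 has in mind. To keep it self-contained, a bounded `collections.deque` stands in for a capped Redis list, and `load_older` is a hypothetical callback representing the slower query against Riak; neither the class nor the callback comes from the conversation.

```python
from collections import deque

class RecentLines:
    """Keeps the newest max_len lines per bucket; falls back to a slower store on a miss."""

    def __init__(self, max_len=1000):
        self.max_len = max_len
        self.cache = {}  # bucket name -> deque of its most recent lines

    def push(self, bucket, line):
        # deque(maxlen=...) silently drops the oldest line once the cap is reached.
        self.cache.setdefault(bucket, deque(maxlen=self.max_len)).append(line)

    def last_n(self, bucket, n, load_older):
        if n <= 0:
            return []
        cached = list(self.cache.get(bucket, ()))
        if n <= len(cached):
            return cached[-n:]       # cache hit: served from memory
        return load_older(bucket, n)  # cache miss: ask the backing store

recent = RecentLines(max_len=3)
for i in range(5):
    recent.push("#riak", f"line {i}")
print(recent.last_n("#riak", 2, load_older=lambda bucket, n: []))
```

The common query stays cheap, and only requests reaching further back than the cap hit the backing store.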

06:42 <RJ2> fwiw, it's for irccloud.com - multi-user hosted irc bouncer with web front end

06:54 <seancribbs> RJ2: sounds like an interesting problem. let us know how it goes, and how else we can help

06:55 <RJ2> seancribbs: thanks, will do

07:22 <peschkaj> RJ2: you might want to talk to siculars about using Redis and Riak together.
he was planning a blog post about using Redis as an index on top of Riak, but I don't know how far along he is.

07:23 <RJ2> peschkaj: thanks
07:23 <siculars> peschkaj: ha . i'm working on a post now about pagination in riak

07:24 <siculars> the conclusion is that yes it can be done but i would probably use redis
for the nimble lifting instead

07:24 <siculars> it being pagination

07:24 <peschkaj> I look forward to reading it

07:26 <peschkaj> RJ2: I don't know what your use case is, but you might want to consider
tokenizing the top part of the domain (sub.domain.TLD) and reversing the parts (TLD.domain.sub).
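
For illustration, the reversal peschkaj suggests can be sketched as below; `reverse_domain` is a hypothetical helper name. One likely benefit (an assumption, not stated in the chat) is that reversed names sharing a parent domain sort next to each other and share a prefix.

```python
def reverse_domain(host):
    """Reverse the dot-separated labels of a host name."""
    return ".".join(reversed(host.split(".")))

hosts = ["blog.example.com", "example.org", "www.example.com"]
print(reverse_domain("sub.domain.TLD"))
print(sorted(reverse_domain(h) for h in hosts))
```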

07:27 <RJ2> peschkaj: mostly so people can search the log for mentions of an url
without knowing the full exact url, eg "there was a mention of some blog post on example.com yesterday.."

07:28 <peschkaj> ah, yeah, then you can ignore me

07:28 <RJ2> i might also emit some unique token whenever an url is encountered too,
so i can easily find all lines that contain an url

07:29 <RJ2> not sure if that's better than a dedicated field for it

07:31 <peschkaj> you could create a bucket that's an inverted index... the keys would be the domain
names and the value would be an array of line keys. if you have a lot of data that could be better than using links
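
A minimal sketch of the inverted-index bucket peschkaj describes, with a dict standing in for the bucket. The helper names and the rule for recognising a URL (a dotted host followed by a path) are assumptions made for the example; bare domains without a path are deliberately not detected.

```python
from collections import defaultdict

index = defaultdict(list)  # stand-in for the index bucket: domain -> list of line keys

def domain_of(word):
    """Return the lowercased host of a URL-like word, or None if it does not look like a URL."""
    bare = word.split("://", 1)[-1]          # drop a scheme if present
    host = bare.split("/", 1)[0].lower()
    labels = host.split(".")
    if "/" in bare and len(labels) >= 2 and all(labels):
        return host
    return None

def index_line(line_key, text):
    """Record line_key under every domain mentioned in the line."""
    for word in text.split():
        host = domain_of(word)
        if host and line_key not in index[host]:
            index[host].append(line_key)

index_line("k1", "see example.com/foo")
index_line("k2", "http://example.com/bar and example.org/x")
index_line("k3", "no urls here..")
print(dict(index))
```

Looking up a domain is then a single key fetch that returns the keys of the matching lines.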