PharkMillups/gist:866910

## gistfile1.txt
17:51 <acts_as> rustyk: Can I steal some of your time, re:
riak-search? Can I set up exact phrase searching? I have a few requirements,
but that's one of the biggest. The other is stemming.

17:53 <rustyk> acts_as: yes, riak supports exact phrase
searching, just put a series of terms in double quotes.
It also supports the proximity operator, i.e.: "brown dog"~10
will find the words brown and dog within 10 words of each other

17:53 <rustyk> acts_as: stemming isn't currently supported.
 you could fake it using a custom analyzer, though
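
A rough idea of what "faking" stemming inside an analyzer could look like, sketched in Python rather than Erlang purely for illustration. The suffix list and the names `naive_stem` and `analyze` are hypothetical and are not part of Riak Search; a real stemmer such as Porter's handles far more cases. The sketch emits exactly one token per word, in the original order.

```python
import re

# Hypothetical suffix list, for illustration only.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip one common suffix, keeping at least three characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase, split on non-alphanumerics, and stem each word in order."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(w) for w in words]
```

The same function would have to be applied to both indexed text and queries for the stems to line up.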

17:54 <acts_as> rustyk: Also wondering how riak-search handles
memory? Can I kick a bucket's index out of memory? For each crawl
iteration, I only need the most recent crawl to be readily available.
 Other crawls should just be available.

17:56 <acts_as> rustyk: I think I put this note in the analyzer page,
on the wiki. To see "apps/qilr/src/text_analyzers.erl". Are there any
other files involved? How would I deploy it? Also, can I
have multiple analyzers on a field?

17:56 <rustyk> acts_as: riak-search doesn't actually do any
caching, it basically relies on the OS to cache recent disk accesses

17:57 <acts_as> rustyk: well, I guess I should ask--is the
entire solr index kept in active memory?

17:57 <rustyk> acts_as: custom erlang analyzers are easy…
just create the module and make sure it's available in the erlang
code path on every machine. then set the schema to {erlang, Module, Function}

17:58 <rustyk> acts_as: er… typo… I meant to say that you set the 'analyzer_factory' for the field to {erlang, Module, Function}

17:58 <acts_as> gotcha. Can I have two analyzers hitting a field?

17:59 <rustyk> acts_as: no, the index is stored on disk. also,
no, you can't have two analyzers hitting the same field. you could
 make a custom analyzer, and run the field through both of the
analyzers, but that might cause problems with phrase searching…
 basically phrase searching relies on the order in which the
analyzer returns tokens

18:00 <rustyk> acts_as: so if your analyzer returns multiple
tokens for a single word, or returns the tokens out of order,
it would break phrase searching
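
Why token order and count matter can be seen in a generic positional-index model. This is a simplified Python sketch, not Riak Search's actual implementation; the function names are hypothetical, and the "within N positions" reading of the proximity operator is an assumption.

```python
def index_positions(tokens):
    """Map each token to the list of positions at which it occurs."""
    positions = {}
    for pos, tok in enumerate(tokens):
        positions.setdefault(tok, []).append(pos)
    return positions


def phrase_match(positions, phrase):
    """True if the phrase tokens occur at consecutive positions."""
    if not phrase:
        return False
    return any(
        all(start + i in positions.get(tok, []) for i, tok in enumerate(phrase))
        for start in positions.get(phrase[0], [])
    )


def proximity_match(positions, a, b, distance):
    """True if some occurrence of a is within `distance` positions of b."""
    return any(
        abs(pa - pb) <= distance
        for pa in positions.get(a, [])
        for pb in positions.get(b, [])
    )
```

With one token per word, `["run", "dog"]` sits at positions 0 and 1 and the phrase matches. If an analyzer emitted both the word and its stem, `["running", "run", "dogs", "dog"]`, then "run" and "dog" land at positions 1 and 3, so the same phrase no longer matches even though a proximity query still would.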

18:02 <acts_as> Hmm, ok. I'm trying to think of where that
would be an issue... Other than modifying the analyzer later on?

18:03 <rustyk> it would only really be an issue if you are doing proximity searches or exact phrase searches

18:05 <acts_as> rustyk: ok, one more question and I'll try to shut up. Another thing I use heavily is protected words. Is it feasible / moronic to maintain those in a bucket/key? i.e., stemming "illegal" and "illegible" is an issue, so I need to add "illegible" to a list of words not to stem.

18:07 <acts_as> thank you, btw

18:07 <rustyk> acts_as: I'd recommend that you store the list in
an object (ie: "meta/protected_words"), but then have your custom
 analyzer cache the result for a certain time period… otherwise
you'll incur a KV lookup for each field in the object you are
 indexing, which could slow things down substantially
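
The caching idea could look roughly like this: fetch the list once, then reuse it until a time-to-live expires. This is a Python sketch for illustration only; `CachedWordList` and the `fetch` callable are hypothetical stand-ins for whatever reads the "meta/protected_words" object, and the 60-second default is an arbitrary assumption.

```python
import time


class CachedWordList:
    """Cache the result of an expensive lookup for a fixed time period."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch          # hypothetical callable doing the KV lookup
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested
        self._words = frozenset()
        self._expires_at = None      # None means nothing has been fetched yet

    def get(self):
        """Return the cached words, refetching only after the TTL has passed."""
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._words = frozenset(self._fetch())
            self._expires_at = now + self._ttl
        return self._words
```

An analyzer would call `get()` once per field and skip stemming for any word in the returned set, so the lookup cost is paid at most once per TTL window instead of once per field.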

18:07 <rustyk> acts_as: no problem, happy to help?

18:07 <rustyk> acts_as: not sure why I ended that with a question mark

18:07 <acts_as> yeah, caching was the big question I would've had.

18:07 <acts_as> how would I update the object?

18:08 <acts_as> I suppose that's more of an operational thing,
 not an application requirement

18:09 <acts_as> I suppose that's partly me learning erlang which
should come next.

18:10 <rustyk> that's a good question… not really sure. if you
need to update the object frequently, then (counterintuitively)
it might be best if you hard-code the list of reserved words, and
then you can just do a c:nl(ModuleName) to reload the updated module
across all machines in the cluster. That's the benefit of Erlang
hot code loading.

18:12 <acts_as> rustyk: and I'm sure it'd be a lot more
performant. cool, thanks

18:12 <rustyk> acts_as: any time :)