Created
March 12, 2011 01:23
17:51 <acts_as> rustyk: Can I steal some of your time, re:
riak-search? Can I set up exact phrase searching? I have a few requirements,
but that's one of the biggest. The other is stemming.
17:53 <rustyk> acts_as: yes, riak supports exact phrase
searching, just put a series of terms in double quotes.
It also supports the proximity operator, ie: "brown dog"~10
will find the words brown and dog within 10 words of each other.
17:53 <rustyk> acts_as: stemming isn't currently supported.
you could fake it using a custom analyzer, though
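
To illustrate what "faking" stemming in a custom analyzer might look like, here is a minimal Python sketch. A real riak-search analyzer would be an Erlang module; the suffix list and the function names below are hypothetical, and this crude suffix stripping is not a real stemming algorithm. It emits exactly one token per input word, in the original order.

```python
import re

# Hypothetical, deliberately tiny suffix list; a real stemmer has far more rules.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stemming_analyzer(text):
    """Lowercase, split on non-alphanumerics, and stem each word in order."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(w) for w in words]
```

Because the output has the same length and order as the input words, token positions are unchanged by the stemming step.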
17:54 <acts_as> rustyk: Also wondering how riak-search handles
memory? Can I kick a bucket's index out of memory? For each crawl
iteration, I only need the most recent crawl to be readily available.
Other crawls should just be available.
17:56 <acts_as> rustyk: I think I put this note in the analyzer page,
on the wiki. To see "apps/qilr/src/text_analyzers.erl". Are there any
other files involved? How would I deploy it? Also, can I
have multiple analyzers on a field?
17:56 <rustyk> acts_as: riak-search doesn't actually do any
caching, it basically relies on the OS to cache recent disk accesses
17:57 <acts_as> rustyk: well, I guess I should ask--is the
entire solr index kept in active memory?
17:57 <rustyk> acts_as: custom erlang analyzers are easy…
just create the module and make sure it's available in the erlang
code path on every machine. then set the schema to {erlang, Module, Function}
17:58 <rustyk> acts_as: er… typo… I meant to say that you set the 'analyzer_factory' for the field to {erlang, Module, Function}
17:58 <acts_as> gotcha. Can I have two analyzers hitting a field?
17:59 <rustyk> acts_as: no, the index is stored on disk. also,
no, you can't have two analyzers hitting the same field. you could
make a custom analyzer, and run the field through both of the
analyzers, but that might cause problems with phrase searching…
basically phrase searching relies on the order in which the
analyzer returns tokens
18:00 <rustyk> acts_as: so if your analyzer returns multiple
tokens for a single word, or returns the tokens out of order,
it would break phrase searching
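
The dependence on token order can be sketched with a toy positional index in Python. This is an illustration of the general idea only, not riak-search's actual implementation; all function names and the "_alt" tokens are hypothetical.

```python
def build_positions(tokens):
    """Map each token to the list of positions at which the analyzer emitted it."""
    positions = {}
    for i, tok in enumerate(tokens):
        positions.setdefault(tok, []).append(i)
    return positions


def phrase_match(positions, phrase):
    """True if the (non-empty) phrase tokens occur at consecutive positions."""
    starts = positions.get(phrase[0], [])
    return any(
        all(start + k in positions.get(tok, []) for k, tok in enumerate(phrase))
        for start in starts
    )


def within(positions, a, b, distance):
    """True if tokens a and b occur at most `distance` positions apart."""
    return any(
        abs(i - j) <= distance
        for i in positions.get(a, [])
        for j in positions.get(b, [])
    )


# One token per word: "brown" and "dog" are adjacent, so the phrase matches.
one_per_word = build_positions(["quick", "brown", "dog"])

# Two tokens per word: "brown" and "dog" are no longer adjacent.
two_per_word = build_positions(
    ["quick", "quick_alt", "brown", "brown_alt", "dog", "dog_alt"]
)
```

With these inputs, `phrase_match(one_per_word, ["brown", "dog"])` succeeds while `phrase_match(two_per_word, ["brown", "dog"])` fails, because the extra tokens shift every position.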
18:02 <acts_as> Hmm, ok. I'm trying to think of where that
would be an issue... Other than modifying the analyzer later on?
18:03 <rustyk> it would only really be an issue if you are doing proximity searches or exact phrase searches
18:05 <acts_as> rustyk: ok, one more question and I'll try to shut up. Another thing I use heavily are protected words. Is it feasible / moronic to maintain those in a bucket/key? ie, stemming "illegal" and "illegible" is an issue, so I need to add "illegible" to a list of words not to stem.
18:07 <acts_as> thank you, btw
18:07 <rustyk> acts_as: I'd recommend that you store the list in
an object (ie: "meta/protected_words"), but then have your custom
analyzer cache the result for a certain time period… otherwise
you'll incur a KV lookup for each field in the object you are
indexing, which could slow things down substantially
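
The caching suggestion can be sketched as a small time-based cache in Python. This is only an illustration under stated assumptions: `CachedWordList` and the `fetch` callable are hypothetical, `fetch` stands in for whatever KV read returns the stored word list, and a real analyzer would implement the same idea in Erlang.

```python
import time


class CachedWordList:
    """Hold a word list in memory and refetch it only after a time-to-live expires."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch          # hypothetical callable that performs the KV lookup
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the expiry logic can be tested
        self._words = None
        self._expires_at = 0.0

    def get(self):
        """Return the cached set, calling fetch only on first use or after expiry."""
        now = self._clock()
        if self._words is None or now >= self._expires_at:
            self._words = frozenset(self._fetch())
            self._expires_at = now + self._ttl
        return self._words
```

An analyzer would call `get()` once per field and check membership before stemming a word, so the KV lookup happens at most once per TTL window rather than once per field. The trade-off is that updates to the stored list take up to one TTL to become visible.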
18:07 <rustyk> acts_as: no problem, happy to help?
18:07 <rustyk> acts_as: not sure why I ended that with a question mark
18:07 <acts_as> yeah, caching was the big question I would've had.
18:07 <acts_as> how would I update the object?
18:08 <acts_as> I suppose that's more of an operational thing,
not an application requirement
18:09 <acts_as> I suppose that's partly me learning erlang which
should come next.
18:10 <rustyk> that's a good question… not really sure. if you
need to update the object frequently, then (counterintuitively)
it might be best if you hard-code the list of reserved words, and
then you can just do a c:nl(ModuleName) to reload the updated module
across all machines in the cluster. That's the benefit of Erlang
hot code loading.
18:12 <acts_as> rustyk: and I'm sure it'd be a lot more
performant. cool, thanks
18:12 <rustyk> acts_as: any time :)