@PharkMillups
Created March 12, 2011 01:23
17:51 <acts_as> rustyk: Can I steal some of your time, re:
riak-search? Can I set up exact phrase searching? I have a few requirements,
but that's one of the biggest. The other is stemming.
17:53 <rustyk> acts_as: yes, riak supports exact phrase
searching, just put a series of terms in double quotes.
It also supports the proximity operator, ie: "brown dog"~10
will find the words brown and dog within 10 words of each other
17:53 <rustyk> acts_as: stemming isn't currently supported.
you could fake it using a custom analyzer, though
17:54 <acts_as> rustyk: Also wondering how riak-search handles
memory? Can I kick a bucket's index out of memory? For each crawl
iteration, I only need the most recent crawl to be readily available.
Other crawls should just be available.
17:56 <acts_as> rustyk: I think I put this note in the analyzer page,
on the wiki. To see "apps/qilr/src/text_analyzers.erl". Are there any
other files involved? How would I deploy it? Also, can I
have multiple analyzers on a field?
17:56 <rustyk> acts_as: riak-search doesn't actually do any
caching, it basically relies on the OS to cache recent disk accesses
17:57 <acts_as> rustyk: well, I guess I should ask--is the
entire solr index kept in active memory?
17:57 <rustyk> acts_as: custom erlang analyzers are easy…
just create the module and make sure it's available in the erlang
code path on every machine. then set the schema to {erlang, Module, Function}
17:58 <rustyk> acts_as: er… typo… I meant to say that you set the 'analyzer_factory' for the field to {erlang, Module, Function}
17:58 <acts_as> gotcha. Can I have two analyzers hitting a field?
17:59 <rustyk> acts_as: no, the index is stored on disk. also,
no, you can't have two analyzers hitting the same field. you could
make a custom analyzer, and run the field through both of the
analyzers, but that might cause problems with phrase searching…
basically phrase searching relies on the order in which the
analyzer returns tokens
18:00 <rustyk> acts_as: so if your analyzer returns multiple
tokens for a single word, or returns the tokens out of order,
it would break phrase searching
18:02 <acts_as> Hmm, ok. I'm trying to think of where that
would be an issue... Other than modifying the analyzer later on?
18:03 <rustyk> it would only really be an issue if you are doing proximity searches or exact phrase searches
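The token-order dependence rustyk describes can be illustrated with a toy positional index. This is a minimal sketch in Python, not Riak Search's actual implementation: the function names, the 4-character "stem", and the assumption that query terms are plain words are all hypothetical. It only shows why an analyzer that emits extra tokens per word shifts positions and breaks adjacency checks.

```python
from collections import defaultdict

def plain_analyzer(text):
    """One token per word, in document order."""
    return text.lower().split()

def expanding_analyzer(text):
    """Hypothetical analyzer: emits each word plus a crude 4-character 'stem'."""
    tokens = []
    for word in text.lower().split():
        tokens.append(word)
        tokens.append(word[:4])
    return tokens

def build_positions(tokens):
    """Map each token to the list of positions at which the analyzer emitted it."""
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)
    return positions

def phrase_match(positions, terms):
    """True if the terms occur at consecutive positions (toy exact-phrase check)."""
    return any(
        all(start + k in positions.get(term, []) for k, term in enumerate(terms))
        for start in positions.get(terms[0], [])
    )

def within(positions, a, b, max_distance):
    """True if some occurrence of a and b lies within max_distance positions (toy proximity check)."""
    return any(
        abs(i - j) <= max_distance
        for i in positions.get(a, [])
        for j in positions.get(b, [])
    )

doc = "the quick brown dog"
plain = build_positions(plain_analyzer(doc))
expanded = build_positions(expanding_analyzer(doc))

print(phrase_match(plain, ["brown", "dog"]))     # adjacent in the plain index
print(phrase_match(expanded, ["brown", "dog"]))  # the extra 'brow' token sits between them
```

In this model the expanded index still contains both words, so ordinary term queries keep working; only checks that depend on position (phrase and proximity) are affected, which matches the caveat above.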
18:05 <acts_as> rustyk: ok, one more question and I'll try to shut up. Another thing I use heavily are protected words. Is it feasible / moronic to maintain those in a bucket/key? ie, stemming "illegal" and "illegible" is an issue, so I need to add "illegible" to a list of words not to stem.
18:07 <acts_as> thank you, btw
18:07 <rustyk> acts_as: I'd recommend that you store the list in
an object (ie: "meta/protected_words"), but then have your custom
analyzer cache the result for a certain time period… otherwise
you'll incur a KV lookup for each field in the object you are
indexing, which could slow things down substantially
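The caching idea can be sketched as follows. This is a hedged illustration in Python rather than an Erlang analyzer: `CachedWordList`, `analyze`, and the `fetch` callable are hypothetical names, and `fetch` merely stands in for whatever KV read would load the "meta/protected_words" object. The point is that the lookup happens at most once per time-to-live window instead of once per indexed field.

```python
import time

class CachedWordList:
    """Caches the result of fetch() and refreshes it only after ttl_seconds."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch          # hypothetical stand-in for the KV lookup
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._words = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._words is None or now >= self._expires_at:
            # Cache miss or expired entry: pay for one lookup, then reuse it.
            self._words = frozenset(self._fetch())
            self._expires_at = now + self._ttl
        return self._words

def analyze(text, protected, stem):
    """Stem every word except those in the protected set, preserving order."""
    return [w if w in protected else stem(w) for w in text.lower().split()]

# Usage with a fake lookup and a deliberately crude prefix 'stemmer'.
cache = CachedWordList(lambda: ["illegible"], ttl_seconds=300.0)
print(analyze("illegal illegible", cache.get(), lambda w: w[:5]))
```

The trade-off is staleness: an update to the stored list is not seen until the cached copy expires, so the time-to-live bounds how long analyzers may use an old list.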
18:07 <rustyk> acts_as: no problem, happy to help?
18:07 <rustyk> acts_as: not sure why I ended that with a question mark
18:07 <acts_as> yeah, caching was the big question I would've had.
18:07 <acts_as> how would I update the object?
18:08 <acts_as> I suppose that's more of an operational thing,
not an application requirement
18:09 <acts_as> I suppose that's partly me learning erlang which
should come next.
18:10 <rustyk> that's a good question… not really sure. if you
need to update the object frequently, then (counterintuitively)
it might be best if you hard code the list of reserved words, and
then you can just do a c:nl(ModuleName) to reload the updated module
across all machines in the cluster. That's the benefit of Erlang
hot code loading.
18:12 <acts_as> rustyk: and I'm sure it'd be a lot more
performant. cool, thanks
18:12 <rustyk> acts_as: any time :)