@smerchek
Last active December 17, 2015 02:39
Strangeloop proposals

##Breaking Down the Lucene Analysis Process

The Lucene analysis process is very powerful, but most of us only know enough of the basics to put together a simple analyzer chain. Search isn't always plug-and-play, and the ability to manipulate and compose tokenizers and token filters will be the differentiator in developing your search product.

Using visualizations of the analysis chain, I will break down the Lucene analysis process into its most basic parts: char filters, tokenizers, and token filters. I'll show how differences in the composition of the token filters affect the final output. We'll also see that tokens are more than just a stream: with synonyms and generated word parts, they can form a token graph.
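To make the chain described above concrete, here is a minimal sketch in plain Python. It is a toy model under stated assumptions, not Lucene's API: the function names, the stop-word set, and the `ct` to `tomography` synonym are all hypothetical. It shows a char filter, a tokenizer, and token filters composed in two different orders, and it models a token graph by letting a synonym share a position with the original term.

```python
import re
from typing import Callable, List, Tuple

# A token is (term, position). Two tokens that share a position are
# alternatives, which is what turns a flat stream into a graph.
Token = Tuple[str, int]

def strip_html(text: str) -> str:
    """Char filter: rewrites the raw text before tokenization."""
    return re.sub(r"<[^>]+>", " ", text)

def whitespace_tokenizer(text: str) -> List[Token]:
    """Tokenizer: splits the text into terms and assigns positions."""
    return [(term, pos) for pos, term in enumerate(text.split())]

def lowercase(tokens: List[Token]) -> List[Token]:
    """Token filter: lowercases every term."""
    return [(term.lower(), pos) for term, pos in tokens]

STOP_WORDS = {"the", "of"}  # hypothetical stop list

def stop(tokens: List[Token]) -> List[Token]:
    """Token filter: drops stop words (case-sensitive match).
    Surviving tokens keep their original positions."""
    return [(term, pos) for term, pos in tokens if term not in STOP_WORDS]

SYNONYMS = {"ct": ["tomography"]}  # hypothetical synonym map

def synonyms(tokens: List[Token]) -> List[Token]:
    """Token filter: emits each synonym at the same position as its source."""
    out: List[Token] = []
    for term, pos in tokens:
        out.append((term, pos))
        out.extend((alt, pos) for alt in SYNONYMS.get(term, []))
    return out

def analyze(
    text: str,
    char_filters: List[Callable[[str], str]],
    tokenizer: Callable[[str], List[Token]],
    token_filters: List[Callable[[List[Token]], List[Token]]],
) -> List[Token]:
    """Runs char filters, then the tokenizer, then token filters in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

text = "<b>The</b> CT of the chest"

# Lowercasing first lets the stop filter catch the capitalized "The".
lower_then_stop = analyze(
    text, [strip_html], whitespace_tokenizer, [lowercase, stop, synonyms]
)

# Stopping first misses "The", so it survives and is lowercased afterwards.
stop_then_lower = analyze(
    text, [strip_html], whitespace_tokenizer, [stop, lowercase, synonyms]
)

print(lower_then_stop)  # [('ct', 1), ('tomography', 1), ('chest', 4)]
print(stop_then_lower)  # [('the', 0), ('ct', 1), ('tomography', 1), ('chest', 4)]
```

The same three token filters yield different output depending only on their order. In both runs `ct` and `tomography` share position 1, which is the simplest form of the token graph mentioned above.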

##Reviewer Comments

I've been working directly with Lucene for the past year, implementing Softek's proprietary ranking algorithm for searching radiology documents. In the process, I've submitted patches to, or extended, core Lucene and Solr code. I've implemented our own query parser extension and token filters, with a focus on payload support. I recently gave a two-hour presentation on advanced Lucene and Solr concepts at KCDC. In that talk, I focused on the indexing and analysis process, as well as the querying process. This proposal is based largely on the analysis portion of the KCDC talk, reduced to fit the 40-minute time window.

@dynajoe

dynajoe commented May 8, 2013

indexing and analysis process, as well as the querying process -> indexing, analyzing, and querying process.

@dynajoe

dynajoe commented May 8, 2013

What are the most basic parts of the analysis process? Is it just tokenizing and token filters? Maybe you should list them.

-- good idea. I've done that.

@smerchek
Author

smerchek commented May 8, 2013

I don't think I can tackle the indexing, analysis, and querying processes in just 40 minutes, at least not to the depth I'd like for the indexing process.

As for proprietary search, I think it's okay to convey it, but I'm not sure if it's necessary. I mostly wanted to convey the use of payloads and custom analysis.

@dynajoe

dynajoe commented May 8, 2013

Product is misspelled as (procuct)
