@kenton
Created April 9, 2013 21:57
<!-- current implementation -->
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<!--<filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>-->
</analyzer>
</fieldType>
<!-- proposed updated implementation -->
<fieldType name="text_ws" class="solr.TextField">
<analyzer>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<!-- notes / thoughts on our current solr schema
StandardTokenizerFactory - I'd keep StandardTokenizer rather than switch to WhitespaceTokenizer,
because StandardTokenizer splits text on whitespace *and* on the
word boundary rules specified by Unicode, so punctuation is separated from words.
WhitespaceTokenizer splits on whitespace only, so a token like "solr," keeps its trailing comma.
StandardFilterFactory - not sure why we have this one in our schema. According to the docs, its useful
behavior was pre-Solr 3.1. We're running 3.6 on our staging machines, presumably the same version in production.
Either way, it's unlikely that prod is running < 3.1, so this could be updated to ClassicFilterFactory.
ClassicFilterFactory should be kept around also. It strips periods from acronyms
and "'s" from possessives. Note that it acts on the token types emitted by ClassicTokenizer,
so it may have no effect when it follows StandardTokenizer.
LowerCaseFilterFactory - makes sense to keep in either implementation
PorterStemFilterFactory - we need some sort of stemmer in the mix. There are a few to choose from,
but I can't find much that gives a good technical rationale for choosing one over another.
ASCIIFoldingFilterFactory - I'd keep ASCIIFoldingFilter rather than use MappingCharFilter w/ISOLatin1Accent.
The ISOLatin1Accent file is just a mapping of ISO Latin-1 accented characters to ASCII. That
is probably sufficient, but it may not be: characters outside that mapping could
slip through the cracks unmapped. The Solr 3 book from Packt Publishing
mentions that MappingCharFilter can be used w/FoldToASCII also, but recommends using
ASCIIFoldingFilterFactory instead as it should be faster.
-->
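<!-- sketch only: one way the notes above could be combined into a single field type.
The name "text_folded" is hypothetical. Dropping StandardFilterFactory and running
ASCIIFoldingFilterFactory before the stemmer (so the stemmer sees folded text) are
assumptions, not what the current schema does. Not tested against our index. -->
<fieldType name="text_folded" class="solr.TextField" omitNorms="false">
<analyzer>
<!-- split on whitespace and Unicode word boundaries -->
<tokenizer class="solr.StandardTokenizerFactory"/>
<!-- normalize case before folding and stemming -->
<filter class="solr.LowerCaseFilterFactory"/>
<!-- fold accented and other non-ASCII characters to ASCII equivalents -->
<filter class="solr.ASCIIFoldingFilterFactory"/>
<!-- stemmer choice is still open; Porter is carried over from the current implementation -->
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>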