Created
April 9, 2013 21:57
<!-- current implementation -->
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!--<filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>-->
  </analyzer>
</fieldType>
<!-- proposed updated implementation -->
<fieldType name="text_ws" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<!-- notes / thoughts on our current solr schema

StandardTokenizerFactory - I'd keep the StandardTokenizer over the WhitespaceTokenizer
  because the StandardTokenizer splits text on whitespace *and* on the
  word boundary rules specified by Unicode. WhitespaceTokenizer splits
  on whitespace only, so "foo-bar," stays a single token, punctuation included.

StandardFilterFactory - not sure why we have this one in our schema. According to the docs, the
  behavior it had before Solr 3.1 now lives in ClassicFilterFactory. We're running 3.6 on our
  staging machines, presumably the same version in production. Either way, it's unlikely that
  prod is running < 3.1, so this could be updated to ClassicFilterFactory.
  ClassicFilterFactory should be kept around as well. It strips periods from acronyms
  and 's from the end of possessives (per the docs it expects ClassicTokenizer output, so we
  should check that it still does anything behind the StandardTokenizer).

LowerCaseFilterFactory - having this makes sense in either case.

PorterStemFilterFactory - we need some sort of stemmer in the mix. There are a few to choose from.
  I can't find much that gives a good technical rationale for choosing one over another.

ASCIIFoldingFilterFactory - I'd keep the ASCIIFoldingFilter over the MappingCharFilter w/ISOLatin1Accent.
  ISOLatin1Accent is just a mapping of ISO Latin1 characters to ASCII. That is
  probably sufficient, but it may not be: characters outside Latin1 that we need
  mapped to ASCII would slip through the cracks. The Solr 3 book from Packt Publishing
  mentions that MappingCharFilter can also be used w/FoldToASCII, but recommends
  ASCIIFoldingFilterFactory instead as it should be faster.
-->
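<!-- A minimal sketch of the field type these notes point toward. This is an assumption drawn
     from the notes, not a tested config: the name "text_revised" is hypothetical, and the filter
     order simply follows the current implementation. ClassicFilterFactory is documented as
     working on ClassicTokenizerFactory output, so its effect behind StandardTokenizerFactory
     should be verified before relying on it. -->
<fieldType name="text_revised" class="solr.TextField" omitNorms="false">
  <analyzer>
    <!-- splits on whitespace and on Unicode word boundary rules -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- stands in for StandardFilterFactory, per the notes -->
    <filter class="solr.ClassicFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- stemmer choice is still open; Porter is kept from the current schema -->
    <filter class="solr.PorterStemFilterFactory"/>
    <!-- folds accented and other non-ASCII characters to ASCII equivalents -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>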