@joshdcollins
Last active November 27, 2015 18:29
SOLR Query
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="_root_" type="string" indexed="true" stored="false"/>
<field name="_text_" type="text_autophrase" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
<field name="content" type="text_autophrase" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
<field name="entity_name" type="text_autophrase" indexed="true" stored="true" multiValued="false" termVectors="true" termPositions="true" termOffsets="true"/>
<field name="entity_type" type="text_general" indexed="true" stored="true" multiValued="false" termVectors="true" termPositions="true" termOffsets="true"/>
<field name="entity_date" type="tdate" indexed="true" stored="true" multiValued="false"/>
<field name="entity_sort" type="int" indexed="true" stored="true" multiValued="false"/>
<field name="entity_author" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="content_exact" type="text_autophrase_exact" indexed="true" stored="false" multiValued="true"/>
<field name="entity_name_exact" type="text_autophrase_exact" indexed="true" stored="false" multiValued="false"/>
<copyField source="*" dest="_text_"/>
<copyField source="entity_name" dest="entity_name_exact"/>
<copyField source="content" dest="content_exact"/>
<fieldType name="text_autophrase" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="org.apache.lucene.analysis.autophrase.AutoPhrasingTokenFilterFactory" phrases="autophrases.txt" includeTokens="true" replaceWhitespaceWith="_" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="6"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
</analyzer>
</fieldType>
<fieldType name="text_autophrase_exact" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="org.apache.lucene.analysis.autophrase.AutoPhrasingTokenFilterFactory" phrases="autophrases.txt" includeTokens="true" replaceWhitespaceWith="_" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
</analyzer>
</fieldType>
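var SEARCH_URL_PREFIX = "http://localhost:8983/solr/rms_core_dev/autophrase?q=";
// searchTerm holds the raw user input; encode it so spaces and symbols survive in the URL.
var encodedTerm = encodeURIComponent(searchTerm);
var URL = SEARCH_URL_PREFIX + encodedTerm +
    "&q.op=AND" +
    "&wt=json" +
    "&defType=dismax" +
    // Exact fields carry the largest boosts, then entity_name, then content.
    "&qf=" + encodeURIComponent("entity_name_exact^10000.0 content_exact^5000.0 entity_name^1000.0 content^500 entity_author") +
    "&pf=" + encodeURIComponent("entity_name_exact entity_name content content_exact") +
    "&bq=" + encodeURIComponent("(entity_type:company^10000 OR entity_type:insight^7500)") +
    "&rows=100" +
    "&fl=*,score" +
    "&hl=true" +
    "&hl.useFastVectorHighlighter=true" +
    "&hl.q=" + encodedTerm +
    "&hl.fl=" + encodeURIComponent("entity_name content") +
    "&hl.bs.maxScan=15" +
    "&hl.snippets=1000" +
    "&hl.fragsize=50000";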
@joshdcollins

My general objectives are:

  • Support both partial and exact matches (currently handled with ngrams for partial matches and the _exact fields for exact matches)
  • Weight exact matches higher than partial matches
  • Weight matches in the entity_name (and entity_name_exact) field higher than in the content (and content_exact) field
  • Weight matches with an entity_type of company highest, and matches with an entity_type of insight second highest.
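To make the partial versus exact split concrete, here is a minimal sketch of what an index-time ngram filter with minGramSize="3" and maxGramSize="6" produces for a single token. It is a simplified model, not Solr's implementation: `ngrams` and `matchesPartialField` are hypothetical helpers, and the real filter's emission order and whether it also keeps the whole token may differ by version. The sketch assumes the query analyzer leaves the query token whole, as in the schema above.

```javascript
// Simplified model of index-time ngramming (assumption: grams only, no original token kept).
function ngrams(token, minGram, maxGram) {
  var grams = [];
  for (var start = 0; start < token.length; start++) {
    for (var len = minGram; len <= maxGram && start + len <= token.length; len++) {
      grams.push(token.slice(start, start + len));
    }
  }
  return grams;
}

// Hypothetical check: does a whole (un-ngrammed) query token equal one of the indexed grams?
function matchesPartialField(queryToken, indexedToken) {
  var indexed = ngrams(indexedToken.toLowerCase(), 3, 6);
  return indexed.indexOf(queryToken.toLowerCase()) !== -1;
}

console.log(ngrams("solr", 3, 6));                      // [ 'sol', 'solr', 'olr' ]
console.log(matchesPartialField("comp", "Company"));    // true
console.log(matchesPartialField("company", "Company")); // false: 7 characters exceeds maxGramSize
```

Under this model, a query token longer than six characters can only match through the _exact fields, which is one reason to keep both field families in qf.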

A few questions:

  • My max score is around 1.7. I'm concerned that my boosting may not be working as I expect, since I'm applying boosts (which I read as exponents) to a decimal number.
  • Are there any suggestions on how to debug highlighter results?
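One possible starting point for both questions is sketched below. The `^` values in qf and bq are multiplicative boosts rather than exponents, and absolute score values are generally not meaningful on their own; what matters is how documents rank relative to each other. Solr's `debugQuery=true` parameter adds a per-document explanation of the score to the response, and `debug.explain.structured=true` returns it as nested JSON. The helpers `withDebug` and `docsWithoutHighlights` are hypothetical names, and the sample response is made up for illustration; it only assumes the usual JSON layout with `response.docs` and a `highlighting` object keyed by document id.

```javascript
// Append Solr's debug parameters to an existing search URL.
function withDebug(url) {
  var separator = url.indexOf("?") === -1 ? "?" : "&";
  return url + separator + "debugQuery=true&debug.explain.structured=true";
}

// List the ids of returned documents that came back with no highlight snippets,
// so they can be inspected one at a time.
function docsWithoutHighlights(solrResponse) {
  var highlighting = solrResponse.highlighting || {};
  return solrResponse.response.docs
    .filter(function (doc) {
      var snippets = highlighting[doc.id] || {};
      return Object.keys(snippets).length === 0;
    })
    .map(function (doc) { return doc.id; });
}

// Made-up response used only to show the shape of the data.
var sample = {
  response: { docs: [{ id: "a" }, { id: "b" }] },
  highlighting: { a: { content: ["<em>acme</em> corp"] }, b: {} }
};
console.log(withDebug("http://localhost:8983/solr/rms_core_dev/autophrase?q=acme"));
console.log(docsWithoutHighlights(sample)); // [ 'b' ]
```

Documents that match but return no snippets are usually the most informative ones to compare against the explain output, since the match may have come from a field that is not listed in hl.fl.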
