If you've tried to understand where your query scores come from, you've probably heard of TFIDF. TFIDF determines how much weight a term should get in a particular field by multiplying its term frequency by its inverse document frequency.
What you may not realize is that TFIDF only reflects the strength of a term in a field. Lucene also employs a method for determining the weight of a term in the query.
To compute this weight, Lucene uses a query normalization process. The query norm scales the query terms' IDFs down to a unit vector. It's a single multiplier applied to every IDF in the query. The ultimate impact is to punish proportionally common terms beyond even their low IDF. This works to the point that:
- IDF of proportionally rare terms approaches IDF
- IDF of proportionally common terms approaches as low as ~0.1 IDF
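As a rough reference for the normalization described above, here is how Lucene's classic TF-IDF similarity is usually written. This is a sketch that assumes no query or term boosts. The symbols N (documents in the index), df_t (document frequency of term t), and w_t (normalized query weight) are notation introduced here, and the exact form of idf differs between Lucene versions.

```latex
\mathrm{idf}(t) = 1 + \ln\!\left(\frac{N}{\mathrm{df}_t + 1}\right)
\qquad
\mathrm{queryNorm}(q) = \frac{1}{\sqrt{\sum_{t \in q} \mathrm{idf}(t)^2}}
\qquad
w_t = \mathrm{idf}(t) \cdot \mathrm{queryNorm}(q)
```

Because every term in the query shares the same queryNorm, the weights w_t form a unit vector: a term whose IDF dominates the query ends up with a weight near 1, while a term with a comparatively small IDF is pushed toward 0.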
It's important to remember that this is entirely contextual. Given a corpus of VA state laws, consider these terms with their associated doc freqs:
- deer: 20
- hog: 20
- permit: 2000
A search for "deer hog" lets deer and hog both receive equally scaled IDF. With two equally weighted terms on a unit vector, each is scaled to 1/sqrt(2), or about 0.71. In the case of "deer permit", permit gets roughly 1/3 IDF while deer keeps close to its full IDF.
In one context, the score for the rare term deer might be driven by its IDF if it is paired with an equally rare term, say "hog". In a second context, the score for deer might be 0.1 IDF if a much, much rarer term shows up.
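To make the deer / hog / permit example concrete, here is a minimal Python sketch of the normalization. It is not Lucene code. It assumes the classic idf form 1 + ln(numDocs / (docFreq + 1)), no boosts, and a hypothetical corpus size of 10,000 documents (NUM_DOCS), since the corpus size isn't given above. The function names are illustrative only.

```python
import math

# Hypothetical corpus size: an assumption, not a figure from the post.
NUM_DOCS = 10_000

# Document frequencies from the example above.
DOC_FREQS = {"deer": 20, "hog": 20, "permit": 2000}


def idf(doc_freq, num_docs=NUM_DOCS):
    """Classic Lucene-style IDF (assumed form): 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def normalized_weights(terms):
    """Scale each query term's IDF by one shared multiplier, the query norm."""
    idfs = {t: idf(DOC_FREQS[t]) for t in terms}
    # Query norm: 1 / sqrt(sum of squared IDFs), so the weights form a unit vector.
    query_norm = 1.0 / math.sqrt(sum(v * v for v in idfs.values()))
    return {t: v * query_norm for t, v in idfs.items()}


for query in (["deer", "hog"], ["deer", "permit"]):
    weights = normalized_weights(query)
    print(query, {t: round(w, 2) for t, w in weights.items()})
```

Under these assumptions, deer and hog come out equal at 1/sqrt(2), while in "deer permit" the weight for permit lands at roughly a third and deer stays close to 1. A different corpus size would shift the exact values but not the pattern.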