Notes on Lucene query normalization

When you're trying to understand the nature of your query scores, you've probably heard of TF*IDF. TF*IDF determines how much weight a given term should be given in a particular field by multiplying the term's frequency in that field by its inverse document frequency.
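As a refresher, here's a minimal sketch of those formulas as they appear in Lucene's classic (pre-BM25) DefaultSimilarity, which was current when these notes were written. The corpus size and frequencies are made-up illustrative numbers:

```python
import math

def tf(freq):
    # Classic Lucene tf: square root of the term's frequency in the field
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    # Classic Lucene idf: 1 + ln(numDocs / (docFreq + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# A term appearing 4 times in a field, found in 20 of 10000 docs:
weight = tf(4) * idf(doc_freq=20, num_docs=10000)  # ~14.3
```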

What you may not realize is that TF*IDF only reflects the strength of a term in a field. Lucene also employs a method for determining the weight of a term in the query itself.

To compute this weight, Lucene uses a query normalization process. The query norm scales every query term's IDF toward the unit vector: it's a single multiplier, 1 / sqrt(sum of each term's squared IDF), applied to every term's IDF (a sketch follows the list below). The ultimate impact is to punish proportionally common terms beyond even their already low IDF. This works to the point that:

  • the scaled IDF of proportionally rare terms approaches their full IDF
  • the scaled IDF of proportionally common terms can fall as low as ~0.1 of their IDF
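Here's a minimal sketch of that multiplier, following DefaultSimilarity's queryNorm (1 over the square root of the sum of squared query-term weights; term boosts assumed to be 1):

```python
import math

def query_norm(idfs):
    # DefaultSimilarity.queryNorm: 1 / sqrt(sum of squared term weights)
    return 1.0 / math.sqrt(sum(w * w for w in idfs))

# The fraction of its IDF each term keeps is idf * queryNorm:
idfs = [7.2, 2.6]                          # one rare term, one common term
norm = query_norm(idfs)
print([round(w * norm, 2) for w in idfs])  # [0.94, 0.34]
```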

It's important to remember this is entirely contextual. Given a corpus of VA state laws, think of these terms with their associated doc freqs:

  • deer: 20
  • hog: 20
  • permit: 2000

A search for "deer hog" allows deer and hog to both receive equally scaled IDF, in this case IDF / sqrt(2) each. In the case of "deer permit", permit gets only about 1/3 of its IDF while deer keeps nearly its full IDF.
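Working that through with the doc freqs above, and an assumed corpus of 10,000 documents (the corpus size isn't given in these notes):

```python
import math

N = 10000  # assumed corpus size

def idf(df):
    return 1.0 + math.log(N / (df + 1))

deer, hog, permit = idf(20), idf(20), idf(2000)  # ~7.17, ~7.17, ~2.61

# "deer hog": each term keeps 1/sqrt(2) (~0.71) of its IDF
norm = 1.0 / math.sqrt(deer ** 2 + hog ** 2)
print(deer * norm)                 # ~0.71

# "deer permit": deer keeps ~94% of its IDF, permit only ~34%
norm = 1.0 / math.sqrt(deer ** 2 + permit ** 2)
print(deer * norm, permit * norm)  # ~0.94, ~0.34
```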

In one context, the score for the rare term deer might be driven by its full IDF when paired with an equally rare term, say "hog". In a second context, the score for deer might be only 0.1 of its IDF if a much, much rarer term shows up alongside it.
