@samzhang111
Created January 10, 2015 02:00
Deduplication with Nutch
<property>
<name>db.signature.class</name>
<value>org.apache.nutch.crawl.TextProfileSignature</value>
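<description>The page signature implementation. TextProfileSignature
derives the signature from a quantized profile of token frequencies, so
pages with nearly identical text can receive the same signature. Signatures
created with this implementation will be used for duplicate detection
and removal.</description>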
</property>
<property>
<name>db.signature.text_profile.min_token_len</name>
<value>2</value>
<description>Minimum token length to be included in the signature.
</description>
</property>
<property>
<name>db.signature.text_profile.quant_rate</name>
<value>0.01</value>
<description>Profile frequencies will be rounded down to a multiple of
QUANT = (int)(QUANT_RATE * maxFreq), where maxFreq is the maximum token
frequency. If maxFreq > 1 then QUANT will be at least 2, which means that
for longer texts, tokens with frequency 1 will always be discarded.
</description>
</property>
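<!--
  Worked example of the quantization rule, following the formula exactly as
  stated in the description above. The numbers are illustrative assumptions,
  not measurements from a real crawl, and the actual implementation may
  round slightly differently.

  With quant_rate = 0.01 and maxFreq = 250:
    QUANT = (int)(0.01 * 250) = (int)(2.5) = 2
    a token with frequency 7 is rounded down to 6
    a token with frequency 1 is rounded down to 0 and is discarded

  With quant_rate = 0.01 and maxFreq = 50:
    (int)(0.01 * 50) = 0, but since maxFreq > 1, QUANT is raised to 2
    tokens with frequency 1 are still discarded

  A larger quant_rate gives a coarser profile, so more near-duplicate pages
  share a signature; a smaller one makes the signature more sensitive to
  small differences in the text.
-->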