protected CrawlController init() throws Exception {
    final CrawlConfig config = new CrawlConfig();
    config.setCrawlStorageFolder("/tmp");
    // Politeness delay is given in milliseconds
    config.setPolitenessDelay(800);
    config.setMaxDepthOfCrawling(3);
    config.setIncludeBinaryContentInCrawling(false);
    config.setResumableCrawling(true);
    config.setHaltOnError(false);
    final BasicURLNormalizer normalizer = BasicURLNormalizer.newBuilder()
            .idnNormalization(BasicURLNormalizer.IdnNormalization.NONE)
            .build();
    final PageFetcher pageFetcher = new PageFetcher(config, normalizer);

name: "crawler"
includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false
  - resource: false
    file: "crawler-conf.yaml"
    override: true

selenium.capabilities:
  goog:chromeOptions:
    args:
      - "--headless"
      - "--disable-gpu"

<document id='xxxx'>
  <label>category_of_document</label>
  <field name='text'>every document has some text</field>
  <field name='title'>some even have a title</field>
  <field name='description'>or some meaningful description</field>
</document>

<queryParser name="payload" class="com.digitalpebble.solr.PLDisMaxQParserPlugin" />

package com.digitalpebble.solr;

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;

public class PLDisMaxQParserPlugin extends QParserPlugin {
    public void init(NamedList args) {

package com.digitalpebble.solr;

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.DefaultSimilarity;

public class PayloadSimilarity extends DefaultSimilarity
{
    @Override
    public float scorePayload(int docId, String fieldName, int start, int end, byte[] payload, int offset, int length)
    {
        if (length > 0) {

package com.digitalpebble.solr;

import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.BooleanClause;