protected CrawlController init() throws Exception {
    final CrawlConfig config = new CrawlConfig();
    config.setCrawlStorageFolder("/tmp");
    config.setPolitenessDelay(800);
    config.setMaxDepthOfCrawling(3);
    config.setIncludeBinaryContentInCrawling(false);
    config.setResumableCrawling(true);
    config.setHaltOnError(false);
    final BasicURLNormalizer normalizer = BasicURLNormalizer.newBuilder()
            .idnNormalization(BasicURLNormalizer.IdnNormalization.NONE)
            .build();
    final PageFetcher pageFetcher = new PageFetcher(config, normalizer);
name: "crawler"
includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false
  - resource: false
    file: "crawler-conf.yaml"
    override: true
selenium.capabilities:
  goog:chromeOptions:
    args:
      - "--headless"
      - "--disable-gpu"
<document id='xxxx'>
  <label>category_of_document</label>
  <field name='text'>every document has some text</field>
  <field name='title'>some even have a title</field>
  <field name='description'>or some meaningful description</field>
</document>
<queryParser name="payload" class="com.digitalpebble.solr.PLDisMaxQParserPlugin" />
package com.digitalpebble.solr;

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;

public class PLDisMaxQParserPlugin extends QParserPlugin {
    public void init(NamedList args) {
package com.digitalpebble.solr;

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.DefaultSimilarity;

public class PayloadSimilarity extends DefaultSimilarity {
    @Override
    public float scorePayload(int docId, String fieldName, int start, int end,
            byte[] payload, int offset, int length) {
        if (length > 0) {
package com.digitalpebble.solr;

import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.BooleanClause;