Indexing XKCD with Lucidworks Fusion and the Google Vision API

Overview

This Seed Streams guide illustrates how to use Lucidworks Fusion to crawl a specific set of documents on a website whose URIs match a regular expression. A JavaScript parsing stage extracts each page's img src attribute and inserts it into the index for use by later indexing stages. Google's Vision API is then used to extract additional fields from the images.

Start Fusion and Create a New Application

  1. Start a Fusion instance on Google Cloud. Click the link the script outputs to navigate to the Fusion instance page. Set a password, then log in as admin with the new password.
  2. Create a new application. Call it XKCD.
  3. Click on the new application.

Add a New Datasource and Limit the Documents

  1. Create a new datasource under Indexing..Datasources. Add a Web source. Add https://xkcd.com as a start link. Limit the crawl to a maximum of 200 documents. Click save at the top right.
  2. Navigate to Indexing..Index Pipelines. Add a new JavaScript pipeline stage (under advanced). Copy the javascript_indexing_pipeline_stage.js code below and paste it into the script body. Click save.
  3. Add a new Include Documents stage. Add a new field, 'id', and set the regex pattern to .*/[0-9]{1,5}/*, then click save. This limits the documents to comic pages, which appear in the format https://xkcd.com/501/, https://xkcd.com/4/, etc. (see the sanity check after this list).
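
If you want to verify the pattern before crawling, the snippet below is a minimal sanity check in plain JavaScript (runnable with node; it is not a Fusion stage). It assumes the Include Documents stage requires a full match against the id value, hence the added anchors:

// Hypothetical check of the Include Documents pattern; not part of Fusion.
// The ^ and $ anchors reflect the assumption that the stage matches the whole id.
var pattern = new RegExp("^.*/[0-9]{1,5}/*$");
var ids = ["https://xkcd.com/501/", "https://xkcd.com/4/", "https://xkcd.com/about/"];
ids.forEach(function (id) {
  // prints true for comic pages, false for everything else
  console.log(id + " -> " + pattern.test(id));
});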

Configure the Parsers

  1. Ensure there is NO Tika parser in the Index Pipeline. You'll use a parser stage for Tika.
  2. Navigate to Indexing..Index Workbench. Remove all parsers except Tika and fallback from the XKCD datasource by clicking on a stage, then clicking remove stage below it. Repeat until only Tika and fallback remain. Click save.
  3. Click on the Tika parser stage and check Return parsed content as XML or HTML and Return original XML and HTML instead of Tika XML output. Click apply below. Click save at top right.

Start the Crawl

  1. Navigate to Indexing..Datasources, click run and then the start button. The crawler will start and then complete in about 30 seconds.
  2. Navigate to Querying..Query Workbench. Set the display fields to id and image_url_s.
  3. Run a search and ensure the image_url_s and image_url_t fields are present.

Add Vision

  1. Note that the text of each comic is already available in the <div id="transcript"> tag on the comic page; Google's Vision API, however, returns additional data about the images.
  2. Navigate to Indexing..Index Pipelines. Add a new REST Query pipeline stage. Set the Endpoint URI to https://vision.googleapis.com/v1/images:annotate. Change the call method to post.
  3. Create a query parameter with a property name of key. Set the property value to your Google API key for the Vision API (see Creating a Google Vision API Key below).
  4. Copy the request_entity_indexing_pipeline.json string below and paste it into the request entity field.
  5. Add a mapping of returned values XPath Expression. Use //responses/fullTextAnnotation/text for the first expression. Set the target field to gv_text_s. Click Append To Existing Values In Target Field. Click save at top. (The response shape this XPath navigates is sketched after this list.)
  6. Navigate to Indexing..Datasources, click clear datasource then run..start to restart the crawl.
  7. Run a search and ensure the gv_text_s field is present.
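
For reference, a TEXT_DETECTION request to images:annotate returns a response shaped roughly as below. This sample is trimmed to the single field the guide maps into gv_text_s; real responses carry other annotation fields as well, and the text value here is only illustrative:

{
  "responses": [
    {
      "fullTextAnnotation": {
        "text": "example text detected in the comic image"
      }
    }
  ]
}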

Creating a Google Vision API Key

  1. Navigate to the Credentials dashboard. You may need to select the correct project.
  2. Click the create credentials button. Select API key. Copy the API key when it appears. (You can verify the key with the curl sketch after this list.)
  3. Click restrict key. In the application restrictions tab, select IP addresses. Enter just the IP address of the Fusion instance from your browser's address bar, without the colon or port number.
  4. Click the API restrictions tab. Set the API restrictions to Cloud Vision API.
  5. Click save.
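
As a quick sanity check (a sketch, not part of the pipeline itself), you can call the endpoint directly with curl. Because the key is restricted to the Fusion instance's IP, run this from that machine; replace YOUR_API_KEY, and swap the example imageUri for any public image URL, since the ${image_url_s} placeholder in the request entity below only resolves inside Fusion:

$ curl -s -X POST "https://vision.googleapis.com/v1/images:annotate?key=YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"requests":[{"image":{"source":{"imageUri":"https://imgs.xkcd.com/comics/compiling.png"}},"features":[{"type":"TEXT_DETECTION","maxResults":50}]}]}'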

Debugging

Tail the connectors-classic.log in the ./fusion/4.0.2/var/log/connectors/connectors-classic directory to debug:

$ cd ./fusion/4.0.2/var/log/connectors/connectors-classic
$ tail -f connectors-classic.log

javascript_indexing_pipeline_stage.js

function (doc) {
  var Jsoup = org.jsoup.Jsoup; // jsoup ships with Fusion's JavaScript stage

  try {
    // parse the fetched page body into a jsoup Document
    var content = doc.getFirstFieldValue("body");
    var jdoc = Jsoup.parse(content);

    // walk the divs looking for the one with id="bottom"
    var iter = jdoc.select("div").iterator();
    var bottomDiv = null;
    while (iter.hasNext()) {
      var div = iter.next();
      if (div.attr("id").equals("bottom")) {
        bottomDiv = div; // found the containing div of the img
        break;
      }
    }

    if (bottomDiv != null) {
      var img = bottomDiv.child(0); // get the image element
      logger.info("SRC: " + img.attr("src")); // log the image URL
      doc.addField("image_url", img.attr("src"));
    } else {
      logger.warn("no div with id=\"bottom\" was found");
    }
  } catch (e) {
    logger.warn("something went wrong while parsing the page");
    logger.error(e);
  }
  return doc;
}

request_entity_indexing_pipeline.json

{
  "requests": [{
    "image": {
      "source": {
        "imageUri": "${image_url_s}"
      }
    },
    "features": [
      { "type": "TEXT_DETECTION", "maxResults": 50 }
    ]
  }]
}