@anarchivist
Created June 17, 2009 21:08
Drupal function to extract text using Tika deployed from within Solr
/**
 * Extracts text using Tika deployed from within Solr.
 *
 * Assumes the following about your Solr config:
 * - ExtractingRequestHandler lives at (solr URL)/extract, not /update/extract
 * - ExtractingRequestHandler is set to extract only (ext.extract.only=true)
 * - ExtractingRequestHandler only returns text within the body tags of the
 *   XHTML response (ext.xpath=/xhtml:html/xhtml:body/descendant:node())
 *
 * @param $path
 *   String containing the path of the file to have text extracted.
 *
 * @return string
 *   XHTML string containing the text extracted from the document, or an
 *   empty string if the request to Solr fails.
 */
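function apachesolr_extract_text($path) {
  // Read the file first so an unreadable path is not POSTed as an empty body.
  $data = file_get_contents($path);
  if ($data === FALSE) {
    watchdog('Apache Solr', 'Could not read file %path for text extraction.', array('%path' => $path), WATCHDOG_WARNING);
    return '';
  }

  $headers = array('Content-Type' => 'application/octet-stream');
  $solr_url = 'http://' . variable_get('apachesolr_host', 'localhost') . ':'
    . variable_get('apachesolr_port', '8983')
    . variable_get('apachesolr_path', '/solr') . '/extract';
  $rsp = drupal_http_request($solr_url, $headers, 'POST', $data);
  if ($rsp->code != 200) {
    $msg = 'HTTP %code error posting file %path to Solr server at %solr.';
    $vars = array('%code' => $rsp->code, '%path' => $path, '%solr' => $solr_url);
    watchdog('Apache Solr', $msg, $vars, WATCHDOG_WARNING);
    return '';
  }

  // The extracted XHTML is read from the <str> element of the Solr response.
  $xmldata = simplexml_load_string($rsp->data);
  if ($xmldata === FALSE) {
    return '';
  }
  // $xmldata->str is a SimpleXMLElement, so cast it to a string explicitly.
  $extract = (string) $xmldata->str;
  // Strip the XML declaration from the extracted XHTML.
  return str_replace('<?xml version="1.0" encoding="UTF-8"?>', '', $extract);
}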
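
The response-parsing step can be tried outside Drupal. The following is a minimal sketch, not part of the original gist: the helper name extract_body_from_solr_response() and the sample response are hypothetical, and the response shape (extracted XHTML escaped inside a single <str> element) is an assumption inferred from the function reading $xmldata->str. Real responses may differ between Solr versions.

```php
<?php
// Hypothetical sample of an extract-only response; the real shape depends on
// your Solr version and handler configuration.
$sample = '<?xml version="1.0" encoding="UTF-8"?>'
  . '<response><lst name="responseHeader"><int name="status">0</int></lst>'
  . '<str name="doc.pdf">&lt;?xml version="1.0" encoding="UTF-8"?&gt;'
  . '&lt;p&gt;Hello&lt;/p&gt;</str></response>';

/**
 * Hypothetical helper: pulls the extracted XHTML out of a Solr response body.
 * Returns an empty string if the body is not well-formed XML or has no <str>.
 */
function extract_body_from_solr_response($xml_string) {
  // Collect libxml errors internally instead of emitting PHP warnings.
  libxml_use_internal_errors(TRUE);
  $xml = simplexml_load_string($xml_string);
  if ($xml === FALSE || !isset($xml->str)) {
    return '';
  }
  // Cast the first <str> element to a string, then drop the XML declaration.
  $extract = (string) $xml->str;
  return str_replace('<?xml version="1.0" encoding="UTF-8"?>', '', $extract);
}

echo extract_body_from_solr_response($sample), "\n";
// prints "<p>Hello</p>"
```

Keeping the parsing in its own function makes it possible to check the behaviour against a saved response from your own Solr instance before wiring it into a Drupal module.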