# convert pngs to jpgs
# requires imagemagick
Param(
    [int]$size = 1000,
    [string]$indir = ".",
    [string]$outdir = $indir
)
if (!(test-path $outdir)) {
# ocr tif/png to hocr (html)
# requires tesseract
Param(
    [string]$ext = "tif",
    [string]$indir = ".",
    [string]$outdir = $indir
)
if (!(test-path $outdir)) {
<#
Processes raw source pdfs, producing per page: 1 txt, 1 hocr, 1 jpg.
Requires imagemagick w/ ghostscript, tesseract.
Subscripts: pdf2png.ps1, ocr.ps1, hocr.ps1, png2jpg.ps1
#>
param(
    [string]$indir = ".",
    [string]$outbase = $indir
<entity
    name="sample"
    transformer="RegexTransformer,TemplateTransformer">
    <field column="test_ignored" template="BLAH" />
    <field column="test_ignored" sourceColName="id" regex="(.+)" />
    <!--
    test_ignored will equal 'BLAH' because TemplateTransformer acts last,
    even though it is written first.
<#
Peter Tyrrell, 2015
Convert Harvest time report to Quickbooks import format in Windows-1252.
#>
param (
    [string]$indir = ".",
    [string]$outdir = $indir
)
<#
Requires Stanford NER, Java 1.8+
formats = slashTags, inlineXML, xml, tsv, tabbedEntities
#>
param(
    [Parameter(Mandatory=$true,Position=0)]
    [string]$file,
    [Parameter(Mandatory=$true,Position=1)]
<#
parse tsv
Categories: person, location, organization, misc, money, percent, date, time (depending on classifier used to produce the tsv)
Outfile: Results written to console if outfile not provided. If all categories, outfile is used as a filename template.
#>
param(
    [Parameter(Mandatory=$true,Position=0)]
<!-- suggest fields -->
<copyField source="title" dest="title_suggest" />
<copyField source="title" dest="title_suggest_edge" />
<copyField source="title" dest="title_suggest_ngram" />
<copyField source="title" dest="title_s" />
<copyField source="collection" dest="collection_suggest" />
<copyField source="collection" dest="collection_suggest_edge" />
<copyField source="collection" dest="collection_suggest_ngram" />
<copyField source="collection" dest="collection_s" />
<copyField source="universe" dest="universe_suggest" />
<!-- request handler to return typeahead suggestions -->
<requestHandler name="/suggest" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="echoParams">explicit</str>
        <str name="defType">edismax</str>
        <str name="rows">10</str>
        <str name="fl">universe,collection,title,score</str>
        <str name="qf">title_suggest^30 title_suggest_ngram^50.0 collection_suggest^15 collection_suggest_ngram^25.0</str>
        <str name="pf">title_suggest_edge^50.0 collection_suggest_edge^25.0</str>
        <str name="group">true</str>
<!-- text_suggest : Matches whole terms in the suggest text -->
<fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="1"
                catenateNumbers="1"