@peaeater
peaeater / png2jpg.ps1
Created November 10, 2014 23:57
Converts PNGs to JPGs with ImageMagick.
# convert pngs to jpgs
# requires imagemagick
Param(
    [int]$size = 1000,
    [string]$indir = ".",
    [string]$outdir = $indir
)
if (!(test-path $outdir)) {
peaeater / hocr.ps1
Created November 11, 2014 00:00
OCRs an image file to text with coordinate information in hOCR format, using Tesseract.
# ocr tif/png to hocr (html)
# requires tesseract
Param(
    [string]$ext = "tif",
    [string]$indir = ".",
    [string]$outdir = $indir
)
if (!(test-path $outdir)) {
peaeater / raw-ocr.ps1
Created November 11, 2014 00:01
Converts PDFs to JPGs and OCRed text with ImageMagick and Tesseract.
<#
Processes raw source pdfs, producing per page: 1 txt, 1 hocr, 1 jpg.
Requires imagemagick w/ ghostscript, tesseract.
Subscripts: pdf2png.ps1, ocr.ps1, hocr.ps1, png2jpg.ps1
#>
param(
    [string]$indir = ".",
    [string]$outbase = $indir
peaeater / solr-dih-transform-order-sample.xml
Created November 12, 2014 19:58
Sample Solr DIH entity demonstrating the order in which transformers act.
<entity
  name="sample"
  transformer="RegexTransformer,TemplateTransformer">
  <field column="test_ignored" template="BLAH" />
  <field column="test_ignored" sourceColName="id" regex="(.+)" />
<!--
test_ignored will equal 'BLAH' because TemplateTransformer is listed last in the
transformer attribute and so acts last, even though its field is written first.
peaeater / harvest2qbooks.ps1
Last active August 29, 2015 14:18
Converts a Harvest timer CSV export to QuickBooks import format.
<#
Peter Tyrrell, 2015
Convert Harvest time report to Quickbooks import format in Windows-1252.
#>
param (
    [string]$indir = ".",
    [string]$outdir = $indir
)
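
The Windows-1252 requirement mentioned in the script header can be illustrated with a minimal sketch. This is not the gist's actual conversion logic: the output path, the column names, and the sample row are hypothetical, and it assumes a PowerShell host where code page 1252 is available (Windows PowerShell, or PowerShell 7, which should expose it). It shows only the encoding step.

```powershell
# Hypothetical output path and sample row, for illustration only.
$outFile = Join-Path ([System.IO.Path]::GetTempPath()) 'quickbooks-sample.csv'
$eAcute  = [char]0x00E9   # e with acute accent, built from its code point to avoid script-encoding issues
$rows = @(
    [pscustomobject]@{ Client = "Caf${eAcute} Noir"; Hours = '1.50' }
)

# Turn the objects into CSV text lines (header line plus one line per row).
$csvLines = $rows | ConvertTo-Csv -NoTypeInformation

# Write the lines with an explicit Windows-1252 encoding instead of the host default.
$cp1252 = [System.Text.Encoding]::GetEncoding(1252)
[System.IO.File]::WriteAllLines($outFile, [string[]]$csvLines, $cp1252)

# In Windows-1252 the accented character is stored as the single byte 0xE9.
$bytes = [System.IO.File]::ReadAllBytes($outFile)
```

Passing the encoding object explicitly avoids relying on the default encoding of `Out-File` or `Set-Content`, which differs between PowerShell versions.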
peaeater / ner.ps1
Last active November 25, 2015 16:17
Takes a text input file and, by default, produces a tab-delimited (TSV) output file. The output has no header row; its three columns always appear in the same order.
<#
Requires Stanford NER, Java 1.8+
formats = slashTags, inlineXML, xml, tsv, tabbedEntities
#>
param(
[Parameter(Mandatory=$true,Position=0)]
[string]$file,
[Parameter(Mandatory=$true,Position=1)]
peaeater / parsetsv.ps1
Created November 25, 2015 16:22
Takes a tab-delimited (TSV) input file produced by ner.ps1 and writes a text file for each category found. If a single category is named, only one output text file is created. If no output file is given, results are written to the console instead.
<#
parse tsv
Categories: person, location, organization, misc, money, percent, date, time (depending on classifier used to produce the tsv)
Outfile: Results written to console if outfile not provided. If all categories, outfile is used as a filename template.
#>
param(
[Parameter(Mandatory=$true,Position=0)]
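
The idea of splitting a headerless, three-column TSV by category can be sketched roughly as follows. This is an illustration, not the code in parsetsv.ps1: the column names (`Entity`, `Category`, `Text`), their order, and the sample rows are assumptions, and the real script's parameters are not shown here.

```powershell
# Sample rows standing in for a TSV file: no header row, three tab-separated columns.
# The column order (entity, category, trailing text) is an assumption for illustration.
$lines = @(
    "Ada Lovelace`tPERSON`twas born in",
    "London`tLOCATION`tand worked with",
    "Charles Babbage`tPERSON`t",
    "Ada Lovelace`tPERSON`t"
)

# Supply column names explicitly because the data has no header row.
$rows = $lines | ConvertFrom-Csv -Delimiter "`t" -Header 'Entity', 'Category', 'Text'

# Build a lookup of category -> sorted, de-duplicated entity names.
$byCategory = @{}
foreach ($group in ($rows | Where-Object { $_.Category } | Group-Object -Property Category)) {
    $byCategory[$group.Name] = @($group.Group.Entity | Sort-Object -Unique)
}

# Write one line per category to the console, as when no output file is provided.
foreach ($name in ($byCategory.Keys | Sort-Object)) {
    '{0}: {1}' -f $name, ($byCategory[$name] -join ', ')
}
```

To read from a file instead of an in-memory array, `Import-Csv -Path <file> -Delimiter "`t" -Header ...` accepts the same `-Delimiter` and `-Header` arguments.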
peaeater / gist:5810550
Created June 18, 2013 23:47
Schema copyField definitions for suggest fields
<!-- suggest fields -->
<copyField source="title" dest="title_suggest" />
<copyField source="title" dest="title_suggest_edge" />
<copyField source="title" dest="title_suggest_ngram" />
<copyField source="title" dest="title_s" />
<copyField source="collection" dest="collection_suggest" />
<copyField source="collection" dest="collection_suggest_edge" />
<copyField source="collection" dest="collection_suggest_ngram" />
<copyField source="collection" dest="collection_s" />
<copyField source="universe" dest="universe_suggest" />
peaeater / gist:5810540
Created June 18, 2013 23:46
solrconfig.xml /suggest request handler
<!-- request handler to return typeahead suggestions -->
<requestHandler name="/suggest" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="defType">edismax</str>
    <str name="rows">10</str>
    <str name="fl">universe,collection,title,score</str>
    <str name="qf">title_suggest^30 title_suggest_ngram^50.0 collection_suggest^15 collection_suggest_ngram^25.0</str>
    <str name="pf">title_suggest_edge^50.0 collection_suggest_edge^25.0</str>
    <str name="group">true</str>
peaeater / gist:5810559
Created June 18, 2013 23:48
text_suggest field type
<!-- text_suggest : Matches whole terms in the suggest text -->
<fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"