@peaeater
Created November 11, 2014 00:01
Converts PDFs to JPGs and OCRed text with imagemagick and tesseract.
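<#
Processes raw source PDFs, producing per page: 1 txt, 1 hocr, 1 jpg.
Requires ImageMagick with Ghostscript, and Tesseract.
Helper scripts, called from the current directory: pdf2png.ps1, ocr.ps1, hocr.ps1, png2jpg.ps1
Input filenames are split on "_" into universe, collection, and title.
#>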
param(
    [string]$indir = ".",
    [string]$outbase = $indir
)
# lower-case a string and replace spaces and ampersands with hyphens
function WebSafe([string]$s) {
    return $s.ToLowerInvariant().Replace(" ", "-").Replace("&", "-")
}
$files = Get-ChildItem -Path "$indir\*" -Include *.pdf
foreach ($file in $files) {
    # get metadata from filename: the first three "_"-separated parts
    # of the base name are universe, collection, and title
    $split = $file.BaseName.Split("_")
    $uni = $split[0]
    $col = $split[1]
    $title = $split[2]
    # convert pdf to png per page
    & .\pdf2png.ps1 -in $file.FullName
    # convert png to txt
    & .\ocr.ps1 -ext "png" -indir "$indir\png" -outdir "$indir\txt"
    # convert png to xml (hocr)
    & .\hocr.ps1 -ext "png" -indir "$indir\png" -outdir "$indir\xml"
    # rename hocr extension .html => .xml
    Get-ChildItem "$indir\xml\*.html" | Rename-Item -NewName { $_.BaseName + ".xml" }
    # create path for jpgs
    $uniSafe = WebSafe $uni
    $colSafe = WebSafe $col
    $titleSafe = WebSafe $title
    $jpgdir = "$outbase\$uniSafe\$colSafe\$titleSafe"
    if (!(Test-Path $jpgdir)) {
        # discard the DirectoryInfo object that mkdir writes to the pipeline
        mkdir $jpgdir | Out-Null
    }
    # convert .png to .jpg
    & .\png2jpg.ps1 -size 1000 -indir "$indir\png" -outdir $jpgdir
    # get page count; @() forces an array so .Count is valid even for a single page
    $pagecount = @(Get-ChildItem "$jpgdir\*" -Include *.jpg).Count
    # write metadata to manifest.xml
    # NOTE: the file is overwritten in place, so the placeholders are only
    # present for the first PDF unless manifest.xml is restored between files
    $manifest = Get-Content "$indir\manifest.xml"
    $manifest | ForEach-Object {
        $_ -replace '{UNIVERSE}', $uni `
           -replace '{COLLECTION}', $col `
           -replace '{TITLE}', $title `
           -replace '{PAGECOUNT}', $pagecount `
           -replace '{OCRTYPE}', 'hocr'
    } | Set-Content "$indir\manifest.xml"
    # clean up the intermediate pngs
    Remove-Item "$indir\png" -Recurse
}