@JGVerdugo
Created February 9, 2014 18:55
Extracts text from multiple documents using Apache Tika
import glob
import os
# USAGE:
# 1. Download the Tika command-line application (tika-app-1.4.jar) from http://tika.apache.org/download.html.
# 2. Put the files you want to process in the same directory as the jar.
# 3. Put this script in the same directory (make sure you have Python installed).
# 4. On the command line, type "python dotika.py".
# If Tika can extract your files, a new file with the extension .new
# will be created for each file matching the "extension" filter (see
# the code below). This script does nothing but automate the extraction
# process.
#
# Here are the default values.
# If you need a different format or encoding, change these values.
# Be sure to read this first: http://tika.apache.org/1.4/gettingstarted.html
# (especially the section "Using Tika as a command line utility").
encoding = "UTF-8"
outputformat = "--text"
extension = "*.doc"
files = glob.glob(extension)
for file in files:
    newfile = file + ".new"
    print(newfile)
    # Quote the file names so that names containing spaces still work.
    os.system('java -jar tika-app-1.4.jar %s --encoding=%s "%s" > "%s"' % (outputformat, encoding, file, newfile))
@JGVerdugo
Author

I wrote this simple script to help colleagues take advantage of the Tika functionality without having to use a programming language. Since none of us are developers (we are linguists), I wanted to keep it as simple and self-explanatory as possible.

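For anyone who wants the same loop without building a shell command string, here is a minimal sketch using `subprocess`. It assumes Python 3, and that `java` and `tika-app-1.4.jar` are available in the current directory, as in the script. The names `build_tika_command`, `extract_all` and the `run` parameter are hypothetical helpers made up for this example; they are not part of the script or of Tika.

```python
import glob
import subprocess

TIKA_JAR = "tika-app-1.4.jar"  # same jar name as in the script; adjust to your download


def build_tika_command(path, output_format="--text", encoding="UTF-8", jar=TIKA_JAR):
    """Return the Tika command as a list of arguments (hypothetical helper)."""
    # Passing a list avoids shell quoting problems with spaces in file names.
    return ["java", "-jar", jar, output_format, "--encoding=" + encoding, path]


def extract_all(pattern="*.doc", run=subprocess.call):
    """Run Tika on every file matching `pattern`; return the names of the .new files."""
    written = []
    for path in sorted(glob.glob(pattern)):
        newfile = path + ".new"
        # Tika prints the extracted text to standard output, so redirect it to the new file.
        with open(newfile, "wb") as out:
            status = run(build_tika_command(path), stdout=out)
        if status != 0:
            print("Tika reported a problem with " + path)
        written.append(newfile)
    return written


if __name__ == "__main__":
    for name in extract_all():
        print(name)
```

The `run` parameter only exists so the loop can be tried out without Java installed, by passing a stand-in function; by default it calls `subprocess.call`, which starts `java` directly instead of going through the shell.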
