@shantanusingh
Last active September 19, 2023 10:10
Tesseract on Amazon-AMI

sudo yum update

##Install Redis
https://gist.github.com/dstroot/2776679

wget https://raw.github.com/gist/2776679/04ca3bbb9f085b192f6aca945120fe12d59f15f9/install-redis.sh
chmod +x install-redis.sh
./install-redis.sh

##Install Node JS
sudo yum install gcc-c++ make
sudo yum install openssl-devel
sudo yum install git

cd ~
mkdir libs && cd libs
git clone https://github.com/joyent/node.git # GitHub no longer serves the unauthenticated git:// protocol
cd node
git checkout v0.10.8
./configure
make
sudo make install
###Edit /etc/sudoers to add /usr/local/bin path:
sudo nano /etc/sudoers

...
Defaults secure_path = /sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin
...

###Verify
node -v
npm -v
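
The verify step can also be scripted. The sketch below is not part of the original gist: `missing_cmds` is a hypothetical helper that prints every listed command it cannot find on PATH, which is a little more informative than running `node -v` and `npm -v` by hand.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of every listed command that is
# not on PATH. Returns 1 if anything is missing, 0 otherwise.
missing_cmds() {
  local cmd status=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd"
      status=1
    fi
  done
  return "$status"
}

# Check the two binaries this section installs.
if missing_cmds node npm; then
  echo "node and npm are on PATH"
else
  echo "the commands listed above are missing from PATH" >&2
fi
```

To confirm that the sudoers change took effect, the same lookup can be run under sudo, for example `sudo sh -c 'command -v node'`, on the assumption that sudo applies `secure_path` to the commands it runs.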

##Tesseract

sudo yum install autoconf automake # aclocal ships with automake, it is not a separate package
sudo yum install libtool
sudo yum install libjpeg-devel libpng-devel libtiff-devel zlib-devel

##Install Leptonica
cd ~/libs
mkdir leptonica && cd leptonica
wget http://www.leptonica.com/source/leptonica-1.69.tar.gz
tar -zxvf leptonica-1.69.tar.gz
cd leptonica-1.69
./configure
make # Takes ~20 minutes on a t1.micro instance
sudo make install

##Install Tesseract
cd ~/libs
mkdir tesseract && cd tesseract
wget http://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.02.tar.gz
tar -zxvf tesseract-ocr-3.02.02.tar.gz
cd tesseract-ocr
./autogen.sh
./configure
make # Takes ~40 minutes on a t1.micro instance
sudo make install
sudo ldconfig

###Tesseract training data

cd /usr/local/share/tessdata
sudo wget http://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.eng.tar.gz
sudo tar xvf tesseract-ocr-3.02.eng.tar.gz
export TESSDATA_PREFIX=/usr/local/share/
sudo mv tesseract-ocr/tessdata/* .
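
A quick check that the language files ended up where Tesseract will look for them. This is a sketch, assuming Tesseract reads `<TESSDATA_PREFIX>/tessdata/<lang>.traineddata`, which matches the paths used above. `has_traineddata` is a hypothetical helper name, not something Tesseract provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if <prefix>/tessdata/<lang>.traineddata exists.
has_traineddata() {
  local prefix="${1%/}" lang="${2:-eng}"
  [ -f "$prefix/tessdata/$lang.traineddata" ]
}

# Fall back to the prefix used in this guide when the variable is unset.
prefix="${TESSDATA_PREFIX:-/usr/local/share/}"
if has_traineddata "$prefix" eng; then
  echo "eng.traineddata found under $prefix"
else
  echo "eng.traineddata not found under ${prefix%/}/tessdata" >&2
fi
```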

###Source TESSDATA_PREFIX

vi ~/.bash_profile
####Copy this line to the end
export TESSDATA_PREFIX=/usr/local/share/
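
If you would rather not edit the file by hand, the line can be appended from the shell. The sketch below uses a hypothetical `append_once` helper that adds the line only when an identical line is not already present, so running it twice is harmless. It assumes the target file ends with a newline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append a line to a file only if that exact line
# is not already in it.
append_once() {
  local line="$1" file="$2"
  touch "$file"
  # -x matches whole lines, -F treats the pattern as a fixed string.
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

For this guide the call would be `append_once 'export TESSDATA_PREFIX=/usr/local/share/' ~/.bash_profile`, followed by `source ~/.bash_profile` to load the variable into the current shell.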

###Verify
tesseract
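
Running `tesseract` with no arguments only prints the usage text. For an end-to-end check, the sketch below runs OCR on an image using the documented `tesseract imagename outputbase -l lang` form and prints the resulting text file. `ocr_image` and the default output name `ocr-output` are hypothetical, chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: OCR one image with the English data and print the text.
# Returns 127 if tesseract is not installed, 2 if the image does not exist.
ocr_image() {
  local image="$1" outbase="${2:-ocr-output}"
  command -v tesseract >/dev/null 2>&1 || {
    echo "tesseract not on PATH" >&2
    return 127
  }
  [ -f "$image" ] || {
    echo "no such image: $image" >&2
    return 2
  }
  # tesseract writes its result to <outbase>.txt
  tesseract "$image" "$outbase" -l eng && cat "$outbase.txt"
}
```

With a scanned page saved as, say, `sample.png` (a placeholder name), `ocr_image sample.png` should print the recognized text if the build and the training data are in place.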

@PavanCheruvu

Hi Shantanu,
I want to upload a PDF directly to AWS CloudSearch so that its text can be searched. Do I have to convert it to XML before uploading?
Regards, Pavan

@MattBrauer

Sources seem to be defunct.

@vivianamarquez

Source links no longer work.
