03 April 2013

Currently I am working on task to scan documents to PDFs and retrieve their content. This article explains how you do it if you do not have a searchable PDF. The following command have been evaluated on Ubuntu Linux 12.10 and will most likely work on any other Debian based distribution.

Step #1 - Install Tesseract

sudo apt-get install tesseract-ocr

Step #2 - Create a simple multi page PDF

To do so I have use Libre Office Writer and saved the document as PDF. Make sure the document contains the language you try to capture using OCR.

Step #3 - Use ghostscript to convert the PDF into a multipage TIFF

gs -o multipage-tiffg4.tif -sDEVICE=tiffg4 multipage-input.pdf

Step #4 - Run Tesseract

The following tells Tesseract to scan the TIFF called multipage-tiffg4.tif using an English dictionary and store the captured output in a file called multipage-tiffg4-ocr-capture.txt. The .txt was is added by Tesseract itself.

tesseract multipage-tiffg4.tif multipage-tiffg4-ocr-capture -l en

Step #5 - Review the result

You made it! Enjoy the result

Tags: linux, pdf, tiff, ubuntu