Optical Character Recognition - OCR

From thelinuxwiki
Revision as of 06:06, 19 September 2014 by Nighthawk (Talk | contribs)


Optical Character Recognition

software used

imagemagick (convert) version 6.8.8.10-r1

tesseract version 3.03_rc1

source file: PDF made up entirely of scanned images of a book.

destination format: text file


solution

step 1 - break up pdf into individual files using convert

The convert utility gets installed along with the imagemagick package.

$ convert -limit memory 1 -monitor -verbose -density 300 -colorspace Gray source_file.pdf output_file.png

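The command above does not write a single output file; it writes one PNG file per page of the source PDF. The -density 300 option renders each page at 300 DPI, -colorspace Gray converts the pages to grayscale, and -limit memory 1 pushes ImageMagick's pixel cache to disk so that a large PDF does not exhaust RAM.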

example...

output_file-0.png, output_file-1.png, output_file-2.png, etc...
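
If the page files need to sort correctly by name later on, convert accepts a printf-style placeholder such as %03d in the output filename, which zero-pads the page number. The following is only a sketch under that assumption: pdf_to_pages is a hypothetical wrapper name, not part of ImageMagick, and it reuses the options from the command above.

```shell
# pdf_to_pages is a hypothetical wrapper, not an ImageMagick command.
# $1 = source PDF, $2 = output filename prefix (both are placeholders).
pdf_to_pages() {
    # convert expands %03d to a zero-padded page index (normally
    # starting at 000), so a plain glob lists the pages in page order.
    convert -limit memory 1 -density 300 -colorspace Gray "$1" "$2-%03d.png"
}

# Usage (not executed here):
#   pdf_to_pages source_file.pdf output_file
```

With zero-padded names, output_file-010.png sorts after output_file-002.png, which is not true of the unpadded names.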

step 2 - translate image text into plain text using OCR

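tesseract will perform the OCR. We feed it the individual page images one at a time and append the recognized text to a single output file. A bash loop will process all the image files for us. Listing the files with ls -tr (oldest first) keeps them in page order, because convert wrote the pages in sequence.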

$ ls -tr1 output_file*.png | while read -r line; do tesseract "$line" stdout >> output.txt; done
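
Ordering by modification time breaks if the image files are later copied or touched. As an alternative, the pages can be ordered by the number in their filename. This is a minimal sketch, not the method used above: list_pages and ocr_pages are hypothetical helper names, and it assumes the files are named PREFIX-N.png as produced in step 1.

```shell
# list_pages and ocr_pages are hypothetical helpers, not tesseract commands.

# Print PREFIX-N.png files in numeric page order (2 before 10).
list_pages() {
    prefix=$1
    for f in "$prefix"-*.png; do
        [ -e "$f" ] || continue          # glob matched nothing
        n=${f#"$prefix"-}                # strip the prefix and dash
        n=${n%.png}                      # strip the extension
        printf '%s\t%s\n' "$n" "$f"      # page number, tab, filename
    done | sort -n | cut -f2-
}

# OCR every page in order and write the combined text to $2.
ocr_pages() {
    list_pages "$1" | while IFS= read -r page; do
        # </dev/null keeps tesseract from consuming the file list
        tesseract "$page" stdout < /dev/null
    done > "$2"
}

# Usage (not executed here):
#   ocr_pages output_file output.txt
```

Note that ocr_pages overwrites the output file on each run instead of appending to it.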

step 3 - cleanup

Invariably, the OCR will produce some mistranslations. For example, the document I was translating contained "open quote" and "close quote" characters, but ASCII only has the plain double quote character. When the text file is viewed with less, the non-ASCII characters should be displayed as hex codes, which makes them easy to spot. I examined the file to identify recurring patterns, and wherever a pattern had a reliable translation, I applied it with the stream editor sed.
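
As an illustration of such a sed translation, the sketch below replaces typographic quotes with their ASCII equivalents. It assumes the OCR output is UTF-8 encoded; normalize_quotes is a hypothetical helper name, and the set of characters that actually needs translating depends on the document.

```shell
# normalize_quotes is a hypothetical helper: it reads text on stdin and
# writes it to stdout with UTF-8 curly quotes replaced by ASCII quotes.
normalize_quotes() {
    # Build the UTF-8 byte sequences with printf octal escapes.
    ldq=$(printf '\342\200\234')   # U+201C left double quote
    rdq=$(printf '\342\200\235')   # U+201D right double quote
    lsq=$(printf '\342\200\230')   # U+2018 left single quote
    rsq=$(printf '\342\200\231')   # U+2019 right single quote
    # LC_ALL=C makes sed match the raw bytes regardless of locale.
    LC_ALL=C sed \
        -e "s/$ldq/\"/g" -e "s/$rdq/\"/g" \
        -e "s/$lsq/'/g"  -e "s/$rsq/'/g"
}

# Usage (not executed here):
#   normalize_quotes < output.txt > output_clean.txt
```

Writing the result to a new file keeps the original OCR output available in case a translation turns out to be wrong.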