Optical Character Recognition - OCR

From thelinuxwiki


Optical Character Recognition

software used

imagemagick (convert) version 6.8.8.10-r1

tesseract version 3.03_rc1

source file: PDF made up entirely of scanned images of a book.

destination format: text file

problem

I had scanned images of books that I wanted as electronic text. They were manuals for a computer course I was taking. The company offering the course would not provide a PDF copy of the book, so I found one that someone else had scanned into a PDF. The PDF consisted entirely of images, with no text; whoever made it probably chose the format only because PDF supports multipage documents. While taking the course, I wanted to be able to copy and paste commands into my terminal and into my wiki notes.

solution

step 1 - break up pdf into individual files using convert

The convert utility gets installed along with the imagemagick package.

$ convert -limit memory 1 -monitor -verbose -density 300 -colorspace Gray source_file.pdf output_file.png

The command above does not write a single file; it writes one file per page of the source PDF.

example...

output_file-0.png, output_file-1.png, output_file-2.png, etc. (convert numbers the pages starting from 0)
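If the page order matters later, zero-padded page numbers make the files sort correctly by name. The following is only a sketch: pdf_to_pages is a hypothetical helper name, and it relies on ImageMagick accepting a printf-style %03d in the output file name. The -density and -colorspace options are the ones used above.

```shell
# Hypothetical helper (not part of ImageMagick): split a PDF into
# one grayscale PNG per page, with zero-padded page numbers.
pdf_to_pages() {
    # $1 = source PDF, $2 = prefix for the output files
    # %03d is filled in by convert with the page index (000, 001, ...)
    convert -density 300 -colorspace Gray "$1" "$2-%03d.png"
}
```

Calling pdf_to_pages source_file.pdf output_file should then produce output_file-000.png, output_file-001.png, and so on.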

step 2 - translate image text into plain text using OCR

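tesseract performs the OCR. We feed it the page images one at a time and append the recognized text to a single output file. A bash loop processes all the files produced in step 1; ls -tr lists them oldest first, which matches the page order because convert writes the pages in sequence.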

$ ls -tr1 output_file*.png | while read -r line; do tesseract "$line" stdout >> output.txt; done
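Ordering by modification time breaks if the images are copied or touched after they are created. As an alternative sketch, the pages can be ordered by the number in the file name instead. The helper names list_pages and ocr_pages are hypothetical, and the sketch assumes GNU sort (for -V, version sort) and file names without newlines.

```shell
# Hypothetical helper: print the page images in page-number order.
list_pages() {
    # $1 = file name prefix, e.g. "output_file"
    # sort -V orders output_file-2.png before output_file-10.png
    printf '%s\n' "$1"-*.png | sort -V
}

# Hypothetical helper: OCR every page in order, appending to one text file.
ocr_pages() {
    # $1 = file name prefix, $2 = text file to append to
    list_pages "$1" | while IFS= read -r page; do
        tesseract "$page" stdout >> "$2"
    done
}
```

Usage would be ocr_pages output_file output.txt. Since the text is appended, remove any old output.txt before running it again.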

step 3 - cleanup

Invariably, the OCR will produce some mistranslations. For example, the document I was converting used "open quote" and "close quote" characters; however, ASCII only has the plain double quote character. When viewing the text file using less, the hex codes for the non-ASCII characters should be displayed. I would examine the file and identify patterns. Where a pattern had a reliable translation, I would apply it using the stream editor sed.
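As an illustration of that kind of cleanup, here is a minimal sketch that maps typographic quotes to their ASCII equivalents. The helper name ascii_quotes and the choice of replacements are assumptions; the patterns actually needed depend on what shows up in your own output, and the sketch assumes the text file is UTF-8.

```shell
# Hypothetical helper: replace typographic quotes with plain ASCII quotes.
# Each character gets its own s command, so the match also works
# byte-for-byte in a non-UTF-8 locale.
ascii_quotes() {
    # Reads the named files, or stdin if none are given; writes to stdout.
    sed -e 's/“/"/g' -e 's/”/"/g' \
        -e "s/‘/'/g" -e "s/’/'/g" "$@"
}
```

For example, ascii_quotes output.txt > output_clean.txt writes a cleaned copy and leaves the original untouched, so the result can be compared against it before the original is discarded.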