Difference between revisions of "Optical Character Recognition - OCR"

From thelinuxwiki

Revision as of 06:14, 19 September 2014

Optical Character Recognition

software used

imagemagick (convert) version 6.8.8.10-r1

tesseract version 3.03_rc1

source file: PDF made up entirely of scanned images of a book.

destination format: text file

problem

I had scanned images of books that I wanted in electronic text format. They were manuals for a computer course I was taking. The pwd


solution

step 1 - break up pdf into individual files using convert

The convert utility is installed as part of the imagemagick package.

$ convert -limit memory 1 -monitor -verbose -density 300 -colorspace Gray source_file.pdf output_file.png

The command above does not produce a single file. It writes one file per page of the source PDF, numbered from 0.

example...

output_file-0.png, output_file-1.png, output_file-2.png, etc...

step 2 - translate image text into plain text using OCR

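tesseract performs the OCR. We feed it the individual page images and append the recognized text to a single output file. A bash loop processes all the image files from step 1. ls -tr lists them oldest first, which should match page order because convert writes the pages in sequence.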

$ ls -tr1 output_file*.png | while read -r line; do tesseract "$line" stdout >> output.txt; done
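
The loop above depends on file modification times to keep the pages in order. If the files have been copied or touched since step 1, that order may be lost. The following is a minimal sketch that orders by the page number in the file name instead. It assumes the output_file-N.png naming from step 1 and that the prefix itself contains no hyphen. The helper name pages_in_order is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the page images in numeric page order.
# Assumes names of the form PREFIX-N.png where PREFIX contains no hyphen.
pages_in_order() {
    prefix=$1
    for f in "$prefix"-*.png; do
        # Skip the unexpanded pattern when no files match.
        [ -e "$f" ] && printf '%s\n' "$f"
    done | sort -t- -k2,2n
}

# OCR every page in page order and append the text to output.txt.
pages_in_order output_file | while IFS= read -r page; do
    tesseract "$page" stdout >> output.txt
done
```

Sorting on the number keeps output_file-10.png after output_file-9.png, which a plain alphabetical sort would not.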

step 3 - cleanup

Invariably, the OCR will mistranslate some characters. For example, the document I was translating contained "open quote" and "close quote" characters, but ASCII only has the plain double quote character. When viewing the text file using less, the hex codes for the non-ASCII characters should be displayed. I would examine the file and identify patterns. Where a pattern had a reliable translation, I would apply it using the stream editor sed.
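
As an illustration of that kind of sed cleanup, here is a minimal sketch that replaces typographic quotes with their ASCII equivalents. It assumes the OCR output is UTF-8 and that the "open quote" and "close quote" characters are U+201C and U+201D. The single-quote rules, the helper name ascii_quotes and the file name output_clean.txt are additions made for this example, not taken from the original procedure.

```shell
#!/bin/sh
# Hypothetical helper: replace typographic quotes with ASCII quotes.
# Reads UTF-8 text on stdin and writes the result to stdout.
# Each character gets its own rule, so no multibyte bracket expression is needed.
ascii_quotes() {
    sed -e 's/“/"/g' -e 's/”/"/g' \
        -e "s/‘/'/g" -e "s/’/'/g"
}

# Clean the OCR output into a new file, leaving the original untouched.
if [ -f output.txt ]; then
    ascii_quotes < output.txt > output_clean.txt
fi
```

Writing to a separate file makes it easy to compare the result against output.txt before discarding the original.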