Comparison of free OCR software

Abstract

This page presents the results of comparing three free OCR programs in terms of accuracy of recognition and processing time. The tools had to analyze scanned images of a simple text without formatting. All three programs are command-line utilities which were installed and used under Debian GNU/Linux.

Test setup

In this comparison the programs GOCR (package version 0.46-2), Ocrad (package version 0.17-3) and Tesseract (package version 2.03-2) were used. The software was installed using the Debian GNU/Linux sid packages.

The GNU General Public License version 2 served as the test text. The first DIN A4 page of the license was printed from Kate on a Canon PIXMA iP4000 and scanned with XSane on an AGFA SnapScan e25 as a grayscale image at three different resolutions.

Test images: original text, 150 dpi, 300 dpi, 600 dpi

The tests were run with a shell script, which contains the parameters used for the different programs. The processing time was measured with the program time. To determine the accuracy of recognition, the text files created were compared to the original text with dwdiff.
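The shell script itself is not reproduced here, so the following Python sketch only illustrates how the three programs and dwdiff can be invoked in their most basic form. The function names, file names and directory layout are hypothetical, and the parameters actually used in the test may have differed. The image format each program accepts depends on the program and its version, so a single input file may not work for all three.

```python
import subprocess
from pathlib import Path
from typing import Dict, List


def ocr_commands(image: Path, out_dir: Path) -> Dict[str, List[str]]:
    """Return one argument list per OCR program (basic invocations only)."""
    return {
        # gocr reads the image given with -i and writes the text to the -o file.
        "gocr": ["gocr", "-i", str(image), "-o", str(out_dir / "gocr.txt")],
        # ocrad writes to the -o file; the image is a positional argument.
        "ocrad": ["ocrad", "-o", str(out_dir / "ocrad.txt"), str(image)],
        # tesseract takes an output base name and appends ".txt" itself.
        "tesseract": ["tesseract", str(image), str(out_dir / "tesseract")],
    }


def dwdiff_command(original: Path, recognized: Path) -> List[str]:
    """Return the word-level comparison of the original and the OCR result."""
    return ["dwdiff", str(original), str(recognized)]


def run_all(commands: Dict[str, List[str]]) -> None:
    """Run every OCR command in turn and stop at the first failure."""
    for argv in commands.values():
        subprocess.run(argv, check=True)
```

Calling run_all(ocr_commands(Path("page.pnm"), Path("out"))) would then produce out/gocr.txt, out/ocrad.txt and out/tesseract.txt, provided the three programs are installed and accept the given image.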

Accuracy of recognition

The program dwdiff was used to compare the results produced by the OCR software to the original text on a word level. The output was converted to XHTML and the incorrect words were highlighted in red. The accuracy of recognition is the ratio of correctly recognized words to the number of words in the text. The table contains the results together with links to the formatted output. The following diagram visualizes the data from the table.

Accuracy of recognition by image resolution:

Program     150 dpi   300 dpi   600 dpi
GOCR        80%       82%       65%
Ocrad       69%       83%       83%
Tesseract   98%       98%       98%

[Diagram: Accuracy of recognition]
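The accuracy measure defined in this section, correctly recognized words divided by the number of words in the original text, can be sketched in Python. Here difflib.SequenceMatcher stands in for dwdiff and words are split on whitespace; both are assumptions, so the result may differ slightly from what dwdiff reports. The function name word_accuracy is hypothetical.

```python
from difflib import SequenceMatcher


def word_accuracy(original: str, recognized: str) -> float:
    """Ratio of correctly recognized words to the words in the original."""
    original_words = original.split()
    recognized_words = recognized.split()
    if not original_words:
        raise ValueError("the original text contains no words")
    # Align the two word sequences; autojunk is disabled so that frequent
    # words are not ignored in longer texts.
    matcher = SequenceMatcher(None, original_words, recognized_words,
                              autojunk=False)
    # Each matching block is a run of words that appears in both texts.
    correct = sum(block.size for block in matcher.get_matching_blocks())
    return correct / len(original_words)
```

For example, if one of four words is misrecognized, word_accuracy("a b c d", "a x c d") yields 0.75.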

Processing time

The processing time of the OCR programs was measured with the command time. The times can be seen in the following diagram.

[Diagram: Processing time]
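A measurement similar to the one done with time can be sketched in Python on a Unix system. The sketch reports elapsed, user CPU and system CPU time for one child process; which of these values was used for the diagram is not stated, and the names Timing and time_command are hypothetical.

```python
import resource
import subprocess
import time
from typing import NamedTuple, Sequence


class Timing(NamedTuple):
    real: float    # elapsed wall-clock seconds
    user: float    # CPU seconds spent in user mode
    system: float  # CPU seconds spent in kernel mode


def time_command(argv: Sequence[str]) -> Timing:
    """Run a command and measure it roughly the way time does."""
    # RUSAGE_CHILDREN accumulates the CPU time of finished child processes,
    # so the difference before and after the run belongs to this command.
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    real = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return Timing(real,
                  after.ru_utime - before.ru_utime,
                  after.ru_stime - before.ru_stime)
```

A call such as time_command(["ocrad", "-o", "out.txt", "page.pnm"]) would return the three values for a single Ocrad run; the file names are placeholders.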

Conclusion

Only Tesseract was able to convert the images to text without too many errors. GOCR and Ocrad did not perform well and in some cases produced unusable text. Overall, the image resolution had little impact, and none at all for Tesseract (98% at every resolution). Ocrad possibly benefits from higher resolutions (69% at 150 dpi versus 83% at 300 and 600 dpi), whereas GOCR performed worst at the highest resolution (65% at 600 dpi). Ocrad was the fastest tool tested, but the extra processing time is better invested in Tesseract.

Copyright 2007-2012 Benjamin Eikel. Last modification: 2008-11-24T18:21:09+01:00