A text-scanning workflow

For a project in which dozens of printed pages had to be saved as text documents, I used the Tesseract OCR program together with the Netpbm library. Here’s what the three-step workflow looks like.

From print to an OpenOffice text document in three steps:
Note that a little command-line proficiency is required here.

  1. Scan all documents.
    The printed text consisted of A4 pages as well as small (A5) booklets.
    The A4 pages were scanned in full, the booklets two pages at a time. Scanner settings: 300 dpi, B/W (1 bit), saved as .tif files with a sequence number appended to the filename.

  2. Process the .tif files in batch with this shell script. Adjust it to your specific needs, e.g. the pixel size of the scanned images.

  3. Relying heavily on the spellchecker, use OpenOffice to clean up the raw .txt and save the final text as .odt.

Et voilà!
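
For readers who want a starting point for step 2, here is a minimal sketch of what such a batch step could look like. It is not the original script. It assumes the Dutch trained data is installed under the language code nld, that booklet spreads were scanned as A4 landscape (about 3508 pixels wide at 300 dpi, so 1754 pixels per booklet page), and it uses a hypothetical SPLIT switch to mark booklet scans. The names OCR_LANG, SPREAD_WIDTH, out_base, half_width, ocr_image and ocr_spread are made up for this example.

```shell
#!/bin/sh
# Batch OCR sketch. Assumptions (not taken from the original script):
# Dutch trained data installed as "nld", booklet spreads scanned as
# A4 landscape at 300 dpi, and a hypothetical SPLIT switch.

OCR_LANG="${OCR_LANG:-nld}"
SPLIT="${SPLIT:-0}"                    # set SPLIT=1 for two-page booklet scans
SPREAD_WIDTH="${SPREAD_WIDTH:-3508}"   # A4 long edge at 300 dpi, in pixels

# Strip directory and .tif suffix: scans/page-003.tif -> page-003
out_base() {
    basename "$1" .tif
}

# Width of one booklet page: half the spread width (integer division).
half_width() {
    echo $(( $1 / 2 ))
}

# OCR one image; tesseract writes <output base>.txt
ocr_image() {
    tesseract "$1" "$2" -l "$OCR_LANG"
}

# Cut a spread into left and right pages with Netpbm, then OCR each half.
# The half-page .tif files are written to the current directory.
ocr_spread() {
    base=$(out_base "$1")
    half=$(half_width "$SPREAD_WIDTH")
    tifftopnm "$1" | pamcut -left 0 -width "$half" | pnmtotiff > "$base-left.tif"
    tifftopnm "$1" | pamcut -left "$half" -width "$half" | pnmtotiff > "$base-right.tif"
    ocr_image "$base-left.tif" "$base-left"
    ocr_image "$base-right.tif" "$base-right"
}

# Process every file named on the command line; does nothing without arguments.
for scan in "$@"; do
    [ -f "$scan" ] || continue
    if [ "$SPLIT" = "1" ]; then
        ocr_spread "$scan"
    else
        ocr_image "$scan" "$(out_base "$scan")"
    fi
done
```

Saved as, say, ocr-batch.sh (a hypothetical name), it could be run as `sh ocr-batch.sh a4-*.tif` for the full pages and as `SPLIT=1 sh ocr-batch.sh booklet-*.tif` for the booklet spreads. If your scanner produces a different pixel size, set SPREAD_WIDTH accordingly.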

Software used
Tesseract 3.00, MacPorts package
Xdialog 2.3.1, also a MacPorts package

Tesseract trained data
Copied to /opt/local/share/tessdata
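
A quick way to confirm that the trained data ended up where Tesseract looks for it is sketched below. It assumes the Dutch data file is named nld.traineddata; the helper has_traineddata and the variable TESSDATA_DIR are hypothetical names used only for this example.

```shell
#!/bin/sh
# Check for a trained data file. The file name pattern <lang>.traineddata
# and the language code "nld" are assumptions; adjust them to your setup.

# Succeeds if <dir>/<lang>.traineddata exists.
has_traineddata() {
    [ -f "$1/$2.traineddata" ]
}

TESSDATA_DIR="${TESSDATA_DIR:-/opt/local/share/tessdata}"

if has_traineddata "$TESSDATA_DIR" nld; then
    echo "nld trained data found in $TESSDATA_DIR"
else
    echo "nld trained data missing from $TESSDATA_DIR" >&2
fi
```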

Sample text (Dutch)
The scanned original (text snippet taken from this lecture):

Scanned original

Tesseract OCR result:

Raw text

The final text after editing:

Final result