02 December 2011
A text scanning workflow
For a project that involved saving dozens of printed pages as text documents, I used the Tesseract OCR program together with the Netpbm library.
Here's what the workflow looks like: from print to an OpenOffice text document in three steps.
Note that a little command-line proficiency is required.
1) scan all documents
The printed text consisted of A4 pages as well as small (A5) booklets.
The A4 pages were scanned in full, booklets two pages at a time.
Scanner settings: 300 dpi, B/W (1 bit), saved as .tif files with a sequence number at the end of each file name.
2) process the .tif files in batch with this shell script. Adjust it to your specific needs, e.g. the pixel size of the scanned images.
3) relying heavily on the spellchecker, use OpenOffice to clean up the raw .txt and save the final text as .odt
Et voila!
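The batch script from step 2 isn't reproduced here, so what follows is only a minimal sketch of how such a script could look, not the original. It makes several assumptions of its own: booklet scans are landscape spreads with two A5 pages side by side, their file names start with booklet- (a made-up convention), everything else is a single page, and the text is Dutch (Tesseract language code nld). The Netpbm programs tifftopnm, pamcut and pnmtotiff do the splitting; the function names are just for illustration.

```shell
#!/bin/sh
# Sketch of a batch OCR run: split two-page booklet scans with Netpbm,
# then run Tesseract on every resulting page.
# Assumptions: 300 dpi scans, booklet spreads named booklet-*.tif,
# output written to the current directory.

LANG_CODE="${LANG_CODE:-nld}"   # Tesseract language code (assumed: Dutch)

# Convert a length in millimetres to pixels at a given dpi, rounded.
mm_to_px() {
    echo $(( ($1 * $2 * 10 + 127) / 254 ))
}

# Output base name for Tesseract, which appends .txt itself.
out_base() {
    basename "$1" .tif
}

# OCR one single-page .tif.
ocr_page() {
    tesseract "$1" "$(out_base "$1")" -l "$LANG_CODE"
}

# Split a two-page spread into a left and a right page, then OCR both.
ocr_spread() {
    half=$(mm_to_px 148 300)    # width of one A5 page at 300 dpi
    base=$(out_base "$1")
    # Left page: the first $half columns.
    tifftopnm "$1" | pamcut -left 0 -width "$half" | pnmtotiff > "${base}-a.tif"
    # Right page: from column $half to the right edge of the scan.
    tifftopnm "$1" | pamcut -left "$half" | pnmtotiff > "${base}-b.tif"
    ocr_page "${base}-a.tif"
    ocr_page "${base}-b.tif"
}

# Process every file given on the command line.
for f in "$@"; do
    case "$(basename "$f")" in
        booklet-*) ocr_spread "$f" ;;
        *)         ocr_page "$f" ;;
    esac
done
```

Saved under a name of your choice, say ocr-batch.sh, it could be run as sh ocr-batch.sh *.tif. The pixel arithmetic is the part to adjust: at 300 dpi an A4 page of 210 x 297 mm comes to about 2480 x 3508 pixels, and the split point has to match how your scanner actually positions the booklet.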
Software used
Tesseract 3.00, MacPorts package
XDialog 2.3.1, also a MacPorts package
Tesseract trained data
Copied to /opt/local/share/tessdata
Dutch
English
Hindi
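Before starting a batch run it can help to check that the language data is really where Tesseract looks for it. The sketch below assumes the Tesseract 3.x naming of one file per language, called after its language code (nld, eng and hin for the three languages above), with the extension .traineddata. The function names and the TESSDATA_DIR variable are my own, not part of Tesseract.

```shell
#!/bin/sh
# Check which Tesseract language files are present.
# Assumption: data files are named <code>.traineddata.

TESSDATA_DIR="${TESSDATA_DIR:-/opt/local/share/tessdata}"

# Succeeds if the trained data file for a language code exists.
has_lang() {
    [ -f "$TESSDATA_DIR/$1.traineddata" ]
}

# Print one status line per language code given as an argument.
report_langs() {
    for code in "$@"; do
        if has_lang "$code"; then
            echo "$code: installed"
        else
            echo "$code: missing"
        fi
    done
}

report_langs nld eng hin
```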
Sample text (Dutch)
The scanned original (text snippet taken from this lecture):

Tesseract OCR result:

The final text after editing:

Posted on 02 December 2011 22:20 | Comments (0)
01 January 2005
DTP with Scribus
For simple desktop publishing jobs you can use Scribus. The (Dutch) OS X Shortcuts reference was created with Scribus version 1.3.
Posted on 01 January 2005 11:03 | Comments (0)