02 December 2011

A text-scanning workflow

For a project that required saving dozens of printed pages as text documents, I used the Tesseract OCR program together with the Netpbm library.
Here's what the three-step workflow looks like.

From print to an OpenOffice text document in three steps.
Note that a little command-line proficiency is required here.

1) Scan all documents
The printed text consisted of A4 pages as well as small (A5) booklets.
The A4 pages were scanned in full, the booklets two pages at a time.
Scanner settings: 300 dpi, B/W (1 bit), saved as .tif files with a sequence number as filename suffix.

2) Process the .tif files in batch with this shell script. Adjust it to your specific needs, e.g. the pixel size of the scan images.

3) Relying heavily on the spellchecker, use OpenOffice to clean up the raw .txt and save the final text as .odt
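
The shell script from step 2 is not reproduced here. As a rough idea of what such a batch pass could look like, here is a minimal sketch. It is not the original script: the function names, the `_a`/`_b` output suffixes, the default scan width of 3508 pixels (A4 landscape at 300 dpi) and the language code `nld` for Dutch are all assumptions to adjust. It uses the Netpbm tools tifftopnm, pamcut and pnmtotiff to split a two-page booklet scan down the middle, then runs tesseract on each half.

```shell
#!/bin/sh
# Hypothetical sketch, not the original script: split two-page booklet
# scans into single pages and OCR each page with Tesseract.
# Assumes tifftopnm, pamcut, pnmtotiff (Netpbm) and tesseract are on PATH.

LANG_CODE=${LANG_CODE:-nld}     # assumed Tesseract language code (nld = Dutch)
SCAN_WIDTH=${SCAN_WIDTH:-3508}  # assumed pixel width: A4 landscape at 300 dpi

# Print the pixel column where the right-hand page starts.
half_width() {
    echo $(( $1 / 2 ))
}

# Split one two-page scan into a left and a right page, then OCR both.
ocr_booklet_scan() {
    tif=$1
    base=${tif%.tif}
    half=$(half_width "$SCAN_WIDTH")
    tifftopnm "$tif" > "$base.pnm" || return 1
    # Left page: columns 0 .. half-1. Right page: column half to the edge.
    pamcut -left 0 -width "$half" "$base.pnm" | pnmtotiff > "${base}_a.tif"
    pamcut -left "$half" "$base.pnm" | pnmtotiff > "${base}_b.tif"
    # Tesseract writes its result to <outputbase>.txt
    tesseract "${base}_a.tif" "${base}_a" -l "$LANG_CODE"
    tesseract "${base}_b.tif" "${base}_b" -l "$LANG_CODE"
    rm -f "$base.pnm"
}

# OCR a full-page A4 scan as is.
ocr_full_scan() {
    tesseract "$1" "${1%.tif}" -l "$LANG_CODE"
}

# Treat every .tif given on the command line as a booklet scan.
for tif in "$@"; do
    ocr_booklet_scan "$tif"
done
```

Saved under a name of your choice, the sketch would be called with the booklet scans as arguments; full A4 pages would go through ocr_full_scan instead. The resulting .txt files are the raw text for step 3.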

Et voilà!


Software used
Tesseract 3.00, MacPorts package
XDialog 2.3.1, also a MacPorts package

Tesseract trained data
Copied to /opt/local/share/tessdata
Dutch
English
Hindi
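
A quick way to check that the trained data landed in the right place is sketched below. The function name is made up for this example, and the language codes nld, eng and hin with the .traineddata file extension are assumptions about how the Dutch, English and Hindi files are named; adjust them to what you actually copied.

```shell
#!/bin/sh
# Hypothetical check: list the language codes whose trained data file
# is missing from the tessdata directory.

TESSDATA_DIR=${TESSDATA_DIR:-/opt/local/share/tessdata}

# Print each language code that has no <code>.traineddata file in the directory.
missing_langs() {
    dir=$1
    shift
    for code in "$@"; do
        [ -f "$dir/$code.traineddata" ] || echo "$code"
    done
}

# Assumed codes: nld = Dutch, eng = English, hin = Hindi.
missing_langs "$TESSDATA_DIR" nld eng hin
```

If nothing is printed, all three files are present.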

Sample text (Dutch)
The scanned original (text snippet taken from this lecture):

Scanned original

Tesseract OCR result:

Raw text

The final text after editing:

Final result

Posted on 02 December 2011 22:20

02 January 2006

Text editor Yudit


om purnamadah

To edit my Hindi word list, I use the open-source text editor Yudit. Its interface is plain and Spartan, but it does the job of typing Devanagari text.

Like most ported software, Yudit was installed using MacPorts.

2011 Update
Compile Yudit from source. The source can be downloaded from www.yudit.org



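$ cd /opt/local # use MacPorts' location
$ cp -R your_src_location/yudit-2.9.2 .
$ cd yudit-2.9.2 # configure lives inside the source directory
$ ./configure --prefix=/opt/local/yudit-2.9.2 # the option needs its leading dashes
$ make
$ make install
$ cd bin
$ ./yudit # check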


Posted on 02 January 2006 12:21

24 October 2005

Hindi vocabulary list

Back from the Landour Language School in Mussoorie. I copied the Hari Kitaab (Green Book) vocabulary lists up to and including lesson 18, typed them in Yudit and pasted them into an OpenOffice document.

Download the Hindi vocabulary (PDF 135KB)

Posted on 24 October 2005 05:28