{"id":41,"date":"2011-12-02T22:20:03","date_gmt":"2011-12-02T21:20:03","guid":{"rendered":"http:\/\/localhost\/~johan\/wordpress\/?p=41"},"modified":"2014-07-20T20:48:25","modified_gmt":"2014-07-20T18:48:25","slug":"a-textscanning-workflow","status":"publish","type":"post","link":"https:\/\/www.chaosgeordend.nl\/wordpress\/2011\/12\/02\/a-textscanning-workflow\/","title":{"rendered":"A textscanning workflow"},"content":{"rendered":"<p>For a project that involved saving dozens of printed pages as text documents, I used the <a href=\"https:\/\/code.google.com\/p\/tesseract-ocr\/\">Tesseract OCR<\/a> program together with the <a href=\"https:\/\/netpbm.sourceforge.net\/\">Netpbm<\/a> library. Here&#8217;s what the three-step workflow looks like.<\/p>\n<p><!--more--><\/p>\n<p>From print to an OpenOffice text document in three steps:<br \/>\nNote that a little command-line proficiency is required here.<\/p>\n<ol>\n<li>\n<p>scan all documents<br \/>\nThe printed text consisted of A4 pages as well as small (A5) booklets.<br \/>\nThe A4 pages were scanned in full, booklets two pages at a time. Scanner settings: 300 dpi, B\/W (1 bit), saved as .tif files with a <a href=\"https:\/\/www.chaosgeordend.nl\/mt-blog-cg\/images\/Example_Scan_Img_001.tif\">sequence-number-suffixed filename<\/a><\/p>\n<li>\n<p>process the .tif files in batch with this <a href=\"https:\/\/www.chaosgeordend.nl\/documents\/Tesseract OCR.command\">shell script<\/a>. Adjust it to your specific needs, e.g. 
the pixel size of the scan images.<\/p>\n<li>\n<p>heavily relying on the spellchecker, use OpenOffice to clean up the raw .txt and save the final text as .odt<\/p>\n<\/ol>\n<p>Et voilà!<\/p>\n<p><strong>Software used<\/strong><br \/>\nTesseract 3.00, <a href=\"https:\/\/www.macports.org\/ports.php\">MacPorts<\/a> package<br \/>\nXDialog 2.3.1, also a MacPorts package<\/p>\n<p><strong>Tesseract trained data<\/strong><br \/>\nCopied to \/opt\/local\/share\/tessdata<br \/>\n<a href=\"https:\/\/code.google.com\/p\/tesseract-ocr\/downloads\/detail?name=nld.traineddata.gz&#038;can=2&#038;q=\">Dutch<\/a><br \/>\n<a href=\"https:\/\/code.google.com\/p\/tesseract-ocr\/downloads\/detail?name=eng.traineddata.gz&#038;can=2&#038;q=\">English<\/a><br \/>\n<a href=\"https:\/\/code.google.com\/p\/tesseract-ocr\/downloads\/detail?name=tesseract-ocr-3.01.hin.tar.gz&#038;can=2&#038;q=\">Hindi<\/a><\/p>\n<p><strong>Sample text<\/strong> (Dutch)<br \/>\nThe scanned original: (text snippet taken from this <a href=\"https:\/\/www.home-academy.nl\/Webshop\/Product\/34?productCategoryId=3\">lecture<\/a>)<\/p>\n<div id=\"entry_img\"><img decoding=\"async\" alt=\"Scanned original\" src=\"https:\/\/www.chaosgeordend.nl\/mt-blog-cg\/images\/Example_Print.jpg\" \/><\/div>\n<p>Tesseract OCR result:<\/p>\n<div id=\"entry_img\"><img decoding=\"async\" alt=\"Raw text\" src=\"https:\/\/www.chaosgeordend.nl\/mt-blog-cg\/images\/Example_TXT.png\" \/><\/div>\n<p>The final text after editing:<\/p>\n<div id=\"entry_img\"><img decoding=\"async\" alt=\"Final result\" src=\"https:\/\/www.chaosgeordend.nl\/mt-blog-cg\/images\/Example_ODT.png\" \/><\/div>\n","protected":false},"excerpt":{"rendered":"<p>For a project that involved saving dozens of printed pages as text documents, I used the Tesseract OCR program together with the Netpbm library. 
Here&#8217;s what the three-step workflow looks like.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11,5,6,13],"tags":[],"class_list":["post-41","post","type-post","status-publish","format-standard","hentry","category-devanagari","category-dtp","category-oss","category-taallanguage"],"_links":{"self":[{"href":"https:\/\/www.chaosgeordend.nl\/wordpress\/wp-json\/wp\/v2\/posts\/41","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.chaosgeordend.nl\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.chaosgeordend.nl\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.chaosgeordend.nl\/wordpress\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.chaosgeordend.nl\/wordpress\/wp-json\/wp\/v2\/comments?post=41"}],"version-history":[{"count":11,"href":"https:\/\/www.chaosgeordend.nl\/wordpress\/wp-json\/wp\/v2\/posts\/41\/revisions"}],"predecessor-version":[{"id":118,"href":"https:\/\/www.chaosgeordend.nl\/wordpress\/wp-json\/wp\/v2\/posts\/41\/revisions\/118"}],"wp:attachment":[{"href":"https:\/\/www.chaosgeordend.nl\/wordpress\/wp-json\/wp\/v2\/media?parent=41"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.chaosgeordend.nl\/wordpress\/wp-json\/wp\/v2\/categories?post=41"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.chaosgeordend.nl\/wordpress\/wp-json\/wp\/v2\/tags?post=41"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}