trying to OCR a simple tif file with tesseract-ocr

Sat Mar 26 12:43:19 UTC 2011

Hi,

First, try pre-processing the TIFF file with unpaper. You'll need the
netpbm utilities to convert the TIFF to PGM and then back to TIFF.

Stage 1:

$ tifftopnm -respectfillorder -verbose file.tiff > out.pgm
$ unpaper -v --layout single out.pgm unpaper_out.pgm
$ pnmtotiff unpaper_out.pgm > unpaper_out.tiff

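Then OCR it with tesseract. Note that tesseract appends ".txt" to the
output name itself, so pass only the base name (this writes ocr.txt):

Stage 2:
$ tesseract unpaper_out.tiff ocr -l eng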
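
If you need to do this for more than one file, the two stages can be
wrapped in a small shell function. This is only a rough sketch: it
assumes tifftopnm, pnmtotiff, unpaper and tesseract are all on your
PATH, and the names ocr_base and ocr_tiff are made up for this example,
not part of any of those tools.

```shell
#!/bin/sh
# Sketch only: chain the netpbm -> unpaper -> tesseract steps for one file.
# Assumes tifftopnm, pnmtotiff, unpaper and tesseract are installed.

# Strip a trailing .tif/.tiff so the result can serve as tesseract's
# output base name (tesseract adds ".txt" on its own).
ocr_base() {
    case "$1" in
        *.tiff) printf '%s\n' "${1%.tiff}" ;;
        *.tif)  printf '%s\n' "${1%.tif}" ;;
        *)      printf '%s\n' "$1" ;;
    esac
}

# Hypothetical helper: run both stages on "$1", writing <base>.txt
# next to the input file.
ocr_tiff() {
    in=$1
    base=$(ocr_base "$in")
    tmp=$(mktemp -d) || return 1

    # Stage 1: TIFF -> PGM, clean up with unpaper, PGM -> TIFF.
    # Stage 2: OCR the cleaned image.
    tifftopnm -respectfillorder "$in" > "$tmp/out.pgm" &&
    unpaper --layout single "$tmp/out.pgm" "$tmp/unpaper_out.pgm" &&
    pnmtotiff "$tmp/unpaper_out.pgm" > "$tmp/unpaper_out.tiff" &&
    tesseract "$tmp/unpaper_out.tiff" "$base" -l eng
    status=$?

    # Remove the intermediate files whether or not the chain succeeded.
    rm -rf "$tmp"
    return $status
}
```

After sourcing the file, "ocr_tiff tess.tif" should leave the text in
tess.txt if every step succeeds; a non-zero return status means one of
the steps failed.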

On Sat, Mar 26, 2011 at 14:30, Robert P. J. Day <rpjday at crashcourse.ca> wrote:
>
>  as part of a larger workflow, i need to OCR process some simple text
> files and, as i read it, tesseract is the OCR tool of choice for
> ubuntu.  so i'm following the instructions here:
>
>  https://help.ubuntu.com/community/OCR
>
> but i can't seem to get any output.  i've taken a B/W screenshot of
> some text, it's saved in a grayscale/1-layer .tif file, at which point
> i run:
>
>  $ tesseract tess.tif output
>
> i get no diagnostics but the output file is always empty.  does
> someone have a suggestion for an absurdly simple example of using
> tesseract to get text out of a tif file?  i'm sure i'm missing
> something simple but critical.
>
> rday
>
> p.s.  i haven't changed any of tesseract's config file info since the
> documentation *seems* to suggest i don't need to, but i'm willing to
> be corrected on that.
>
> --
>
> ========================================================================
> Robert P. J. Day                               Waterloo, Ontario, CANADA
>                        http://crashcourse.ca
>
> Twitter:                                       http://twitter.com/rpjday
> LinkedIn:                               http://ca.linkedin.com/in/rpjday
> ========================================================================