trying to OCR a simple tif file with tesseract-ocr
Nicolae Ghimbovschi
xfreebird at gmail.com
Sat Mar 26 12:43:19 UTC 2011
Hi,
First, try pre-processing the TIFF file with unpaper. You'll need the
netpbm utilities to convert the TIFF to PGM and back to TIFF.
Stage 1:
$ tifftopnm -respectfillorder -verbose file.tiff > out.pgm
$ unpaper -v --layout single out.pgm unpaper_out.pgm
$ pnmtotiff unpaper_out.pgm > unpaper_out.tiff
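Then OCR it with tesseract. Note that the second argument is an output
base name; tesseract appends .txt itself, so this writes ocr.txt:
Stage 2:
$ tesseract unpaper_out.tiff ocr -l eng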
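If you need to do this for more than one file, the two stages can be wrapped in a small shell function. This is only a sketch: the function name ocr_tiff, its argument order and the temporary file names are made up for illustration, and it assumes netpbm, unpaper and tesseract are all installed and on your PATH.

```shell
#!/bin/sh
# Sketch only: ocr_tiff and its arguments are hypothetical names,
# not something shipped with tesseract or unpaper.
#
# Usage: ocr_tiff INPUT.tiff OUTPUT_BASE [LANG]
# tesseract appends .txt to OUTPUT_BASE on its own.
ocr_tiff() {
    in_tiff=$1          # input TIFF file
    out_base=$2         # output base name, without the .txt suffix
    lang=${3:-eng}      # language code, defaults to English

    # Keep intermediate files out of the working directory.
    tmpdir=$(mktemp -d) || return 1

    # Stage 1: TIFF -> PGM, clean up with unpaper, PGM -> TIFF.
    # Stage 2: OCR the cleaned image.
    # The chain stops at the first command that fails.
    tifftopnm -respectfillorder "$in_tiff" > "$tmpdir/in.pgm" &&
        unpaper --layout single "$tmpdir/in.pgm" "$tmpdir/clean.pgm" &&
        pnmtotiff "$tmpdir/clean.pgm" > "$tmpdir/clean.tiff" &&
        tesseract "$tmpdir/clean.tiff" "$out_base" -l "$lang"
    status=$?

    rm -rf "$tmpdir"
    return $status
}
```

After sourcing the file, something like `ocr_tiff file.tiff ocr` should leave the recognized text in ocr.txt, and a non-zero return value tells you one of the steps failed.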
On Sat, Mar 26, 2011 at 14:30, Robert P. J. Day <rpjday at crashcourse.ca> wrote:
>
> as part of a larger workflow, i need to OCR process some simple text
> files and, as i read it, tesseract is the OCR tool of choice for
> ubuntu. so i'm following the instructions here:
>
> https://help.ubuntu.com/community/OCR
>
> but i can't seem to get any output. i've taken a B/W screenshot of
> some text, it's saved in a grayscale/1-layer .tif file, at which point
> i run:
>
> $ tesseract tess.tif output
>
> i get no diagnostics but the output file is always empty. does
> someone have a suggestion for an absurdly simple example of using
> tesseract to get text out of a tif file? i'm sure i'm missing
> something simple but critical.
>
> rday
>
> p.s. i haven't changed any of tesseract's config file info since the
> documentation *seems* to suggest i don't need to, but i'm willing to
> be corrected on that.
>
> --
>
> ========================================================================
> Robert P. J. Day Waterloo, Ontario, CANADA
> http://crashcourse.ca
>
> Twitter: http://twitter.com/rpjday
> LinkedIn: http://ca.linkedin.com/in/rpjday
> ========================================================================