Portable scanner to text for Ubuntu 12.04

Marius Gedminas marius at pov.lt
Fri Jan 24 09:37:48 UTC 2014


On Thu, Jan 23, 2014 at 11:11:43AM -0500, John Hupp wrote:
> To convert that to text, you need OCR.  I do not currently know of
> any functional open source OCR software for Linux.  But I would be
> happy to be told that there is some.

The state of open-source OCR software is pitiful.  The least bad
solution I found was Tesseract:

  sudo apt-get install tesseract-ocr-eng

As far as I know it only accepts TIFF files as input, so you need to
convert other formats first, e.g. with ImageMagick:

  convert scanned-page.jpg scanned-page.tif

(My notes also indicate that I used GIMP for scanning and manually
cropped/rotated/converted the image to black and white before saving and
converting to TIFF.  I don't know if Tesseract can handle images that
aren't black and white.  Probably not.)
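
The manual GIMP steps could probably be scripted with ImageMagick
instead.  What follows is only a minimal sketch, assuming ImageMagick's
convert is installed.  The function name to_bw_tiff, the "-bw" file name
suffix and the 50% threshold are illustrative assumptions, not something
from the notes above, and the threshold will likely need tuning per scan.
Cropping and rotating are not covered.

```shell
#!/bin/sh
# Sketch: turn a scan into a black-and-white TIFF for OCR.
# Assumes ImageMagick's convert is on the PATH.

# to_bw_tiff INPUT [THRESHOLD]
# Writes INPUT-sans-extension-bw.tif and prints that file name.
# THRESHOLD defaults to 50% (a guess; tune it for your scans).
to_bw_tiff() {
    input=$1
    threshold=${2:-50%}
    output="${input%.*}-bw.tif"
    # Drop colour information first, then force every pixel to pure
    # black or pure white depending on which side of the threshold it is.
    convert "$input" -colorspace Gray -threshold "$threshold" "$output" &&
        printf '%s\n' "$output"
}
```

Calling "to_bw_tiff scanned-page.jpg" would then produce
scanned-page-bw.tif, and a darker or lighter cut-off can be tried with
e.g. "to_bw_tiff scanned-page.jpg 60%".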

Anyway, when you have a B&W TIFF image, you can do OCR on it:

  tesseract scanned-page.tif output-file

You'll end up with an output-file.txt that has the text (full of OCR
mistakes you need to fix manually).
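
With more than one page, the same command can be wrapped in a loop.
This is a rough sketch rather than a tested workflow: the function name
ocr_pages and the file names in the usage note are hypothetical, and it
relies only on the "tesseract input output-base" form shown above.

```shell
#!/bin/sh
# Sketch: run Tesseract over several TIFF pages in one go.

# ocr_pages FILE...
# For each FILE, writes FILE-sans-extension.txt and prints that name.
# Pages that fail are reported on stderr and skipped.
ocr_pages() {
    for tif in "$@"; do
        base="${tif%.*}"
        # tesseract appends .txt to the output base name itself
        if tesseract "$tif" "$base"; then
            printf '%s\n' "$base.txt"
        else
            printf 'OCR failed for %s\n' "$tif" >&2
        fi
    done
}
```

For example, "ocr_pages page-*.tif" would leave one .txt file next to
each page, which can then be joined with "cat page-*.txt > all.txt"
before fixing the OCR mistakes by hand.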

Sad.

If there's anything better out there, I'd love to know.

Marius Gedminas
-- 
The difference between Microsoft and 'Jurassic Parc':
In one, a mad businessman makes a lot of money with beasts that should be
extinct.
The other is a film.