Art Alexion on 31 May 2007 14:54:29 -0000



Re: [PLUG] pdf to text


On Thursday 31 May 2007 09:38, Joshua Karstendick wrote:
> I've never used it myself, but I heard that Google released an OCR
> software as open source, and it runs on Linux:
> http://code.google.com/p/tesseract-ocr/

Thanks.  I tried two approaches, tesseract and kooka.  Neither was worth the
effort on an 8-page contract.  I haven't used OCR since the mid 90s, and it
seems a decent typist is still more efficient at retyping the document than
at cleaning up the OCR output.

Some comments, though.

Tesseract is fast and entirely command-line driven, but it had two
disadvantages in comparison to kooka.

I think the source PDF was made by faxing the document to something that saved
the fax image to PDF.  There was no "text" in the PDF, only an image of the
text.  Both approaches start by running pdfimages on the PDF file, which
writes a pbm (or jpeg) for each page.  That step seems flawless.
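
For anyone repeating the extraction step, it could be scripted roughly as
follows.  This is only a sketch: it assumes pdfimages (from xpdf or poppler)
is on the PATH, and the function names and the "page" output prefix are made
up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, out_prefix, jpeg=False):
    """Build the argv for pdfimages.

    pdfimages writes <out_prefix>-NNN.pbm/.ppm by default; -j asks it to
    save JPEG-encoded images as .jpg files instead.
    """
    cmd = ["pdfimages"]
    if jpeg:
        cmd.append("-j")
    cmd += [str(pdf_path), str(out_prefix)]
    return cmd


def extract_page_images(pdf_path, out_dir):
    """Run pdfimages and return the extracted image files, sorted by name."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages not found on PATH")
    # "page" is a hypothetical prefix; any image root works.
    subprocess.run(pdfimages_command(pdf_path, out_dir / "page"), check=True)
    return sorted(out_dir.glob("page-*"))
```

Calling extract_page_images("contract.pdf", "pages") would leave one image
file per scanned page in the pages directory (the file names here are
placeholders).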

Kooka had two advantages.  First, tesseract only reads tiffs.  There are a lot
of command line programs for converting pbm to other formats, but I did not
find one for converting to tiff.  Hence, I had to open each page in the gimp
and save each one as a tiff.  Kooka could read the pbm directly.
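
The per-page conversion could probably be batched instead of done by hand.
The sketch below only builds the command lines; it assumes ImageMagick's
convert is installed (an assumption, not something tried in this thread) and
relies on tesseract's usual "tesseract IMAGE OUTPUTBASE" form, which writes
OUTPUTBASE.txt.  The function names are illustrative.

```python
from pathlib import Path


def tiff_conversion_command(image_path):
    """argv for ImageMagick's convert (assumed installed): page.pbm -> page.tif."""
    src = Path(image_path)
    return ["convert", str(src), str(src.with_suffix(".tif"))]


def tesseract_command(tiff_path):
    """argv for tesseract: the second argument is the output base name,
    so page.tif produces page.txt."""
    src = Path(tiff_path)
    return ["tesseract", str(src), str(src.with_suffix(""))]


def ocr_plan(image_paths):
    """Return the commands to run, in order: convert each page, then OCR it."""
    plan = []
    for path in image_paths:
        convert_cmd = tiff_conversion_command(path)
        plan.append(convert_cmd)
        # The last element of the convert command is the TIFF it produces.
        plan.append(tesseract_command(convert_cmd[-1]))
    return plan
```

Each entry of ocr_plan(...) can be passed to subprocess.run(cmd, check=True).
Older tesseract releases may be picky about the TIFF variant they accept, so
the convert options might need adjusting.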

Second, after scanning the image, kooka runs a spell check on the OCR output
and lets you clean it up interactively in the aspell interface.  This would
be great if it didn't flag 75% of the words.

Tesseract is touted for its accuracy.  In my experience, it is no more
accurate than kooka (which uses the ocrad engine).

For me, it will be quicker to open the PDF in an old Windows install that
has Acrobat 4 and make my comments there.  A disappointment.
>
> On 5/31/07, Art Alexion <art.alexion@verizon.net> wrote:
> > I need to make comments on a PDF I received.  I first tried using
> > pdftotext, but found that the text in PDF was a scanned image, so
> > pdftotext found no text to convert.
> >
> > I am thinking of two possible solutions -- neither of which may exist.
> >
> > Many years ago, I used win-based fax software that scanned fax images for
> > text and did an OCR.  Does anything like that -- the ability to scan and
> > OCR a pdf or extracted image -- exist for Linux?
> >
> > The other alternative is the ability to add comments as is possible with
> > the full acrobat windows/mac product.  Are there any tools available for
> > Linux that allow commenting?

-- 

_____________________________________________________________
Art Alexion

PGP fingerprint: 52A4 B10C AA73 096F A661  92D2 3B65 8EAC ACC5 BA7A
Keyserver: hkp://subkeys.pgp.net
The attachment - signature.asc - is my electronic signature; no need for 
alarm.  Info @ 
http://mysite.verizon.net/art.alexion/encryption/signature.asc.what.html
_____________________________________________________________


___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug