Art Alexion on 31 May 2007 14:54:29 -0000
On Thursday 31 May 2007 09:38, Joshua Karstendick wrote:
> I've never used it myself, but I heard that Google released an OCR
> software as open source, and it runs on Linux:
> http://code.google.com/p/tesseract-ocr/

Thanks. I tried two approaches, tesseract and kooka. Neither was worth the
effort on an 8-page contract. I haven't used OCR since the mid-90s, and it
seems a decent typist is still more efficient at retyping the document than
cleaning up the OCR output.

Some comments, though.

Tesseract is fast. Total command line interface. But it had two
disadvantages in comparison to kooka.

I think the source PDF was made by faxing the document to something that
saved the fax image to PDF. There was no "text" in the PDF; only the image
of text.

Both processes involve invoking pdfimages on the PDF file. This creates a
pbm (or jpeg) for each page. This step seems flawless.

Kooka had two advantages.

First, tesseract only reads tiffs. There are a lot of command line programs
for converting pbm to other formats, but I found none for converting to
tiff. Hence, I had to open each page in the GIMP and save each one as a
tiff. Kooka could read the pbm directly.

Second, upon scanning the image, kooka runs a spell check on the OCR output
and you can interactively clean it up in the aspell interface. This would
be great if it didn't find a problem with 75% of the words.

Tesseract claims kudos for accuracy. In my experience, it is no more
accurate than kooka (which uses the ocrad engine).

For me, it will be quicker to open the PDF in an old copy of Windows which
includes Acrobat 4 and make comments there.

A disappointment.

> On 5/31/07, Art Alexion <art.alexion@verizon.net> wrote:
> > I need to make comments on a PDF I received. I first tried using
> > pdftotext, but found that the text in PDF was a scanned image, so
> > pdftotext found no text to convert.
> >
> > I am thinking of two possible solutions -- neither of which may exist.
> >
> > Many years ago, I used win-based fax software that scanned fax images for
> > text and did an OCR. Does anything like that -- the ability to scan and
> > OCR a pdf or extracted image -- exist for Linux?
> >
> > The other alternative is the ability to add comments as is possible with
> > the full acrobat windows/mac product. Are there any tools available for
> > Linux that allow commenting?

--
_____________________________________________________________
Art Alexion

PGP fingerprint: 52A4 B10C AA73 096F A661 92D2 3B65 8EAC ACC5 BA7A
Keyserver: hkp://subkeys.pgp.net
The attachment - signature.asc - is my electronic signature; no need for
alarm. Info @
http://mysite.verizon.net/art.alexion/encryption/signature.asc.what.html
_____________________________________________________________

Attachment:
pgp4UTBxcVVVI.pgp

___________________________________________________________________________
Philadelphia Linux Users Group -- http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug
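
For anyone who wants to retry the workflow from the message above without
saving every page by hand, here is a rough Python sketch of the same three
steps: extract the page images with pdfimages, convert each one to tiff,
and run tesseract on the result. It assumes ImageMagick's convert is
installed and can write a tiff from a pbm, which the original message did
not use. The function names and the "pages" directory are invented for
illustration. It only automates the file shuffling and does nothing for
OCR accuracy.

```python
import subprocess
from pathlib import Path


def extract_cmd(pdf: str, prefix: str) -> list[str]:
    # pdfimages writes one file per embedded image: <prefix>-000.pbm, ...
    return ["pdfimages", pdf, prefix]


def to_tiff_cmd(image: Path) -> list[str]:
    # Assumption: ImageMagick's `convert` is on the PATH; it picks the
    # output format from the ".tif" extension.
    return ["convert", str(image), str(image.with_suffix(".tif"))]


def ocr_cmd(tiff: Path) -> list[str]:
    # tesseract takes an output base name and appends ".txt" itself.
    return ["tesseract", str(tiff), str(tiff.with_suffix(""))]


def ocr_pdf(pdf: str, workdir: str = "pages") -> list[Path]:
    """Run the three steps in order; return the text files tesseract wrote."""
    out = Path(workdir)  # hypothetical scratch directory
    out.mkdir(exist_ok=True)
    subprocess.run(extract_cmd(pdf, str(out / "page")), check=True)

    texts = []
    # pdfimages may emit .pbm, .pgm or .ppm depending on the image type.
    for image in sorted(out.glob("page-*.p[bgp]m")):
        subprocess.run(to_tiff_cmd(image), check=True)
        tiff = image.with_suffix(".tif")
        subprocess.run(ocr_cmd(tiff), check=True)
        texts.append(tiff.with_suffix(".txt"))
    return texts
```

Calling ocr_pdf("contract.pdf") should leave one .txt file per page next to
the extracted images, provided all three tools are on the PATH. The output
would still need the same cleanup described above.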