Jeff Abrahamson on 30 Jul 2007 08:47:31 -0000

On Sun, Jul 29, 2007 at 06:20:55PM -0400, TuskenTower wrote:
> Jeff, I don't have a suggestion, but do you mind sharing your PERL
> and lex magic? I have looked at PDFs briefly to do some text
> extraction but couldn't find what I wanted. So I manually copied
> data out in 3 hrs.

I've attached the Perl script that I wrote in two minutes. Much
easier. ;-)

The key (which took an additional ten minutes) is knowing that notes
are stored in blocks that look like "Contents(note goes here)". I found
this by copying the PDF to a Mac, looking at it in Acrobat, then, back
on my Linux box, opening the PDF in emacs and searching for some of the
comment strings that I had read on the Mac. The pattern was quickly
obvious. From there, writing the Perl took < 2 min.
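
A minimal Python sketch of this kind of extraction, for readers who want to try the idea. It is not the attached Perl script, which is not reproduced in this message. The function name `extract_notes` is made up for illustration, and the sketch assumes the notes sit uncompressed in the raw file as `Contents(...)` and that reading the bytes as Latin-1 is good enough.

```python
import re
from typing import List

# Assumption: each note appears literally as Contents(note goes here)
# in the raw bytes of the PDF. The non-greedy (.*?) stops at the first
# close paren, so a note that itself contains ")" comes out truncated.
NOTE_RE = re.compile(rb"Contents\((.*?)\)", re.DOTALL)


def extract_notes(pdf_bytes: bytes) -> List[str]:
    """Return the text of every Contents(...) block found in pdf_bytes."""
    # Latin-1 maps every byte to a character, so decoding never fails.
    return [raw.decode("latin-1") for raw in NOTE_RE.findall(pdf_bytes)]


if __name__ == "__main__":
    import sys

    # Usage: python extract_notes.py some-file.pdf
    with open(sys.argv[1], "rb") as handle:
        for note in extract_notes(handle.read()):
            print(note)
```

If the notes are stored compressed or in a different string encoding, this finds nothing, and opening the file in an editor as described above is the quickest way to check.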

As noted, I didn't write the lex version. I'd rather check the odd
note that comes out malformed because of a close paren:

    Contents(Look at the cat (fido) first.)
    ==>
    Look at the cat (fido

and I'd know to look further.

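
For anyone who would rather not check the odd truncated note by hand, here is a hedged Python sketch of a paren-counting scan. It is not the lex version mentioned above (which was never written), and the names `naive_note` and `balanced_notes` are hypothetical. It assumes nested parens inside a note are balanced and that a backslash escapes the character after it.

```python
import re
from typing import List, Optional


def naive_note(text: str) -> Optional[str]:
    """First-close-paren match; truncates notes that contain ')'."""
    match = re.search(r"Contents\((.*?)\)", text)
    return match.group(1) if match else None


def balanced_notes(text: str, marker: str = "Contents(") -> List[str]:
    """Collect each note by counting parens until the opening one closes."""
    notes = []
    pos = 0
    while True:
        start = text.find(marker, pos)
        if start == -1:
            break
        i = start + len(marker)
        depth = 1  # the "(" at the end of the marker
        chars = []
        while i < len(text):
            c = text[i]
            if c == "\\" and i + 1 < len(text):
                # Keep the character after a backslash as-is. This handles
                # \( and \); other escapes such as \n are not decoded.
                chars.append(text[i + 1])
                i += 2
                continue
            if c == "(":
                depth += 1
            elif c == ")":
                depth -= 1
                if depth == 0:
                    break
            chars.append(c)
            i += 1
        # An unterminated note is kept as whatever was read so far.
        notes.append("".join(chars))
        pos = i + 1
    return notes


example = "Contents(Look at the cat (fido) first.)"
print(naive_note(example))      # Look at the cat (fido
print(balanced_notes(example))  # ['Look at the cat (fido) first.']
```

The trade-off is the one described above: the one-line pattern is quicker to write, and the scan only pays off when enough notes contain parentheses.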
On Sun, Jul 29, 2007 at 08:03:56PM -0400, Mag Gam wrote:
> Can you provide us with a sample pdf with comments? I am not sure I
> have any....

Look at http://mst.cs.drexel.edu/jeff/phd-3.pdf (it will be
auto-deleted in two weeks). It's a copy of a draft of my PhD
dissertation. Enjoy. ;-)

--
Jeff
Jeff Abrahamson <http://jeff.purple.com/>
phone: +33 06 21.83.26.20 (From U.S.: 011-33-6-2183-2620)
GPG fingerprint: 1A1A BA95 D082 A558 A276 63C6 16BF 8C4C 0D1D AE4B
Attachment: signature.asc
___________________________________________________________________________
Philadelphia Linux Users Group -- http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug