Jeff Abrahamson on 30 Jul 2007 08:47:31 -0000
On Sun, Jul 29, 2007 at 06:20:55PM -0400, TuskenTower wrote:
> [45 lines, 264 words, 1911 characters] Top characters: _tonieal
>
> Jeff, I don't have a suggestion, but do you mind sharing your PERL
> and lex magic? I have looked at PDFs briefly to do some text
> extraction but couldn't find what I wanted. So I manually copied
> data out in 3 hrs.

I've attached the perl script that I wrote in two minutes. Much
easier. ;-)

The key (an additional ten minutes) is to know that notes are in
blocks that look like "Contents(note goes here)". I found this by
copying the pdf to a Mac, looking at it in acrobat, then, back on my
linux box, I emacsed the pdf and searched for some of the comment
strings that I had read on the Mac. The pattern was quickly obvious.
From there, writing the perl took < 2 min.

As noted, I didn't write the lex version. I'd rather check the odd
note that comes out malformed because of a close paren.

    Contents(Look at the cat (fido) first.)
    ==> Look at the cat (fido

and I'd know to look further.

On Sun, Jul 29, 2007 at 08:03:56PM -0400, Mag Gam wrote:
> [62 lines, 342 words, 2148 characters] Top characters: _toneias
>
> Can you provide us with a sample pdf with comments? I am not sure I
> have any....

Look at http://mst.cs.drexel.edu/jeff/phd-3.pdf (auto-delete in 2
weeks). It's a copy of a draft of my PhD dissertation. Enjoy. ;-)

--
Jeff

Jeff Abrahamson <http://jeff.purple.com/>
phone: +33 06 21.83.26.20 (From U.S.: 011-33-6-2183-2620)
GPG fingerprint: 1A1A BA95 D082 A558 A276 63C6 16BF 8C4C 0D1D AE4B

Attachment: signature.asc

___________________________________________________________________________
Philadelphia Linux Users Group -- http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug
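
For readers without the attachment: the attached Perl script is not reproduced in this archive, so what follows is only a rough Python sketch of the approach the message describes, not the author's script. It assumes the notes appear uncompressed in the PDF as "Contents(note goes here)"; the function names are made up for illustration. The first function stops at the first close paren, which reproduces the truncated "(fido" output shown in the message. The second counts nested parens, which is roughly what a lex version might do.

```python
import re

# Naive pattern: capture everything between "Contents(" and the FIRST ")".
# Assumption: the annotation text is stored uncompressed in the file.
NAIVE = re.compile(rb"Contents\(([^)]*)\)")


def extract_notes_naive(data):
    """Return each note, truncated at the first close paren."""
    return [m.group(1).decode("latin-1") for m in NAIVE.finditer(data)]


def extract_notes_balanced(data):
    """Return each note, matching nested parens and skipping escapes."""
    notes = []
    marker = b"Contents("
    pos = data.find(marker)
    while pos != -1:
        start = i = pos + len(marker)
        depth = 1  # we are already inside the opening paren
        while i < len(data) and depth > 0:
            c = data[i:i + 1]
            if c == b"\\":
                i += 2  # skip the backslash and the character it escapes
                continue
            if c == b"(":
                depth += 1
            elif c == b")":
                depth -= 1
            i += 1
        if depth == 0:
            # i now points just past the matching close paren
            notes.append(data[start:i - 1].decode("latin-1"))
        pos = data.find(marker, i)
    return notes
```

To try it, read the file as bytes, e.g. open("phd-3.pdf", "rb").read(), and pass the result to either function. Comparing the two outputs shows which notes the naive version cut short.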