Jeff Abrahamson on 30 Jul 2007 08:47:31 -0000



Re: [PLUG] regex


On Sun, Jul 29, 2007 at 06:20:55PM -0400, TuskenTower wrote:
> Jeff, I don't have a suggestion, but do you mind sharing your PERL
> and lex magic?  I have looked at PDFs briefly to do some text
> extraction but couldn't find what I wanted.  So I manually copied
> data out in 3 hrs.

I've attached the Perl script that I wrote in two minutes.  Much
easier. ;-)

The key (an additional ten minutes) is to know that notes are in
blocks that look like "Contents(note goes here)".  I found this by
copying the PDF to a Mac and looking at it in Acrobat; then, back on
my Linux box, I opened the PDF in emacs and searched for some of the
comment strings that I had read on the Mac.  The pattern was quickly
obvious.  From there, writing the Perl took < 2 min.
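
For anyone reading without the attachment, here is a minimal sketch of
the same idea in Python.  It is not the attached Perl script, just an
illustration.  It assumes the notes sit uncompressed in the file as
"Contents(...)" and it deliberately stops at the first close paren.
The name extract_notes is made up for this example.

```python
import re
import sys

# Naive pattern: everything after "Contents(" up to the first ")".
# Works on raw bytes because a PDF is not guaranteed to be valid text.
NOTE_RE = re.compile(rb"Contents\(([^)]*)\)")


def extract_notes(pdf_bytes):
    """Return the text of each Contents(...) block found in raw PDF bytes."""
    # latin-1 maps every byte to a character, so decoding never fails.
    return [m.group(1).decode("latin-1") for m in NOTE_RE.finditer(pdf_bytes)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python notes.py some-file.pdf
    with open(sys.argv[1], "rb") as fh:
        for note in extract_notes(fh.read()):
            print(note)
```

Notes stored in compressed streams or as hex strings would not be
found this way; the sketch only covers the plain case described above.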

As noted, I didn't write the lex version.  I'd rather check the odd
note that comes out malformed because of a close paren.

    Contents(Look at the cat (fido) first.)

  ==>

    Look at the cat (fido

and I'd know to look further.
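
One way to spot those odd notes automatically, sketched in Python as
an assumption rather than anything in the attached script: a note cut
short at a close paren usually has more "(" than ")".  The name
suspicious_notes is hypothetical.

```python
import re

# Same naive pattern: capture up to the first ")" after "Contents(".
NOTE_RE = re.compile(r"Contents\(([^)]*)\)")


def suspicious_notes(text):
    """Return extracted notes with more '(' than ')', i.e. probably truncated."""
    notes = NOTE_RE.findall(text)
    return [n for n in notes if n.count("(") > n.count(")")]


sample = "Contents(Look at the cat (fido) first.) Contents(A plain note.)"
for note in suspicious_notes(sample):
    print("check by hand:", note)
```

This only flags candidates; the full text of a flagged note still has
to be read from the PDF by hand.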


On Sun, Jul 29, 2007 at 08:03:56PM -0400, Mag Gam wrote:
> Can you provide us with a sample pdf with comments? I am not sure I
> have any....

Look at http://mst.cs.drexel.edu/jeff/phd-3.pdf (auto-delete in 2 weeks).

It's a copy of a draft of my PhD dissertation.  Enjoy. ;-)

-- 
 Jeff

 Jeff Abrahamson  <http://jeff.purple.com/>
 phone: +33 06 21.83.26.20           (From U.S.: 011-33-6-2183-2620)
 GPG fingerprint: 1A1A BA95 D082 A558 A276  63C6 16BF 8C4C 0D1D AE4B


___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug