JP Vossen on 21 Oct 2009 22:00:54 -0700


[Date Prev] [Date Next] [Thread Prev] [Thread Next] [Date Index] [Thread Index]

Re: [PLUG] PCRE is_a_regex?


> Date: Wed, 21 Oct 2009 23:20:29 -0400
> From: Walt Mankowski <waltman@pobox.com>
> 
> On Wed, Oct 21, 2009 at 03:50:01PM -0400, JP Vossen wrote:
>> I have 300K+ valid regular expressions.  I am abstracting out one more 
>> level on about 100K of them, and instead of being matched as regexps, 
>> they will be matched as static strings in a hash table (in Java). 
> 
> Are you doing that because the hash table lookup is faster?  It's not
> obvious to me that it should be.  Doing a regex comparison on a plain
> string should be pretty close to just checking if the two strings are
> equal, and that's going to be O(n).  A hash lookup also has to be O(n)
> since you've got to read the string to hash it.

We've already read the string, now we're trying to figure out what it 
is.  A hash lookup will be faster to fail than trying 300K+ regex 
matches and not hitting on any, and it will be pretty close to faster in 
the best case...  But the more important aspect is RAM use, we're close 
to 1G RAM for 300K+ compiled regexps (Java/Jakarta ORA, not Perl).  RAM 
upgrades for many hundreds of devices in the field is not an option. 
The hash solution takes almost no RAM when compared to the compiled regexps.


So far I've now got this, which is "working" for my data-set, for some 
value of "working" that needs more testing.  It is NOT a universal 
will-always-work solution, though.

m/(?:^|[^\\])(?:\\[dDwWsShHvVRCpPbBAzZG]{1}|[[{])/;
     # Ether start-of-line or NOT \
     # then \ followed by one of the meta-chars or anchors
     #   or [ or {
     #
     # The regex above is derived from the following:
     #    http://perldoc.perl.org/perlreref.html#CHARACTER-CLASSES
     #    http://perldoc.perl.org/perlreref.html#ANCHORS
     # Started trying to use \b but that missed a lot of stuff that was
     # not at a "word" boundary.


Thanks,
JP
----------------------------|:::======|-------------------------------
JP Vossen, CISSP            |:::======|      http://bashcookbook.com/
My Account, My Opinions     |=========|      http://www.jpsdomain.org/
----------------------------|=========|-------------------------------
"Microsoft Tax" = the additional hardware & yearly fees for the add-on
software required to protect Windows from its own poorly designed and
implemented self, while the overhead incidentally flattens Moore's Law.
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug