JP Vossen on 21 Oct 2009 12:50:09 -0700



Re: [PLUG] PCRE is_a_regex?


Date: Wed, 21 Oct 2009 14:45:25 -0400
From: Walt Mankowski <waltman@pobox.com>

 > The way you describe it, it sounds impossible.  As you said, every
 > normal string is already a valid regex, and many regex sequences can
 > occur normally within strings.  Short of writing your own regex
 > parser, it seems like the best you can do is search for some common
 > patterns that occur in regexes.

Yeah, the latter is what I'm trying to do now.  It'll probably work for 
my data-set, but I was hoping to find a more elegant solution.

So far I have this, which is NOT even close to universal and does NOT 
even work on 100% of my data yet.

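if ( $pattern =~ m/\\[dDwWsShHvVRCpPbBAzZG]/ ) {
     # $pattern is a placeholder name for the candidate string.
     # No leading \b here: it would require a word character right
     # before the backslash, so "\s" after a space would be missed.
     # The regex above is derived from the following:
     #    http://perldoc.perl.org/perlreref.html#CHARACTER-CLASSES
     #    http://perldoc.perl.org/perlreref.html#ANCHORS
}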
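
A minimal sketch comparing this escape-class check with and without a 
leading \b, run against sample strings of the kind discussed in this 
thread.  The helper names has_escape_with_b and has_escape are made up 
for illustration, and neither is a complete regex detector.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only.

# Variant with a leading \b: needs a word character just before the backslash.
sub has_escape_with_b {
    my ($s) = @_;
    return $s =~ m/\b\\[dDwWsShHvVRCpPbBAzZG]/ ? 1 : 0;
}

# Variant without \b: any backslash followed by one of the listed letters.
sub has_escape {
    my ($s) = @_;
    return $s =~ m/\\[dDwWsShHvVRCpPbBAzZG]/ ? 1 : 0;
}

for my $s ( 'foo: \s+bar baz \!', 'foo\d+', 'foo: bar baz \!' ) {
    printf "%-22s with_b=%d without_b=%d\n",
        $s, has_escape_with_b($s), has_escape($s);
}
```

For 'foo: \s+bar baz \!' only the variant without \b fires, because the 
backslash follows a space and \b needs a word character on one side.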

I was hoping for a more subtle trick, such as the following.  There is 
an ugly, ugly hack when grepping the process list to avoid your grep:
	ps auwx | grep 'foo' | grep -v 'grep'

The better way to do that is:
	ps auwx | grep '[f]oo'

The regex '[f]oo' matches the string 'foo', but it does not match the 
literal text '[f]oo' on grep's own command line, so grep sees the "foo" 
process you want but does not see itself.
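
The same effect can be reproduced without ps.  A minimal Perl sketch, 
where $pattern and the two sample process lines are made up for 
illustration:

```perl
use strict;
use warnings;

my $pattern = '[f]oo';    # the text that appears on grep's own command line

# Made-up stand-ins for two lines of ps output.
my @ps_lines = ( '/usr/bin/foo --daemon', "grep $pattern" );

# Keep only the lines the pattern matches when it is used as a regex.
my @hits = grep { /$pattern/ } @ps_lines;

print "$_\n" for @hits;
```

Only the foo line survives; the line holding the pattern's own text is 
dropped.  This is not a general "is it a regex?" test, though: a pattern 
such as 'a+' does match its own text.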

I want some Perl trickery that does something like that.  So far I 
haven't figured it out, so I'm falling back to brute force to get the 
code written.  :-(


 > But I'm not really clear why you need to do this in the first place.
 > Since static strings are valid regular expressions, what's the harm in
 > just treating everything as a regex?  Also, since the users entering
 > these strings presumably know whether or not they're supposed to be
 > regexes, isn't there any way you can get them to indicate it in the
 > data somehow?

I have 300K+ valid regular expressions.  I am abstracting out one more 
level on about 100K of them, and instead of being matched as regexps, 
they will be matched as static strings in a hash table (in Java). 
Something like "foo: \s+bar baz \!" cannot be represented accurately as 
a static string, whereas "foo: bar baz \!" is fine.  So I need to be 
able to tell the difference, convert the static ones into hash keys, and 
ignore the ones that must remain regexps.  And the 100K source regexps 
are for matching Snort signatures, which in part may include PCRE.  Are 
we having fun yet?  (Actually, yes.  Sick, ain't it? :)

So in effect, *I'm* the user that needs to go into the 100K records and 
flag them as regexp or string (hash).
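
One way the flagging could be sketched in Perl is to walk each pattern 
character by character.  Everything here is an assumption for 
illustration: the helper name static_key, the $META list, and the premise 
that no modifiers such as /i or /x apply.  It is deliberately 
conservative, so anything doubtful stays a regexp.

```perl
use strict;
use warnings;

# Characters treated as regex syntax when they appear unescaped.
# Deliberately conservative: ']' and '}' are included as well.
my $META = '.^$|()[]{}*+?';

# Hypothetical helper: return the literal string a pattern stands for,
# or undef (in scalar context) if the pattern uses any regex construct.
sub static_key {
    my ($pat) = @_;
    my @chars = split //, $pat;
    my $key   = '';
    while (@chars) {
        my $c = shift @chars;
        if ( $c eq '\\' ) {
            my $next = shift @chars;
            # Backslash + non-word character is that literal character;
            # backslash + letter or digit (\s, \d, \1 ...) is regex syntax.
            return unless defined $next && $next =~ /\W/;
            $key .= $next;
        }
        elsif ( index( $META, $c ) >= 0 ) {
            return;    # unescaped metacharacter
        }
        else {
            $key .= $c;
        }
    }
    return $key;
}

# Split the two sample patterns into hash keys and leftover regexps.
my ( %static, @regexps );
for my $pat ( 'foo: \s+bar baz \!', 'foo: bar baz \!' ) {
    my $key = static_key($pat);
    if ( defined $key ) { $static{$key} = $pat }
    else                { push @regexps, $pat }
}
```

With the two examples above, 'foo: bar baz \!' becomes the hash key 
'foo: bar baz !' and the \s+ pattern stays in @regexps.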


 > Finally, I'm confused by your use of "PCRE".  Do you mean "perl
 > regular expressions", or the PCRE library for "Perl Compatible
 > Regular Expressions"?

Yes.

My development work is in Perl (and the Regex Coach), but some of the 
production parts run in the Java/Jakarta ORO PCRE lib.  Sigh.

Thanks,
JP

PS--Jeff, sure I've got lots more problems than these, some of which are 
even treatable.  It's all about the context man!
----------------------------|:::======|-------------------------------
JP Vossen, CISSP            |:::======|      http://bashcookbook.com/
My Account, My Opinions     |=========|      http://www.jpsdomain.org/
----------------------------|=========|-------------------------------
"Microsoft Tax" = the additional hardware & yearly fees for the add-on
software required to protect Windows from its own poorly designed and
implemented self, while the overhead incidentally flattens Moore's Law.
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug