Kyle R . Burton on Wed, 7 Nov 2001 12:53:24 -0500


[Date Prev] [Date Next] [Thread Prev] [Thread Next] [Date Index] [Thread Index]

Re: generating regexes?


> this code looks very interesting.  here's a thought off the
> top of my head, dating from 1999 when I was doing coursework
> in a related field.
> 
> i wonder if it would be feasible to evolve your regular
> expressions as genetic algorithms in the usual hill-climbing
> way.  instead of refining a single pattern, try keeping a
> stable of, say, a hundred possible patterns that each may
> match only a certain subset of the input data.  refine each
> pattern through random mutations, so that a pattern from a
> parent generation produces more than one child.  the
> selection pressure would be based primary on success against
> the input data space and secondarily on length of regexp.


Actualy, I was thinking of doing something genetic, but I decided to try
the approach in the linked to code because I was able to think it up and
write it in about an hour's worth of time, and it works for most of the
situations where I currently have a need for this type of thing.  The genetic 
approaches I was considering would have taken alot more effort to write 
(partly because I have very little knowledge/experience with genetic 
programming).

My current approach is extremly simplistic, goes from the left, character
by character and selects what it think is the most specific candidate regex
that matches all the examples.  It performs less well when the examples
are of varaible length, and makes no efforts to recognize stemming between
the examples.

I'd think a GP approach would better overall, but I don't know a good way 
to go about it.  It would require somehow determining a set of starting
regexes to use right?  I guess generating the initial set shouldn't be too
hard...survival critera should center around how many of the examples it
matches, but somehow I'd want it to create general, but fairly specific
regexes.  For phone numbers, I'd rather have:

  \(?\d{3}\) \d{3}-\d{3}

than:

  .+\s\d+-\d+

or, worse:

  .+


k

-- 

------------------------------------------------------------------------------
Just get rid of the false and you will
automaticly realize the true.
        -- Ho-Shan
mortis@voicenet.com                            http://www.voicenet.com/~mortis
------------------------------------------------------------------------------
**Majordomo list services provided by PANIX <URL:http://www.panix.com>**
**To Unsubscribe, send "unsubscribe phl" to majordomo@lists.pm.org**