Jeff Abrahamson on 2 Mar 2004 14:42:04 -0000


[Date Prev] [Date Next] [Thread Prev] [Thread Next] [Date Index] [Thread Index]

Re: pattern finding problem


On Tue, Mar 02, 2004 at 08:01:07AM -0500, Aaron J. Mackey wrote:
>   [25 lines, 164 words, 964 characters]  Top characters: teionasl
> 
> 
> On the BioPerl mailing list we often get requests like the following:
> 
> Within a given biosequence with length X, find substrings of min. 
> length A and max. length B that contain the pattern P at least C times 
> but no more than D times.
> 
> A more concrete example: Find all substrings 12 characters long (A = B 
> = 12) that have at least 7 (C = 7, D = 12 implictly) 'I' or 'L' 
> characters (P = [IL]) in it.
> 
> The naive approach is a "sliding window" method, but it seems to me 
> that a pattern matching approach would be more efficient.  And it 
> sounds like a great little challenge for the brilliant minds of FWP.  
> The "best" version will find it's way into a BioPerl module (with 
> appropriate attribution, of course).  Golfing is not the goal here (but 
> Golf-ed solutions are still welcome, if you must).

Do you allow overlapping patterns?  For example, if you are looking
for 1212, does the following string contain three instances or only
two?

    12121212

Do you allow intervening characters?  Searching for 12, do you match
on this?

    132

Are you doing full regex matching, subsequence matching, or simple
string matching?  If the latter, I guess this is easy using KMP and
can be done in linear time, so you must be asking about regex...


In the regex case you can simplify the problem by creating a new
alphabet with symbols P and Q.  Create a new string S on the new
alphabet, where the k-th symbol of S is P if an instance of your
pattern (of length less than B) begins at location k in the original
string.  Otherwise symbol k is Q.  With the symbol at position k store
the length of the shortest instance of the pattern that begins at
position k.

The problem on the new alphabet is a variant of ordinary string
matching.

-- 
 Jeff

 Jeff Abrahamson  <http://www.purple.com/jeff/>
 GPG fingerprint: 1A1A BA95 D082 A558 A276  63C6 16BF 8C4C 0D1D AE4B
-
**Majordomo list services provided by PANIX <URL:http://www.panix.com>**
**To Unsubscribe, send "unsubscribe phl" to majordomo@lists.pm.org**