Jeff Abrahamson on 2 Mar 2004 14:42:04 -0000 |
On Tue, Mar 02, 2004 at 08:01:07AM -0500, Aaron J. Mackey wrote: > [25 lines, 164 words, 964 characters] Top characters: teionasl > > > On the BioPerl mailing list we often get requests like the following: > > Within a given biosequence with length X, find substrings of min. > length A and max. length B that contain the pattern P at least C times > but no more than D times. > > A more concrete example: Find all substrings 12 characters long (A = B > = 12) that have at least 7 (C = 7, D = 12 implictly) 'I' or 'L' > characters (P = [IL]) in it. > > The naive approach is a "sliding window" method, but it seems to me > that a pattern matching approach would be more efficient. And it > sounds like a great little challenge for the brilliant minds of FWP. > The "best" version will find it's way into a BioPerl module (with > appropriate attribution, of course). Golfing is not the goal here (but > Golf-ed solutions are still welcome, if you must). Do you allow overlapping patterns? For example, if you are looking for 1212, does the following string contain three instances or only two? 12121212 Do you allow intervening characters? Searching for 12, do you match on this? 132 Are you doing full regex matching, subsequence matching, or simple string matching? If the latter, I guess this is easy using KMP and can be done in linear time, so you must be asking about regex... In the regex case you can simplify the problem by creating a new alphabet with symbols P and Q. Create a new string S on the new alphabet, where the k-th symbol of S is P if an instance of your pattern (of length less than B) begins at location k in the original string. Otherwise symbol k is Q. With the symbol at position k store the length of the shortest instance of the pattern that begins at position k. The problem on the new alphabet is a variant of ordinary string matching. -- Jeff Jeff Abrahamson <http://www.purple.com/jeff/> GPG fingerprint: 1A1A BA95 D082 A558 A276 63C6 16BF 8C4C 0D1D AE4B - **Majordomo list services provided by PANIX <URL:http://www.panix.com>** **To Unsubscribe, send "unsubscribe phl" to majordomo@lists.pm.org**
|
|