Mark Dominus on 27 Sep 2005 14:03:06 -0000


[Date Prev] [Date Next] [Thread Prev] [Thread Next] [Date Index] [Thread Index]

Searching packed data


A guy in comp.lang.perl.moderated has a pretty interesting problem.
He has digits of pi in what is called "BCD" format.  This is a format
that encodes two decimal digits per character.  The two digits $a and
$b are packed into a character with $a in the high bits and $b in the
low bits.  For example, the character "R" = 0x52 represents the two
decimal digits 5 and 2.

To represent digits of pi, say the ten digits "1415926536", we pack
each pair of digits into a character.  The easy way to do this in Perl is

        pack "H*", "1415926536"

and the result is the 5-character string "\x14\x15\x92\x65\x36".

Now suppose we have a very large string, say one million characters
long, which therefore represents two million digits of pi.  And we
want to know if the digit sequence "123456" appears anywhere.
Obviously, one way to do this is to unpack the two million digits and
search the resulting two-megabyte string:

        my $pos = index(unpack("H*", $pi), "123456");

But the guy in clp.moderated doesn't want to do this; he wants to
search the string *without* unpacking it.

Without intending any comment on whether this is a sensible thing to
want to do, I wrote code to solve the problem.  It came out very nice,
and I thought other folks might be interested in the exercise.  My
code will show up in clp.moderated when the moderator gets around to
approving it.  In the meantime, you might want to try it yourself;
I found it fun.

What I learned from this project: it's a lot easier to write good code
when I've gotten a full night's sleep.



-
**Majordomo list services provided by PANIX <URL:http://www.panix.com>**
**To Unsubscribe, send "unsubscribe phl" to majordomo@lists.pm.org**