Berthoff, Tom on 21 Feb 2008 06:48:12 -0800


Code performance


Hi,

I'm relatively new to Perl, so I apologize in advance for any clueless
questions I might pose. I am, after all, essentially clueless.

Here's the question:

Can anyone suggest techniques for further optimizing the script described
below? By optimizing I mean increasing its scanning speed and reducing
the system resources it consumes.

I have a Perl script that 

1) gets a list of files
2) navigates to a specific directory tree on a server
3) looks for files in the directory tree and for each file
   3.1) opens the file
   3.2) scans each line of the file for references to the files in my
list collected in step 1.
   3.3) If the file name is found, the match information is stored in an
output file. "Match" information means: location of the file being
scanned, name of the file being scanned, and the matching file from the
original list.
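
Steps 1 to 3 above can be sketched roughly as follows. This is a minimal
illustration, not the script in question: the sub name scan_tree, the
record layout, and the use of the core File::Find module are all
assumptions, and it looks for full file names only (no partial matching).

```perl
use strict;
use warnings;
use File::Find;
use File::Basename qw(basename);

# Hypothetical helper: walk $root and look on every line of every plain
# file for the names in @$names. Returns a reference to a list of
# [directory, scanned file name, matched name] records.
sub scan_tree {
    my ($root, $names) = @_;
    my @matches;
    find({
        no_chdir => 1,    # keep full paths in $File::Find::name
        wanted   => sub {
            my $path = $File::Find::name;
            return unless -f $path;
            open my $fh, '<', $path or return;    # skip unreadable files
            while (my $line = <$fh>) {
                for my $name (@$names) {
                    # index() is a plain substring search, no regex engine
                    push @matches, [$File::Find::dir, basename($path), $name]
                        if index($line, $name) >= 0;
                }
            }
            close $fh;
        },
    }, $root);
    return \@matches;
}
```

Each record could then be written to the output file, for example as one
tab-separated line per match.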

I've gone through several iterations of this code, trying to make it
smarter so that it also finds partial matches and therefore more matches
overall. To do this, I remove wildcards, formatting characters, and
numbers from each file name in the original match list, then tokenize
what is left, so that I match on each element of the file name rather
than only on the full name (which might contain wildcards, replacement
characters, date formats, and so on). So, for instance, the file name

Equity.namr.out.enc.gz 

becomes

equity#namr#out#enc#gz

or an array

equity
namr
out
enc
gz
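
One way to get from the file name to that token list is sketched below.
The sub name tokenize is hypothetical, and treating every non-letter
(dots, digits, wildcards, other punctuation) as a separator is an
assumption about what "removing wildcards, formatting characters, and
numbers" amounts to.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: lower-case the string, then split on every run
# of non-letters, so digits and punctuation all act as separators.
sub tokenize {
    my ($text) = @_;
    # grep drops the empty leading field left by a leading separator
    return grep { length } split /[^a-z]+/, lc $text;
}

# join '#', tokenize('Equity.namr.out.enc.gz')
#   gives 'equity#namr#out#enc#gz'
```

The same sub can be applied unchanged to a line read from a scanned file.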

I do the same thing for each line of the file. So for instance if I'm
scanning a shell script with the line

EQUITY_NAMR_FILE=equity.namr.out.dec.$DATE.gz

the line gets tokenized to

equity#namr#file#equity#namr#out#dec#date#gz

When I compare the two, I get a score based on the number of elements
matched between the two strings.
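
A score of this kind could be computed as in the sketch below. The exact
scoring rule is not given above, so this is only one plausible reading:
count the distinct file-name tokens that also occur among the line's
tokens. The sub name match_score is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical scoring: number of distinct file-name tokens that also
# appear among the tokens of the scanned line.
sub match_score {
    my ($name_tokens, $line_tokens) = @_;    # two array references
    my %in_line = map { $_ => 1 } @$line_tokens;
    my %seen;
    # count each file-name token at most once, and only if the line has it
    return scalar grep { $in_line{$_} && !$seen{$_}++ } @$name_tokens;
}

my @name = qw(equity namr out enc gz);
my @line = qw(equity namr file equity namr out dec date gz);
my $score = match_score(\@name, \@line);    # 4: every token except "enc"
```

Building the hash once per line makes each token lookup constant time,
instead of comparing every file-name token against every line token.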

Needless to say, this hummer is old dog slow and it consumes a bunch of
system resources. 

I've tried a number of techniques to get it to run faster, including:

Only searching executables (which is what I'm interested in, mainly,
although this means I miss configuration files that are read by the
executable but that might contain file references)

Limiting the search to files under 50K (again, not optimal, because
there might be big executables and/or reference files out there).

Scanning the original match list once and storing it in an array.

Storing the original match list as an array reference.

Storing the lines of each file as an array reference.
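
For illustration, the executable and size filters and the by-reference
list handling listed above might look like the sketch below. The names
is_candidate, count_hits, and MAX_BYTES are hypothetical, and "50K" is
assumed to mean 50 * 1024 bytes.

```perl
use strict;
use warnings;

use constant MAX_BYTES => 50 * 1024;    # assumed meaning of "50K"

# Hypothetical filter: plain file, executable by the current user, and
# under the size cap. The special filehandle _ reuses the stat data from
# the preceding -f test instead of hitting the file system again.
sub is_candidate {
    my ($path) = @_;
    return 0 unless -f $path && -x _;
    my $size = -s _;
    return ($size || 0) < MAX_BYTES;
}

# Taking the match list as an array reference means the list itself is
# not copied on each call; only the reference is passed.
sub count_hits {
    my ($names, $line) = @_;
    return scalar grep { index($line, $_) >= 0 } @$names;
}
```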

Thanks in advance for your help,

Tom Berthoff

Enterprise Technology
Susquehanna International Group, LLP
x1024
484-562-1024
tom.berthoff@sig.com

