Berthoff, Tom on 21 Feb 2008 06:48:12 -0800
Hi,

I'm relatively new to Perl, so I apologize in advance for any clueless questions I might pose. I am, after all, essentially clueless.

Here's the question: Can anyone suggest techniques for further optimizing (meaning increasing the scanning speed of, and reducing the system resources consumed by) the following?

I have a Perl script that

1) gets a list of files
2) navigates to a specific directory tree on a server
3) looks for files in the directory tree and, for each file,
   3.1) opens the file
   3.2) scans each line of the file for references to the files in my list collected in step 1
   3.3) if a file name is found, stores the match information in an output file

"Match" information means: the location of the file being scanned, the name of the file being scanned, and the matching file from the original list.

I've gone through several iterations of this code, trying to make it more intelligent so that it finds more matches by looking for partial matches. I do this by removing wildcards, formatting characters, and numbers from each file name in the original match list, and then tokenizing each file name so that I'm matching on each element of the file name, not just the full file name (which might contain wildcards, replacement characters, date formats, and so on).

So, for instance, the file name

Equity.namr.out.enc.gz

becomes

equity#namr#out#enc#gz

or an array

equity namr out enc gz

I do the same thing for each line of the file being scanned. So, for instance, if I'm scanning a shell script with the line

EQUITY_NAMR_FILE=equity.namr.out.dec.$DATE.gz

the line gets tokenized to

equity#namr#file#equity#namr#out#dec#date#gz

When I compare the two, I get a score based on the number of elements matched between the two strings.

Needless to say, this hummer is old-dog slow and it consumes a bunch of system resources.
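For illustration, the tokenize-and-score step described above might look roughly like the following. This is only a sketch under stated assumptions, not the actual script: the sub names tokenize and score are made up here, a token is assumed to be any run of letters, and the score is assumed to be the number of distinct file-name tokens that also appear in the line.

```perl
use strict;
use warnings;

# Hypothetical helper: lowercase the text and split it into alphabetic
# tokens, which drops digits, wildcards, '$', '.', '_' and other punctuation.
sub tokenize {
    my ($text) = @_;
    return grep { length } split /[^a-z]+/, lc $text;
}

# Hypothetical helper: count the distinct file-name tokens that also occur
# in the line. The hash avoids rescanning the line tokens for every name token.
sub score {
    my ($name_tokens, $line_tokens) = @_;
    my %in_line = map { $_ => 1 } @$line_tokens;
    my %seen;
    return scalar grep { $in_line{$_} && !$seen{$_}++ } @$name_tokens;
}

my @name = tokenize('Equity.namr.out.enc.gz');
my @line = tokenize('EQUITY_NAMR_FILE=equity.namr.out.dec.$DATE.gz');

print join('#', @name), "\n";   # equity#namr#out#enc#gz
print join('#', @line), "\n";   # equity#namr#file#equity#namr#out#dec#date#gz
print score(\@name, \@line), " of ", scalar(@name), " tokens matched\n";   # 4 of 5 tokens matched
```

With these assumptions, the example file name scores 4 out of 5 against the example shell-script line, because "enc" does not appear in the line.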
I've tried a number of techniques to get it to run faster, including:

- Only searching executables (which is what I'm interested in, mainly, although this means I miss configuration files that are read by the executable but that might contain file references).
- Limiting the search to files under 50K (again, not optimal, because there might be big executables and/or reference files out there).
- Scanning the original match list once and storing it in an array.
- Putting the original match list in a referenced array.
- Putting each file line in a referenced array.

Thanks in advance for your help,

Tom Berthoff
Enterprise Technology
Susquehanna International Group, LLP
x1024 484-562-1024
tom.berthoff@sig.com