Re: Fuzzy Matching resources?
- From: "Kyle R. Burton" <kyle.burton@gmail.com>
- To: philly-lambda@googlegroups.com
- Subject: Re: Fuzzy Matching resources?
- Date: Thu, 21 May 2009 08:52:26 -0400
> I have a project that requires that I clean up a list of student-
> entered teacher names (Mrs. Powell, MRS. Powell, Ms. Powle, etc, ==
> Mrs. Jane Powell). I know the list of ideal names.
>
> The whole project involves combining 30 or so excel files, about 6-20k
> lines, and de-duping them. One part of the de-duping is fixing this
> single field. Kyle Burton suggested using the Fuzzy Matching module.
> I'm a rank amateur at programming, let alone Perl specifically, but I
> hope someone has something that could help me out.
For Perl, some CPAN modules to look into are:
  String::Approx         http://search.cpan.org/dist/String-Approx/Approx.pm
  Text::Levenshtein      http://search.cpan.org/~jgoldberg/Text-Levenshtein-0.05/Levenshtein.pm
  Text::Brew             http://search.cpan.org/~kcivey/Text-Brew-0.02/lib/Text/Brew.pm
  String::Nysiis         http://search.cpan.org/dist/String-Nysiis/
  Text::Soundex          http://search.cpan.org/~markm/Text-Soundex-3.03/Soundex.pm
  Text::DoubleMetaphone  http://search.cpan.org/~maurice/Text-DoubleMetaphone-0.07/DoubleMetaphone.pm
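Nysiis, Soundex and DoubleMetaphone are phonetic encodings. They can be
used both to perform [very] fuzzy comparisons (two names match when
their codes are equal) and to build an index that narrows down the
candidates for some other, finer-grained fuzzy match.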
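To make the indexing idea concrete, here is a rough sketch of classic American Soundex used as a lookup key. It is written in Python only to show the logic; it is not the Text::Soundex API, and that module may treat edge cases differently. The helper names (soundex, surname_key, build_index, candidates) are made up for illustration, and keying on the last word of the name is an assumption about where the surname sits.

```python
from collections import defaultdict

# Consonant groups of classic American Soundex.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(word):
    """Return a 4-character Soundex code, or "" if the word has no letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        # 'h' and 'w' do not separate repeated codes; vowels do.
        if c not in "hw":
            prev = digit
    return (code + "000")[:4]


def surname_key(name):
    """Hypothetical helper: index on the last word, assumed to be the surname."""
    words = name.split()
    return soundex(words[-1]) if words else ""


def build_index(ideal_names):
    """Group the known-good names by the Soundex code of their surname."""
    index = defaultdict(list)
    for name in ideal_names:
        index[surname_key(name)].append(name)
    return index


def candidates(entry, index):
    """Return the ideal names whose surname sounds like the entry's surname."""
    return index.get(surname_key(entry), [])


index = build_index(["Mrs. Jane Powell", "Mr. John Smith"])
print(candidates("Ms. Powle", index))
```

Entries that share a code with one of the ideal names become the short list to hand to a stricter comparison; entries with no candidates can be set aside for manual review.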
String::Approx and Text::Levenshtein both work with edit distance. They
can be used to count the number of edits between two strings and, from
that, to calculate a similarity percentage (1 - #edits / length of the
longer string).
The adist function in String::Approx and the distance function in
Text::Levenshtein may be what you're after.
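For reference, the similarity arithmetic can be sketched as follows. This is a plain Levenshtein implementation in Python, meant only to show the calculation; in Perl the edit count would come from the modules above instead. The function names (edit_distance, similarity, best_match) are invented for this example, and lowercasing both strings before comparing is an assumption that case differences should not count as edits.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character inserts, deletes, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a, b):
    """1 - #edits / length of the longer string; 1.0 means identical."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 1.0
    return 1 - edit_distance(a, b) / longer


def best_match(entry, ideal_names):
    """Pick the ideal name most similar to the entry, ignoring case."""
    return max(ideal_names, key=lambda name: similarity(entry.lower(), name.lower()))


print(best_match("MRS. Powle", ["Mr. Smith", "Mrs. Powell"]))
```

In practice you would also want a cutoff, so that an entry whose best similarity is still low is flagged for manual review rather than silently rewritten; the right threshold depends on the data.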
Is this the kind of info you were after?
Regards,
Kyle