Kyle R. Burton on 21 May 2009 05:52:39 -0700



Re: Fuzzy Matching resources?

  • From: "Kyle R. Burton" <kyle.burton@gmail.com>
  • To: philly-lambda@googlegroups.com
  • Subject: Re: Fuzzy Matching resources?
  • Date: Thu, 21 May 2009 08:52:26 -0400
  • Mailing-list: list philly-lambda@googlegroups.com; contact philly-lambda+owner@googlegroups.com
  • Reply-to: philly-lambda@googlegroups.com
  • Sender: philly-lambda@googlegroups.com

> I have a project that requires that I clean up a list of student-
> entered teacher names (Mrs. Powell, MRS. Powell,  Ms. Powle, etc, ==
> Mrs. Jane Powell). I know the list of ideal names.
>
> The whole project involves combining 30 or so excel files, about 6-20k
> lines, and de-duping them. One part of the de-duping is fixing this
> single field. Kyle Burton suggested using the Fuzzy Matching module.
> I'm a rank amateur at programming, let alone Perl specifically, but I
> hope someone has something that could help me out.

For Perl, some CPAN modules to look into are:

String::Approx
  http://search.cpan.org/dist/String-Approx/Approx.pm
Text::Levenshtein
  http://search.cpan.org/~jgoldberg/Text-Levenshtein-0.05/Levenshtein.pm
Text::Brew
  http://search.cpan.org/~kcivey/Text-Brew-0.02/lib/Text/Brew.pm
String::Nysiis
  http://search.cpan.org/dist/String-Nysiis/
Text::Soundex
  http://search.cpan.org/~markm/Text-Soundex-3.03/Soundex.pm
Text::DoubleMetaphone
  http://search.cpan.org/~maurice/Text-DoubleMetaphone-0.07/DoubleMetaphone.pm

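Nysiis, Soundex and DoubleMetaphone are phonetic encodings: each one
reduces a name to a short code, so that names that sound alike get the
same code. You can compare the codes directly for a [very] fuzzy
match, or use the code as an index key to pull out candidates for a
more precise fuzzy comparison.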
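
As a rough illustration of the indexing idea, here is a minimal sketch
in plain Perl. It uses a hand-rolled American Soundex (soundex_code)
instead of the Text::Soundex API, it treats the last word of each entry
as the surname, and the second ideal name is made up. All of those are
my assumptions, not something the modules above require.

```perl
use strict;
use warnings;
use 5.010;    # for the // (defined-or) operator

# Letter-to-digit table for American Soundex; vowels, H, W and Y have no digit.
my %CODE;
@CODE{qw(B F P V)}         = (1) x 4;
@CODE{qw(C G J K Q S X Z)} = (2) x 8;
@CODE{qw(D T)}             = (3) x 2;
@CODE{qw(L)}               = (4);
@CODE{qw(M N)}             = (5) x 2;
@CODE{qw(R)}               = (6);

# Hand-rolled Soundex: first letter plus up to three digits, zero padded.
sub soundex_code {
    my ($name) = @_;
    my $s = uc $name;
    $s =~ s/[^A-Z]//g;
    return '' if $s eq '';

    my @letters = split //, $s;
    my $first   = shift @letters;
    my $out     = $first;
    my $prev    = $CODE{$first} // 0;
    for my $ch (@letters) {
        next if $ch eq 'H' || $ch eq 'W';    # H and W do not separate equal codes
        my $c = $CODE{$ch} // 0;             # vowels and Y count as 0
        $out .= $c if $c && $c != $prev;     # skip vowels and repeated codes
        $prev = $c;
        last if length($out) == 4;
    }
    return substr($out . '000', 0, 4);
}

# Assumption: the surname is the last whitespace-separated word of the entry.
sub surname_key {
    my ($entry) = @_;
    my @words = grep { length } split /\s+/, $entry;
    return soundex_code($words[-1] // '');
}

# Index the ideal names by the Soundex code of the surname.
my @ideal = ('Mrs. Jane Powell', 'Mr. Robert Tymczak');    # second name is made up
my %index;
push @{ $index{ surname_key($_) } }, $_ for @ideal;

for my $entered ('Mrs. Powell', 'MRS. Powell', 'Ms. Powle') {
    my $candidates = $index{ surname_key($entered) } // [];
    print "$entered => ", join(', ', @$candidates), "\n";
}
# All three entries share the code P400 and print "Mrs. Jane Powell".
```

Only the ideal names that share a code with the entered value then need
the slower, more precise comparison.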

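String::Approx and Text::Levenshtein both work on edit distance: the
number of single-character insertions, deletions and substitutions
needed to turn one string into the other. From that count you can
calculate a similarity percentage:
1 - (#edits / length of the longer string).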
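
To make the arithmetic concrete, here is a minimal sketch in plain
Perl. edit_distance is a textbook Levenshtein implementation standing
in for what the modules compute, not their API. normalize, similarity
and best_match are hypothetical helpers of my own, and the second ideal
name is made up.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Textbook Levenshtein distance, keeping only two rows of the table.
sub edit_distance {
    my ($s, $t) = @_;
    my @s    = split //, $s;
    my @t    = split //, $t;
    my @prev = (0 .. scalar @t);
    for my $i (1 .. scalar @s) {
        my @cur = ($i);
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,            # deletion
                $cur[$j - 1] + 1,         # insertion
                $prev[$j - 1] + $cost,    # substitution or match
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Similarity as described above: 1 - (#edits / length of the longer string).
sub similarity {
    my ($s, $t) = @_;
    my $longer = max(length($s), length($t));
    return 1 if $longer == 0;    # two empty strings are identical
    return 1 - edit_distance($s, $t) / $longer;
}

# Assumption: case, punctuation and extra spaces should not count as edits.
sub normalize {
    my ($name) = @_;
    my $n = lc $name;
    $n =~ s/[^a-z\s]//g;
    $n =~ s/\s+/ /g;
    $n =~ s/^ | $//g;
    return $n;
}

# Return the ideal name with the highest similarity, and that score.
sub best_match {
    my ($entered, @ideal) = @_;
    my ($best, $best_score) = (undef, -1);
    for my $candidate (@ideal) {
        my $score = similarity(normalize($entered), normalize($candidate));
        ($best, $best_score) = ($candidate, $score) if $score > $best_score;
    }
    return ($best, $best_score);
}

print edit_distance('kitten', 'sitting'), "\n";    # prints 3

my @ideal = ('Mrs. Jane Powell', 'Mr. Robert Tymczak');    # second name is made up
for my $entered ('Mrs. Powell', 'MRS. Powell', 'Ms. Powle') {
    my ($match, $score) = best_match($entered, @ideal);
    printf "%-12s => %s (%.2f)\n", $entered, $match, $score;
}
```

Titles and missing first names pull the score down, so entries whose
best score falls below some cutoff (which you would have to pick by
looking at your own data) are probably best reviewed by hand.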

String::Approx (its adist function) and Text::Levenshtein may be what
you're after.

Is this the kind of info you were after?


Regards,

Kyle