Re: Fuzzy Matching resources?

Kyle R. Burton on 21 May 2009 05:52:39 -0700

From: "Kyle R. Burton" <kyle.burton@gmail.com>

To: philly-lambda@googlegroups.com

Subject: Re: Fuzzy Matching resources?

Date: Thu, 21 May 2009 08:52:26 -0400

Mailing-list: list philly-lambda@googlegroups.com; contact philly-lambda+owner@googlegroups.com

Reply-to: philly-lambda@googlegroups.com

Sender: philly-lambda@googlegroups.com

> I have a project that requires that I clean up a list of
> student-entered teacher names (Mrs. Powell, MRS. Powell, Ms. Powle,
> etc, == Mrs. Jane Powell). I know the list of ideal names.
>
> The whole project involves combining 30 or so excel files, about 6-20k
> lines, and de-duping them. One part of the de-duping is fixing this
> single field. Kyle Burton suggested using the Fuzzy Matching module.
> I'm a rank amateur at programming, let alone Perl specifically, but I
> hope someone has something that could help me out.

For Perl, some CPAN modules to look into are:

  String::Approx
    http://search.cpan.org/dist/String-Approx/Approx.pm

  Text::Levenshtein
    http://search.cpan.org/~jgoldberg/Text-Levenshtein-0.05/Levenshtein.pm

  Text::Brew
    http://search.cpan.org/~kcivey/Text-Brew-0.02/lib/Text/Brew.pm

  String::Nysiis
    http://search.cpan.org/dist/String-Nysiis/

  Text::Soundex
    http://search.cpan.org/~markm/Text-Soundex-3.03/Soundex.pm

  Text::DoubleMetaphone
    http://search.cpan.org/~maurice/Text-DoubleMetaphone-0.07/DoubleMetaphone.pm

Nysiis, Soundex and DoubleMetaphone are phonetic encodings. They can be
used both to perform [very] fuzzy comparisons and to create an index to
use as a basis for other fuzzy matching.

String::Approx and Text::Levenshtein (which computes edit distance) can
be used to count the number of edits, and from that to calculate a
similarity percentage:

  1 - (#edits / length of the longer string)

String::Approx (its adist function) and Text::Levenshtein may be what
you're after.

Is this the kind of info you were after?

Regards,

Kyle
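
To make the similarity percentage above concrete, here is a minimal sketch in Python rather than Perl. It does not use or imitate the API of String::Approx or Text::Levenshtein. The function names (levenshtein, similarity, normalize, best_match) and the sample names are assumptions made up for illustration. It computes a plain edit distance, turns it into 1 - (#edits / length of the longer string), and picks the closest entry from a list of ideal names.

```python
def levenshtein(a, b):
    """Count the single-character inserts, deletes and substitutions
    needed to turn string a into string b (plain edit distance)."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """1 - (#edits / length of the longer string); 1.0 means identical."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longer


def normalize(name):
    """Hypothetical clean-up step: lower-case, drop punctuation, and
    collapse whitespace so 'MRS. Powell' and 'Mrs Powell' compare equal."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def best_match(entered, ideal_names):
    """Return (ideal_name, score) for the ideal name closest to entered."""
    scored = [
        (similarity(normalize(entered), normalize(ideal)), ideal)
        for ideal in ideal_names
    ]
    score, ideal = max(scored)
    return ideal, score


# Sample data invented for this sketch.
ideal_names = ["Mrs. Jane Powell", "Mr. John Smith"]
for entered in ["Mrs. Powell", "MRS. Powell", "Ms. Powle"]:
    print(entered, "->", best_match(entered, ideal_names))
```

Scores stay fairly low when the entered name lacks the first name, so in practice one would probably pick a cut-off by inspecting real data, or compare only the surname portion.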

Follow-Ups:

Re: Fuzzy Matching resources?
From: Bonnie Aumann <aumannb@gmail.com>

References:

Fuzzy Matching resources?
From: Bonnie Aumann <aumannb@gmail.com>
