[PLUG] Lingua::Identify languages

Walt Mankowski on 20 Feb 2008 09:05:24 -0800


From: Walt Mankowski <waltman@pobox.com>

To: plug@lists.phillylinux.org

Subject: [PLUG] Lingua::Identify languages

Date: Wed, 20 Feb 2008 11:37:05 -0500

Reply-to: Philadelphia Linux User's Group Discussion List <plug@lists.phillylinux.org>

Sender: plug-bounces@lists.phillylinux.org

User-agent: Mutt/1.5.17+20080114 (2008-01-14)

During my talk at PLUG West Monday night, one of the Perl NLP modules I talked about was Lingua::Identify. This is an interesting module that tries to guess what language a given text string is written in.

Lingua::Identify exports a function called langof(). If you call langof() in scalar context it returns the most likely language, but if you call it in list context it returns a list of languages, each paired with the estimated probability that the text is in that language.

As an example I passed in the text of the GPL (Version 1, if anyone's interested). Its top 3 guesses were:

    English   26.7%
    French     6.7%
    Romanian   4.3%

I also mentioned the language it thought the GPL was least likely to have been written in:

    Turkish    0.7%

Someone in the audience correctly pointed out that, given that range of numbers and that I said it knew about 15 or 20 languages, it seemed unlikely that they would add up to 100%.

Well, it turns out I was off by quite a bit. Lingua::Identify actually knows about 36 different languages. Here's the full output:

    en 26.7
    fr 6.7
    ro 4.3
    da 4.0
    ga 3.4
    it 3.3
    sv 3.2
    nl 2.9
    af 2.9
    de 2.9
    pt 2.8
    no 2.8
    fy 2.7
    es 2.5
    br 2.5
    eo 2.4
    la 2.1
    hu 1.8
    lv 1.8
    ru 1.6
    sq 1.5
    pl 1.5
    is 1.4
    id 1.3
    ms 1.2
    bs 1.2
    hr 1.1
    so 1.1
    sl 1.1
    cy 1.1
    et 0.9
    sw 0.9
    fi 0.8
    eu 0.8
    tr 0.7

Those figures sum to 99.9, so allowing for rounding they do add up to 100% after all.

You can see which languages correspond to those codes at http://search.cpan.org/~cog/Lingua-Identify-0.19/lib/Lingua/Identify.pm#KNOWN_LANGUAGES. You also might notice that there are only 35 languages there. For some reason it didn't give an estimate for Bulgarian.

If you look at the rest of the page you can see a description of how it works internally. I wish now I'd read it before my talk, because it's actually pretty interesting. They take sample texts from each of those 36 languages and analyze them to find the most frequent short words, prefixes, suffixes, and n-grams in the samples. Then, given a new string, they compute the same statistics for it and see which language's sample frequencies are closest to those of the new string.

Walt
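For anyone who wants to play with the frequency-matching idea described above, here is a minimal sketch in Python. It is not Lingua::Identify's code and does not reproduce its methods or its numbers: it uses character trigrams only, a simple overlap score chosen for illustration, and three tiny made-up sample sentences in place of real training texts. The names ngram_profile, similarity, guess_language, and SAMPLES are all hypothetical.

```python
from collections import Counter


def ngram_profile(text, n=3):
    """Return the relative frequency of each character n-gram in text."""
    cleaned = " ".join(text.lower().split())  # lowercase, collapse whitespace
    grams = [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    return {gram: count / len(grams) for gram, count in counts.items()}


def similarity(profile_a, profile_b):
    """Overlap score: for each shared n-gram, add the smaller frequency."""
    return sum(min(freq, profile_b[gram])
               for gram, freq in profile_a.items() if gram in profile_b)


def guess_language(text, samples, n=3):
    """Return (language, share) pairs, best match first.

    Shares are the overlap scores normalized to sum to 1. This
    normalization is an assumption of the sketch, not a claim about
    how the module computes its percentages.
    """
    target = ngram_profile(text, n)
    scores = {lang: similarity(target, ngram_profile(sample, n))
              for lang, sample in samples.items()}
    total = sum(scores.values())
    if total == 0:
        return [(lang, 0.0) for lang in samples]
    ranked = [(lang, score / total) for lang, score in scores.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Toy "training" texts, invented for this example; real samples would be
# far larger.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat "
          "on the mat with the other animals",
    "fr": "le renard brun saute par dessus le chien paresseux et le chat "
          "est assis sur le tapis avec les autres",
    "de": "der schnelle braune fuchs springt über den faulen hund und die "
          "katze sitzt auf der matte mit den anderen",
}

if __name__ == "__main__":
    for lang, share in guess_language("the dog and the cat are on the mat",
                                      SAMPLES):
        print(f"{lang} {share * 100:.1f}")
```

Because the shares are normalized, they always sum to 1 across however many languages are in the sample set, which is why a long tail of small percentages appears once enough languages are included.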
