Walt Mankowski on 20 Feb 2008 09:05:24 -0800
During my talk at PLUG West Monday night, one of the Perl NLP modules I talked about was Lingua::Identify. This is an interesting module that tries to guess what language a given text string is written in.

Lingua::Identify exports a function called langof(). If you call langof() in scalar context it returns the most likely language, but if you call it in list context it returns a list of languages, each paired with the estimated probability that the text is in that language.

As an example I passed in the text of the GPL (Version 1, if anyone's interested). Its top 3 guesses were:

    English   26.7%
    French     6.7%
    Romanian   4.3%

I also mentioned the language it thought the GPL was least likely to have been written in:

    Turkish    0.7%

Someone in the audience correctly pointed out that, given that range of numbers and that I said it knew about 15 or 20 languages, it seemed unlikely that they would add up to 100%.

Well, it turns out I was off by quite a bit. Lingua::Identify actually knows about 36 different languages. Here's the full output:

    en 26.7
    fr 6.7
    ro 4.3
    da 4.0
    ga 3.4
    it 3.3
    sv 3.2
    nl 2.9
    af 2.9
    de 2.9
    pt 2.8
    no 2.8
    fy 2.7
    es 2.5
    br 2.5
    eo 2.4
    la 2.1
    hu 1.8
    lv 1.8
    ru 1.6
    sq 1.5
    pl 1.5
    is 1.4
    id 1.3
    ms 1.2
    bs 1.2
    hr 1.1
    so 1.1
    sl 1.1
    cy 1.1
    et 0.9
    sw 0.9
    fi 0.8
    eu 0.8
    tr 0.7

You can see which languages correspond to those codes at http://search.cpan.org/~cog/Lingua-Identify-0.19/lib/Lingua/Identify.pm#KNOWN_LANGUAGES. You also might notice that there are only 35 languages there. For some reason it didn't give an estimate for Bulgarian.

If you look at the rest of the page you can see a description of how it works internally. I wish now I'd read it before my talk, because it's actually pretty interesting. They take sample texts from each of those 36 languages and analyze them to find the most frequent short words, prefixes, suffixes, and ngrams in the samples. Then, given a new string, they do the same analysis on it and see which language's sample frequencies are closest to those of the new string.

Walt
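To make the frequency-matching idea described above concrete, here is a minimal sketch in Python. It is not Lingua::Identify's actual code: it uses only character trigrams (no short words, prefixes, or suffixes), compares profiles with cosine similarity (an assumption about what "closest" could mean), and relies on tiny made-up sample sentences. The names ngram_profile, cosine, SAMPLES, and rank_languages are hypothetical.

```python
from collections import Counter
import math


def ngram_profile(text, n=3):
    """Return relative frequencies of character n-grams in the text."""
    # Lowercase, collapse whitespace, and pad so word edges become n-grams too.
    cleaned = " " + " ".join(text.lower().split()) + " "
    counts = Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(gram, 0.0) for gram, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


# Hypothetical toy samples; a real identifier would train on far more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then "
          "the dog runs with the other dogs in the park",
    "fr": "le renard brun saute par dessus le chien paresseux et puis "
          "le chien court avec les autres chiens dans le parc",
    "de": "der schnelle braune fuchs springt auf den faulen hund und dann "
          "rennt der hund mit den anderen hunden im park",
}


def rank_languages(text, samples=SAMPLES):
    """Return (language, share) pairs, best match first; shares sum to 1."""
    target = ngram_profile(text)
    scores = {lang: cosine(target, ngram_profile(sample))
              for lang, sample in samples.items()}
    total = sum(scores.values())
    if total == 0:
        return [(lang, 0.0) for lang in samples]
    # Normalize so the scores can be read like the percentages in the post.
    ranked = ((lang, score / total) for lang, score in scores.items())
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for lang, share in rank_languages("the dog and the fox"):
        print(f"{lang} {share * 100:.1f}")
```

Because the scores are normalized across all known languages, every language that shares even a few trigrams with the input gets a small nonzero share, which is one plausible reason a long tail of low percentages like the one in the full output can appear.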
___________________________________________________________________________
Philadelphia Linux Users Group -- http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug