Walt Mankowski on 20 Feb 2008 09:05:24 -0800


[PLUG] Lingua::Identify languages


During my talk at Plug West Monday night, one of the Perl NLP modules
I talked about was Lingua::Identify.  This is an interesting module
that tries to guess what language a given text string is written in.

Lingua::Identify exports a function called langof().  If you call
langof() in scalar context it returns the most likely language, but if
you call it in list context it returns a list of languages, each
paired with the estimated probability that the text is in that language.

As an example I passed in the text of the GPL (Version 1, if anyone's
interested).  Its top 3 guesses were:

  English    26.7%
  French      6.7%
  Romanian    4.3%

I also mentioned the language it thought the GPL was least likely to
have been written in:

  Turkish     0.7%

Someone in the audience correctly pointed out that, given that range
of numbers and my claim that it knew about 15 or 20 languages, it
seemed unlikely that they would add up to 100%.  Well, it turns out my
count was off by quite a bit.  Lingua::Identify actually knows 36
different languages.  Here's the full output:

en 26.7
fr 6.7
ro 4.3
da 4.0
ga 3.4
it 3.3
sv 3.2
nl 2.9
af 2.9
de 2.9
pt 2.8
no 2.8
fy 2.7
es 2.5
br 2.5
eo 2.4
la 2.1
hu 1.8
lv 1.8
ru 1.6
sq 1.5
pl 1.5
is 1.4
id 1.3
ms 1.2
bs 1.2
hr 1.1
so 1.1
sl 1.1
cy 1.1
et 0.9
sw 0.9
fi 0.8
eu 0.8
tr 0.7

You can see which languages correspond to those codes at
http://search.cpan.org/~cog/Lingua-Identify-0.19/lib/Lingua/Identify.pm#KNOWN_LANGUAGES.
You might also notice that there are only 35 languages in the output
above.  For some reason it didn't give an estimate for Bulgarian.
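
To settle the question of whether the numbers add up, the figures
above can simply be totalled.  The following is a small Python sketch
for checking the arithmetic, not anything that ships with
Lingua::Identify; the names OUTPUT, scores and total are my own.

```python
# Percentages copied from the langof() output listed above.
OUTPUT = """
en 26.7  fr 6.7  ro 4.3  da 4.0  ga 3.4  it 3.3  sv 3.2
nl 2.9   af 2.9  de 2.9  pt 2.8  no 2.8  fy 2.7  es 2.5
br 2.5   eo 2.4  la 2.1  hu 1.8  lv 1.8  ru 1.6  sq 1.5
pl 1.5   is 1.4  id 1.3  ms 1.2  bs 1.2  hr 1.1  so 1.1
sl 1.1   cy 1.1  et 0.9  sw 0.9  fi 0.8  eu 0.8  tr 0.7
"""

# Tokens alternate between a language code and its percentage.
tokens = OUTPUT.split()
scores = {code: float(pct) for code, pct in zip(tokens[::2], tokens[1::2])}

# Round to one decimal place to hide floating-point noise.
total = round(sum(scores.values()), 1)

print(len(scores), "languages, total", total)
```

The 35 reported values come to 99.9, so within rounding error they do
add up to 100%.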

If you look at the rest of the page you can see a description of how
it works internally.  I wish now I'd read it before my talk, because
it's actually pretty interesting.  They take sample texts from each of
those 36 languages and analyze them to find the most frequent short
words, prefixes, suffixes, and n-grams in the samples.  Then, given a
new string, they run the same analysis on it and see which language's
sample frequencies are closest to those of the new string.
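
The general idea can be sketched in a few lines of Python.  This is
only an illustration under heavy assumptions, not the module's actual
algorithm: it uses character trigrams alone (no short words, prefixes
or suffixes), the two sample sentences are made up and far too short
for real use, and the overlap score and the normalisation into shares
are my own choices.  All function and variable names are hypothetical.

```python
from collections import Counter

def trigram_profile(text):
    """Relative frequency of each character trigram in text."""
    cleaned = " ".join(text.lower().split())
    grams = [cleaned[i:i + 3] for i in range(len(cleaned) - 2)]
    counts = Counter(grams)
    total = sum(counts.values()) or 1
    return {gram: n / total for gram, n in counts.items()}

def similarity(p, q):
    """Overlap of two profiles: the smaller frequency of each shared trigram, summed."""
    return sum(min(freq, q[gram]) for gram, freq in p.items() if gram in q)

# Made-up toy samples; a real system would train on much larger texts.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog "
          "and then the cat sat on the mat",
    "fr": "le renard brun saute par dessus le chien paresseux "
          "et puis le chat dort sur le tapis",
}
PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}

def guess(text):
    """Return (language, share) pairs, best first.

    Shares are the raw overlap scores scaled to sum to 1
    (all zero if the text shares no trigram with any sample).
    """
    target = trigram_profile(text)
    raw = {lang: similarity(target, prof) for lang, prof in PROFILES.items()}
    total = sum(raw.values()) or 1.0
    pairs = [(lang, score / total) for lang, score in raw.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

print(guess("the dog sat on the mat"))
print(guess("le chat dort sur le tapis"))
```

With these toy samples the first call ranks "en" first and the second
ranks "fr" first.  Scaling the scores so they sum to 1 mimics the
percentage-style output shown earlier, though I haven't verified that
the module computes its numbers this way.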

Walt

___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug