Kristian Erik Hermansen on 20 Feb 2008 13:10:16 -0800



Re: [PLUG] Lingua::Identify languages


#!/usr/bin/env python
# Kristian Hermansen <kristian.hermansen@gmail.com>
# Version: 20070305
# Markov-chain Language Learner (digrams)
# Learns a language using training data files in 'langfiles'.
# Then takes test input and determines the most likely language.
# And for fun, I added support for Latin as well ;-)

import sys
from numpy import zeros
from string import lower, strip, upper

# language files with training data
#langfiles = ['english', 'french', 'german', 'italian', 'portuguese', 'spanish']
langfiles = ['english', 'french', 'german', 'italian', 'latin',
             'portuguese', 'spanish']

# returns the ascii value offset for valid chars
# or 0 if not a valid learning char.
def filter_char(c):
    # 'a' == 97, so -96 gets us to 1
    offset = 96
    c = ord(lower(c)) - offset
    if c >= 1 and c <= 26:
        return c
    else:
        return 0

# create frequency/probability matrices
def make_freq_prob_mat(lang):
    # initialize the 27x27 count matrix (index 0 = any non-letter)
    fm = zeros( (27,27) )
    # optional smoothing to remove zero probabilities (disabled)
    #fm = fm + 0.1

    # get the data from the file
    # walk a two-char pair
    f = open(lang).read()
    cnt = 0
    while cnt < (len(f)-1):
        twochr = [f[cnt],f[cnt+1]]
        fm[filter_char(twochr[0])][filter_char(twochr[1])] += 1.0
        cnt += 1

    # probability matrix: pair counts scaled by the file length
    pm = fm / len(f)
    return (fm,pm)

# return sentences
def get_sentences(doc):
    # split into sentences
    s = doc.split('.')
    # strip surrounding whitespace from each sentence
    m = map(strip, s)
    # drop the trailing fragment after the final '.'
    m.pop()
    return m

# return weighted items from dict
def weighted_items(d):
    items = d.items()
    items = [(v, k) for (k, v) in items]
    items.sort()
    items.reverse()
    # so largest is first
    items = [(k, v) for (v, k) in items]
    return items

# create a 2D bar graph for graphical representation
# takes a frequency matrix as input
# <TODO>
def make_bar_graph_2d(fm):
    alpha = '*abcdefghijklmnopqrstuvwxyz'

# top-level language learning function
def learn(lang):
    (fm,pm) = make_freq_prob_mat(lang)
    return pm

# top-level language testing function
def test(file):
    f = open(file).read()
    sents = get_sentences(f)
    for s in sents:
        # avoid very short fragments
        if len(s) < 3:
            continue
        print s
        p_tot = 0
        p_rng = 0
        p_arr = []
        for p in langpms:
            ccnt = 0
            # start it off very low, to avoid zero divides [/hack]
            runprb = 0.000000000000000001
            while ccnt < (len(s)-1):
                twochr = [s[ccnt],s[ccnt+1]]
                # avoid letting percentages build from non-char data
                if filter_char(twochr[0]) == 0 or filter_char(twochr[1]) == 0:
                    ccnt += 1
                    continue
                else:
                    runprb += p[filter_char(twochr[0])][filter_char(twochr[1])]
                ccnt += 1
            p_tot += runprb
            p_rng += 1
            p_arr.append(runprb)
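        # normalize each language's score by the total across languages
        # ('scores' avoids shadowing the builtin dict)
        scores = {}
        for i_rng, i in enumerate(p_arr):
            scores[langfiles[i_rng]] = i / p_tot
        items = weighted_items(scores)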
        # most likely candidate
        print '\t*', upper(items[0][0]), '=>', items[0][1]
        # the other candidates
        for i in items:
            if i[0] == items[0][0]:
                continue
            else:
                print '\t-', i[0], '=>', i[1]
        print ''

# our languages
langpms = []
for l in langfiles:
    langpms.append(learn(l))

if len(sys.argv) != 2:
    print 'usage:', sys.argv[0], 'test-data'
else:
    test(sys.argv[1])
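
For readers without the training files, here is a minimal sketch of the same digram idea. It is written for Python 3 and uses only the standard library (plain lists instead of numpy). The two training strings are invented for illustration, and the names `digram_probs`, `score`, and `rank` are hypothetical; only `filter_char` and the overall scoring (sum the table entries for each letter pair, then normalize across languages) follow the script above.

```python
# Python 3 sketch of the digram scoring; NOT the original script.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def filter_char(c):
    """Map 'a'..'z' (any case) to 1..26 and everything else to 0."""
    return ALPHABET.find(c.lower()) + 1

def digram_probs(text):
    """27x27 table of pair counts divided by the text length."""
    counts = [[0.0] * 27 for _ in range(27)]
    for a, b in zip(text, text[1:]):
        counts[filter_char(a)][filter_char(b)] += 1.0
    n = max(len(text), 1)  # guard against empty training text
    return [[c / n for c in row] for row in counts]

def score(sentence, probs):
    """Sum the table entries for every letter-letter pair in the sentence."""
    total = 1e-18  # tiny floor so the normalization never divides by zero
    for a, b in zip(sentence, sentence[1:]):
        i, j = filter_char(a), filter_char(b)
        if i and j:  # skip pairs that involve a non-letter
            total += probs[i][j]
    return total

def rank(sentence, models):
    """Return (language, normalized score) pairs, best candidate first."""
    raw = {lang: score(sentence, pm) for lang, pm in models.items()}
    norm = sum(raw.values())
    return sorted(((lang, s / norm) for lang, s in raw.items()),
                  key=lambda item: item[1], reverse=True)

# Toy training text, made up for this sketch (assumption, not real data).
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog and then the cat",
    "spanish": "el rapido zorro marron salta sobre el perro perezoso y el gato",
}
models = {lang: digram_probs(text) for lang, text in TRAINING.items()}
ranking = rank("the dog and the fox", models)
for lang, share in ranking:
    print(lang, "=>", share)
```

With training text this small the numbers mean little; the point is only to show the shape of the computation.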





Just be aware that this code does not split sentences very
intelligently (it simply breaks the input on '.'), but it still has
decent accuracy.  You should be able to differentiate between the
sentence "I love you" in numerous languages with great accuracy.
Markov is a great AI technique, but I kinda half-assed it in this
code :-P  You need the data files to run this, but looking at the code
can give you an idea of how it works.  Email me privately and I can
send you all the data if you really want to see it in action...
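
As a possible follow-up on the Markov part: the posted script sums pair frequencies, whereas a more conventional first-order Markov model multiplies conditional probabilities P(next | current), usually as a sum of logs. The sketch below is one way that might look in Python 3. It is an assumption-laden illustration, not the script's method: the function names, the toy training strings, and the choice to treat every non-letter as a single symbol 0 are all made up here, and the add-0.1 smoothing simply borrows the value from the script's commented-out `fm = fm + 0.1`.

```python
# Python 3 sketch of a log-likelihood variant; NOT what the posted script does.
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def filter_char(c):
    """Map 'a'..'z' (any case) to 1..26 and everything else to 0."""
    return ALPHABET.find(c.lower()) + 1

def transition_logprobs(text, k=0.1):
    """Row-normalized, add-k smoothed log P(next | current), 27 symbols."""
    counts = [[k] * 27 for _ in range(27)]
    for a, b in zip(text, text[1:]):
        counts[filter_char(a)][filter_char(b)] += 1.0
    table = []
    for row in counts:
        row_total = sum(row)
        table.append([math.log(c / row_total) for c in row])
    return table

def log_likelihood(sentence, table):
    """Sum of log transition probabilities over consecutive symbols."""
    idx = [filter_char(c) for c in sentence]
    return sum(table[i][j] for i, j in zip(idx, idx[1:]))

def classify(sentence, tables):
    """Pick the language whose table gives the highest log-likelihood."""
    return max(tables, key=lambda lang: log_likelihood(sentence, tables[lang]))

# Toy training text, made up for this sketch (assumption, not real data).
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog and then the cat",
    "spanish": "el rapido zorro marron salta sobre el perro perezoso y el gato",
}
tables = {lang: transition_logprobs(text) for lang, text in TRAINING.items()}
for sentence in ("the dog and the fox", "el perro y el gato"):
    print(sentence, "=>", classify(sentence, tables))
```

Working in logs avoids underflow on long sentences, and the smoothing keeps an unseen pair from producing log(0).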
-- 
Kristian Erik Hermansen
--
"It has been just so in all my inventions. The first step is an
intuition--and comes with a burst, then difficulties arise. This thing
gives out and then that--'Bugs'--as such little faults and
difficulties are called--show themselves and months of anxious
watching, study and labor are requisite before commercial success--or
failure--is certainly reached" -- Thomas Edison in a letter to
Theodore Puskas on November 18, 1878