Re: [PLUG] Lingua::Identify languages

Kristian Erik Hermansen on 20 Feb 2008 13:10:16 -0800

From: "Kristian Erik Hermansen" <kristian.hermansen@gmail.com>

To: "Philadelphia Linux User's Group Discussion List" <plug@lists.phillylinux.org>

Subject: Re: [PLUG] Lingua::Identify languages

Date: Wed, 20 Feb 2008 13:10:10 -0800

Reply-to: Philadelphia Linux User's Group Discussion List <plug@lists.phillylinux.org>

Sender: plug-bounces@lists.phillylinux.org

#!/usr/bin/env python
# Kristian Hermansen <kristian.hermansen@gmail.com>
# Version: 20070305
# Markov-chain Language Learner (digrams)
# Learns a language using training data files in 'langfiles'.
# Then takes test input and determines the most likely language.
# And for fun, I added support for Latin as well ;-)

import sys
from numpy import *
from string import *

# language files with training data
#langfiles = ['english', 'french', 'german', 'italian', 'portuguese', 'spanish']
langfiles = ['english', 'french', 'german', 'italian', 'latin', 'portuguese', 'spanish']

# returns the ascii value offset for valid chars
# or 0 if not a valid learning char.
def filter_char(c):
    # 'a' == 97, so -96 gets us to 1
    offset = 96
    c = ord(lower(c)) - offset
    if c >= 1 and c <= 26:
        return c
    else:
        return 0

# create frequency/probability matrices
def make_freq_prob_mat(lang):
    # initialize the array
    # don't allow any zeros, set to 0.1
    fm = zeros( (27,27) )
    # remove zero probabilities
    #fm = fm + 0.1
    # get the data from the file
    # walk a two-char pair
    f = open(lang).read()
    cnt = 0
    while cnt < (len(f)-1):
        twochr = [f[cnt],f[cnt+1]]
        fm[filter_char(twochr[0])][filter_char(twochr[1])] += 1.0
        cnt += 1
    # probability matrix
    pm = fm / len(f)
    return (fm,pm)

# return sentences
def get_sentences(doc):
    # split into sentences
    s = doc.split('.')
    # remove meaningless cruft
    m = map(strip, s)
    # remove EOF
    m.pop()
    return m

# return weighted items from dict
def weighted_items(d):
    items = d.items()
    items = [(v, k) for (k, v) in items]
    items.sort()
    items.reverse() # so largest is first
    items = [(k, v) for (v, k) in items]
    return items

# create a 2D bar graph for graphical representation
# takes a frequency matrix as input
# <TODO>
def make_bar_graph_2d(fm):
    alpha = '*abcdefghijklmnopqrstuvwxyz'

# top-level language learning function
def learn(lang):
    (fm,pm) = make_freq_prob_mat(lang)
    return pm

# top-level language testing function
def test(file):
    f = open(file).read()
    sents = get_sentences(f)
    for s in sents:
        # avoid very short fragments
        if len(s) < 3:
            continue
        print s
        p_tot = 0
        p_rng = 0
        p_arr = []
        for p in langpms:
            ccnt = 0
            # start it off very low, to avoid zero divides [/hack]
            runprb = 0.000000000000000001
            while ccnt < (len(s)-1):
                twochr = [s[ccnt],s[ccnt+1]]
                # avoid letting percentages build from non-char data
                if filter_char(twochr[0]) == 0 or filter_char(twochr[1]) == 0:
                    ccnt += 1
                    continue
                else:
                    runprb += p[filter_char(twochr[0])][filter_char(twochr[1])]
                    ccnt += 1
            p_tot += runprb
            p_rng += 1
            p_arr.append(runprb)
        i_rng = 0
        dict = {}
        for i in p_arr:
            dict[langfiles[i_rng]] = [i / p_tot]
            i_rng += 1
        items = weighted_items(dict)
        # most likely candidate
        print '\t*', upper(items[0][0]), '=>', items[0][1]
        # the other candidates
        for i in items:
            if i[0] == items[0][0]:
                continue
            else:
                print '\t-', i[0], '=>', i[1]
        print ''

# our languages
langpms = []
for l in langfiles:
    langpms.append(learn(l))

if len(sys.argv) != 2:
    print 'usage:', sys.argv[0], 'test-data'
else:
    test(sys.argv[1])

Just be aware that this code will not detect the sentences as
intelligently as possible, but it still has decent accuracy. You should
be able to differentiate between the sentence "I love you" in numerous
languages with great accuracy. Markov is a great AI technique, but I
kinda half-assed it in this code :-P

You need the data files to run this, but looking at the code can give
you an idea of how it works. Email me privately and I can send you all
the data if you really want to see it in action...
-- 
Kristian Erik Hermansen
--
"It has been just so in all my inventions. The first step is an
intuition--and comes with a burst, then difficulties arise. This thing
gives out and then that--'Bugs'--as such little faults and
difficulties are called--show themselves and months of anxious
watching, study and labor are requisite before commercial success--or
failure--is certainly reached" -- Thomas Edison in a letter to Theodore
Puskas on November 18, 1878
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug
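For anyone reading along without the training files, here is a rough Python 3 sketch of the same digram idea. It is not the script above: the SAMPLES text is a made-up stand-in for the per-language data files, the function names are only loosely borrowed, and it normalizes by the number of letter pairs rather than by file length. Like the original, it sums digram probabilities per sentence and then reports each language's share of the total.

```python
from collections import Counter

# Hypothetical stand-in training text (assumption); the original script
# reads one training file per language instead.
SAMPLES = {
    "english": "i love you and the weather is nice today. "
               "the quick brown fox jumps over the lazy dog.",
    "spanish": "te quiero mucho y hoy hace buen tiempo. "
               "el perro come en la casa de mi hermano.",
    "german": "ich liebe dich und das wetter ist heute gut. "
              "der hund ist in dem haus.",
}


def digrams(text):
    """Yield adjacent letter pairs, skipping any pair with a non a-z char."""
    text = text.lower()
    for a, b in zip(text, text[1:]):
        if "a" <= a <= "z" and "a" <= b <= "z":
            yield a + b


def learn(text):
    """Map each digram seen in the training text to its relative frequency."""
    counts = Counter(digrams(text))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def classify(sentence, models):
    """Return (language, share) pairs, most likely language first."""
    # Sum the learned probability of every digram in the sentence.
    scores = {
        lang: sum(model.get(pair, 0.0) for pair in digrams(sentence))
        for lang, model in models.items()
    }
    # Fall back to 1.0 so a sentence with no known digrams does not divide by zero.
    total = sum(scores.values()) or 1.0
    ranked = sorted(((s / total, lang) for lang, s in scores.items()), reverse=True)
    return [(lang, share) for share, lang in ranked]


models = {lang: learn(text) for lang, text in SAMPLES.items()}

for sentence in ("I love you", "Ich liebe dich", "Te quiero"):
    print(sentence, "->", classify(sentence, models))
```

With samples this small the ranking is only meaningful for phrases close to the training text; real training files per language would be needed for anything like the accuracy described above.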

References:

[PLUG] Lingua::Identify languages
From: Walt Mankowski <waltman@pobox.com>

Re: [PLUG] Lingua::Identify languages
From: "Kristian Erik Hermansen" <kristian.hermansen@gmail.com>
