yegg on 31 Jul 2008 08:49:22 -0700



Re: collective intelligence - bayes theorem help

  • From: yegg <gabriel.weinberg@gmail.com>
  • To: Philly Lambda <philly-lambda@googlegroups.com>
  • Subject: Re: collective intelligence - bayes theorem help
  • Date: Thu, 31 Jul 2008 08:49:14 -0700 (PDT)
  • Mailing-list: list philly-lambda@googlegroups.com; contact philly-lambda+owner@googlegroups.com
  • Reply-to: philly-lambda@googlegroups.com
  • Sender: philly-lambda@googlegroups.com
  • User-agent: G2/1.0

I'm interested in getting back into this stuff, so I messed around
with this for a while, though I'm not sure how useful my results
are...

I'm a little confused by your terms, i.e. what "identify Person Y"
means.  It seems to mean that Person X wants to follow Person Y on
twitter.  I'm also a little confused about the universe here.  It
seems to be all twitter users, e.g.:

P(Toby) = probability you want to follow Toby on twitter given you are
a twitter user.
P(lambda) = probability someone is in the Philly lambda group given
they are on twitter.

Bayes' theorem is just a relation between conditional probabilities.
So it is helpful when you have an a priori belief (a default
probability, or hypothesis) and you want to use an observation to
revise that probability, given that you know the other conditional
probability.

But in your case, you asked what you wanted to know directly, i.e. how
many of the twitter users in lambda are following Toby.  And you got
an answer, i.e. 3/5 = .6, which seems to be your best guess for the
probability that a random member you don't know about would want to
follow Toby (given no other factors), i.e. P(Toby|lambda).

So, given what you asked, you seem more primed to calculate the other
conditional probability, i.e.:

P(lambda|Toby) = P(lambda)*P(Toby|lambda)/P(Toby)

where

P(lambda|Toby) = P(someone wants to join Philly lambda given they
follow Toby on twitter).
P(Toby) = 144 followers/1M twitter users = 0.000144,
P(lambda) = 75 members/1M twitter users = .000075  (assuming everyone
is on twitter),

and therefore:

P(lambda|Toby) = .000075*.6/.000144 = 0.3125.
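
Just to make the arithmetic concrete, here it is as a few lines of
Python; the 1M-user universe and the 75/144 counts are the same rough
guesses as above, not real figures:

# P(lambda|Toby) via Bayes' theorem, using the rough guesses above.
twitter_users = 1000000            # assumed size of the universe
p_toby = 144 / twitter_users       # P(Toby): Toby's 144 followers
p_lambda = 75 / twitter_users      # P(lambda): 75 members, all assumed on twitter
p_toby_given_lambda = 3 / 5        # the 3/5 = .6 answer from above

p_lambda_given_toby = p_lambda * p_toby_given_lambda / p_toby
print(p_lambda_given_toby)         # 0.3125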



Going the other way, you could have started from a reasonable a priori
guess, e.g.

P(Toby) = 0.000144, which is of course really low.

Now you observe someone is in Philly lambda, and so you want to revise
P(Toby) accordingly.

P(Toby|lambda) = P(someone wants to follow Toby on twitter given they
are also in Philly lambda (and on twitter)).

P(Toby|lambda) = P(Toby)*P(lambda|Toby)/P(lambda) where

P(lambda|Toby) = *guess* 30 members/144 followers = 0.2083 (presumably
scaled up from a sampling).

So P(Toby|lambda) = 0.4.  Of course you could have just done 30/75.
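
The same revision as code, with the guessed 30/144 plugged in (again
Python, and again all of the counts are made up):

# Revising the prior P(Toby) after observing membership in Philly lambda.
twitter_users = 1000000
p_toby = 144 / twitter_users        # prior: P(Toby)
p_lambda = 75 / twitter_users       # P(lambda)
p_lambda_given_toby = 30 / 144      # *guess*: 30 of Toby's 144 followers are members

p_toby_given_lambda = p_toby * p_lambda_given_toby / p_lambda
print(p_toby_given_lambda)          # 0.4 -- the same as the shortcut 30/75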


The point here is that Bayes' theorem, applied in one case by itself,
isn't going to tell you much that you didn't already know.  However,
there are at least a couple of things you could do with it:

1) Calculate P(Toby|x) for an array of characteristics x and see which
ones revise your original estimate of P(Toby) the most.  That will
tell you which factors seem to have the most impact on wanting to
follow Toby (see the sketch after item 2).

2) Incorporate all these factors into a Bayes classifier
(http://en.wikipedia.org/wiki/Naive_Bayes_classifier,
http://www.statsoft.com/textbook/stnaiveb.html).  I don't have much
experience with this, but it seems that it will help you incorporate
all your factors into one final probability given the observation of
those factors through repeated calculations of Bayes' theorem.
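
To make 1) and 2) a bit more concrete, here's a rough Python sketch
over the contrived dataset from your post (responders only).  The
attribute encoding, the +1 (Laplace) smoothing, and the
independence-between-attributes assumption are all mine; with real
data you'd plug in the full counts:

# Responders from the contrived dataset: (state, user group, follows Toby?)
people = {
    "Steve":  ("PA", "Philly Lambda", True),
    "Kyle":   ("PA", "Philly Lambda", True),
    "George": ("NJ", "Philly .NET",   False),
    "Aaron":  ("NJ", "Philly Lambda", True),
}

# 1) P(Toby|x) for each attribute value x, versus the prior P(Toby).
prior = sum(p[2] for p in people.values()) / len(people)
print("prior P(Toby):", prior)
for index, label in [(0, "state"), (1, "group")]:
    for value in sorted({p[index] for p in people.values()}):
        matching = [p for p in people.values() if p[index] == value]
        print("P(Toby | %s=%s) = %.2f"
              % (label, value, sum(p[2] for p in matching) / len(matching)))

# 2) Naive-Bayes-style combination: multiply the per-attribute
# likelihoods, assuming the attributes are independent given the class,
# with +1 smoothing so an unseen combination doesn't zero everything out.
def follows_toby_probability(state, group):
    followers = [p for p in people.values() if p[2]]
    others = [p for p in people.values() if not p[2]]
    p_yes = len(followers) / len(people)
    p_no = len(others) / len(people)
    for index, value in ((0, state), (1, group)):
        p_yes *= (sum(1 for p in followers if p[index] == value) + 1) / (len(followers) + 2)
        p_no *= (sum(1 for p in others if p[index] == value) + 1) / (len(others) + 2)
    return p_yes / (p_yes + p_no)

# Jonathan: NJ, Philly Lambda, didn't respond.
print("P(Jonathan follows Toby):", follows_toby_probability("NJ", "Philly Lambda"))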

The other approach would be frequentist.  You have a set of variables
and want to predict another given those variables--that's regression
analysis (http://en.wikipedia.org/wiki/Regression_analysis).  The
simplest method would be least squares linear regression.  Of course,
if the relationship isn't linear or the data has other problems, e.g.
outliers or missing values, linear regression might not be right.  But
it's a good start, and you can branch out to other techniques from
there.  Another thing you might want to look at is model selection,
which comprises techniques to make sure your chosen variables are
actually appropriate inputs into regressions or other algorithms,
e.g. aren't strongly correlated with other variables or otherwise
irrelevant.
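
For the regression route, here's a minimal least-squares sketch with
numpy over the same toy data.  The 0/1 dummy coding of the attributes
is just an illustration (and for a yes/no outcome you'd probably move
to logistic regression fairly quickly):

import numpy as np

# One row per responder: [intercept, is_PA, is_Philly_Lambda].
X = np.array([
    [1.0, 1.0, 1.0],   # Steve:  PA, Philly Lambda
    [1.0, 1.0, 1.0],   # Kyle:   PA, Philly Lambda
    [1.0, 0.0, 0.0],   # George: NJ, Philly .NET
    [1.0, 0.0, 1.0],   # Aaron:  NJ, Philly Lambda
])
y = np.array([1.0, 1.0, 0.0, 1.0])   # 1 = follows Toby, 0 = doesn't

# Ordinary least squares: minimize ||X b - y||^2 for the coefficients b.
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print("coefficients (intercept, is_PA, is_lambda):", coef)

# Score a non-responder, e.g. Jonathan (NJ, Philly Lambda).
jonathan = np.array([1.0, 0.0, 1.0])
print("predicted score for Jonathan:", jonathan @ coef)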


On Jul 30, 9:24 pm, "Steve Eichert" <steve.eich...@gmail.com> wrote:
> Hey All,
>
> I recently read Collective Intelligence and it sparked a lot of interest for
> me in machine learning.  I'm having some trouble figuring out how to make
> the leap from what's discussed in the book to other real world examples.
> This is a contrived example but humor me :)  I'd love some help from those
> in the group in understanding the different methods discussed in CI since
> I'm not making out that well on my own.
>
> So onto my contrived example.  Let's say I have a list of people along with
> some attributes (city, state, UG affiliation) about the people.  A sample is
> below in CSV format
>
> Name, City, State, Primary User Group Affiliation
> Steve, Jenkintown, PA, Philly Lambda
> Kyle, Somewhere, PA, Philly Lambda
> George, Elsewhere, NJ, Philly .NET
> Randy, Landsdale, PA, Philly on Rails
> Aaron, Collingswood, NJ, Philly Lambda
> Toby, Topsecretville, PA, Philly Lambda
> Jonathan, PLPatternville, NJ, Philly Lambda
>
> I've asked all these people who they follow on Twitter.  I hear back from
> some people and not others.  The data I did receive is below:
>
> Person, Followers (pipe separated)
> Steve, Toby|Kyle|Aaron
> Kyle, Toby|Jonathan|Andrew
> George, Blah
> Aaron, Toby
>
> Again please forgive the contrived example.  What I would like to be able to
> do is figure out the probability that someone who didn't respond would
> follow a person followed by one of the people who did respond.  The theory
> is that by looking at the common attributes of the people who are following
> a particular person, you may be able to assume that someone else with the
> same, or similar, attributes would also follow that person.
>
> For example, in the example dataset, we see that Steve, Kyle, and Aaron all
> belong to Philly Lambda and they all follow Toby on Twitter.  Given this,
> how could we calculate the probability/likelihood that Jonathan follows Toby
> on twitter, given that he also listed Philly Lambda as his primary user
> group?  Taking this to the next step, given all the attributes that we have
> (city, state, ug), how can we figure out the overall probability?  And
> secondarily, how could we identify the best attribute for predicting
> whether or not someone would follow someone else on Twitter?
>
> I was originally experimenting with Bayes theorem (http://en.wikipedia.org/wiki/Bayes'_theorem), but after spending a little
> bit of time I'm either not smart enough to know how it could be applied
> (very likely), or it's not a good candidate.  How would you go about
> solving/figuring this out?
>
> With Bayes I was trying to take the following approach:
>
> formula: P(A|B) = P(B|A)*P(A)/P(B)
>
> What's the probability that Person X would identify Person Y, given that
> Person X is in the Philly Lambda user group?
>
> A = Person X will identify Person Y
> B = Person X is in the Philly Lambda user group
>
> However, in order to take this approach I believe I would need to know the
> probability that person X will identify person Y, which is what I'm trying
> to figure out.  I'm pretty clueless, and this whole experience has made me
> realize that whatever math skills I previously had have gone down the
> tubes.  It would be noble of you to help me get on the path towards getting
> them back!
>
> I realize this is pretty off topic so if people would prefer the discussion
> happen off list let me know.  I wanted to drop a note here because I figured
> out of the different groups which I have connections with, this would be the
> most likely to have someone who may be able to assist.  I also realize that
> this is probably pretty straightforward to some of you; however, for
> someone (me) who has had their brain melted by monotonous data collection
> applications it requires someone with more intellect to help :)
>
> Cheers,
> Steve