Steve Eichert on 31 Jul 2008 11:57:06 -0700



Re: collective intelligence - bayes theorem help

  • From: "Steve Eichert" <steve.eichert@gmail.com>
  • To: philly-lambda@googlegroups.com
  • Subject: Re: collective intelligence - bayes theorem help
  • Date: Thu, 31 Jul 2008 14:56:58 -0400
  • Mailing-list: list philly-lambda@googlegroups.com; contact philly-lambda+owner@googlegroups.com
  • Reply-to: philly-lambda@googlegroups.com
  • Sender: philly-lambda@googlegroups.com

I'm sure my terms are quite confusing, since the example is somewhat contrived.  Let me restate my goal to see if it clarifies anything.

I want to determine the probability that one person (Jonathan) will follow another person (Toby) on Twitter by examining the attributes of those people who we know follow Toby (Steve, Kyle, Aaron).  I think the simplest way to go about doing this without getting wrapped up in theorems and such is to figure out what the probability is that a person with a particular attribute follows Toby.  So to do this, I would get a count of all the people who follow Toby with that particular attribute, and divide it by the total number of people that responded with that attribute (regardless of whether they follow Toby) to determine the probability of someone with that attribute following Toby.  In our sample that would be:

Review of data (with responded attribute added)

Name, City, State, Primary User Group Affiliation, Responded
Steve, Jenkintown, PA, Philly Lambda, Y
Kyle, Somewhere, PA, Philly Lambda, Y
George, Elsewhere, NJ, Philly .NET, Y
Randy, Landsdale, PA, Philly on Rails, Y
Aaron, Collingswood, NJ, Philly Lambda, Y
Toby, Topsecretville, PA, Philly Lambda, Y
Jonathan, PLPatternville, NJ, Philly Lambda, N

Person, Followers
Steve, Toby|Kyle|Aaron
Kyle, Toby|Jonathan|Andrew
George, Blah
Aaron, Toby

3 people in lambda who follow Toby / 4 people in lambda who responded

Since 1 of the 4 is Toby himself, and he can't follow himself I think we'd have:

3/3 = 1 = 100%

Given this, if we want to know the probability that Jonathan would follow Toby, given that he's in lambda, we'd say given our existing data the probability is 1 or 100% likely.

So, I'd also want to figure this out for the other attributes we have (city, state).  So if I want to know the probability of Jonathan following Toby given that he's from NJ I'd look at the total number of people following Toby from NJ and divide that by the total number of people who responded that are from NJ.

1 person from NJ who follows Toby (Aaron) / 2 people responded who live in NJ.

1/2 = .5

So there would be a 50% chance that Jonathan follows Toby given that he's from NJ.  So from what I understand, in order to find the probability that Jonathan follows Toby given that he's in Philly Lambda and he's from NJ, I would multiply the probabilities of each together.

1 * .5 = 50%

So I think that I could say that there's a 50% chance that a person from NJ and in Philly Lambda follows Toby.  Is that correct given this simplistic approach, or am I doing something wrong?
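
To make the counting above concrete, here is a minimal Python sketch of the same per-attribute approach. The names (PEOPLE, FOLLOWS, follow_rate) are made up for illustration, and it assumes that respondents with no row in the follow data (Randy, Toby) follow nobody listed. It only reproduces the simple counting described here; it does not settle whether multiplying the two rates is statistically sound.

```python
# Sample data from this message: (name, city, state, group, responded)
PEOPLE = [
    ("Steve", "Jenkintown", "PA", "Philly Lambda", True),
    ("Kyle", "Somewhere", "PA", "Philly Lambda", True),
    ("George", "Elsewhere", "NJ", "Philly .NET", True),
    ("Randy", "Landsdale", "PA", "Philly on Rails", True),
    ("Aaron", "Collingswood", "NJ", "Philly Lambda", True),
    ("Toby", "Topsecretville", "PA", "Philly Lambda", True),
    ("Jonathan", "PLPatternville", "NJ", "Philly Lambda", False),
]

# Who each respondent follows. Randy and Toby have no row in the data,
# so they are treated as following nobody listed (an assumption).
FOLLOWS = {
    "Steve": {"Toby", "Kyle", "Aaron"},
    "Kyle": {"Toby", "Jonathan", "Andrew"},
    "George": {"Blah"},
    "Aaron": {"Toby"},
}

FIELDS = {"city": 1, "state": 2, "group": 3}


def follow_rate(field, value, target):
    """Fraction of respondents with field == value who follow target.

    The target is excluded from the pool, since he cannot follow himself.
    Returns None when nobody with that attribute responded.
    """
    idx = FIELDS[field]
    pool = [p[0] for p in PEOPLE
            if p[4] and p[idx] == value and p[0] != target]
    if not pool:
        return None
    hits = sum(1 for name in pool if target in FOLLOWS.get(name, set()))
    return hits / len(pool)


p_lambda = follow_rate("group", "Philly Lambda", "Toby")  # 3 of 3
p_nj = follow_rate("state", "NJ", "Toby")                 # 1 of 2
combined = p_lambda * p_nj                                # the product used above
print(p_lambda, p_nj, combined)
```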

I don't know anything about the other stuff you mentioned (Bayes classifier, regression analysis) so I'll have to try and read a bit about them and see how I may be able to use them.  I also didn't really answer your questions, but hopefully that helps give you some more background on where I was approaching the problem from.  It's basically at this point that I start getting lost in how to apply Bayes (and other algorithms) to my problem.  I think some of what everyone has said thus far has sunk in a bit, so hopefully with further reflection / focus I'll find my way.

Thanks!
- Steve



On Thu, Jul 31, 2008 at 11:49 AM, yegg <gabriel.weinberg@gmail.com> wrote:

I'm interested in getting back into this stuff, so I messed around
with this for a while, though I'm not sure how useful my results
are...

I'm a little confused by your terms, i.e. what "identify Person Y"
means.  It seems to mean to want to follow person Y on twitter.  I'm
also a little confused about the universe here.  It seems that it is
all twitter users, e.g.:

P(Toby) = probability you want to follow Toby on twitter given you are
a twitter user.
P(lambda) = probability someone is in the Philly lambda group given
they are on twitter.

Bayes theorem is just a relation of conditional probabilities.  So it
is helpful when you have an a priori belief (default probability or
hypothesis), and you want to use an observation to revise that
probability given you know the other conditional probability.

But in your case, you asked what you wanted to know directly, i.e. how
many of the twitter users in lambda are following Toby.  And you got
an answer, i.e. 3/5 = .6, which seems to be your best guess for
whether a random member you don't know about would want to follow Toby
(given no other factors), e.g. P(Toby|lambda).

So, given what you asked, you seem more primed to calculate the other
conditional probability, i.e.:

P(lambda|Toby) = P(lambda)*P(Toby|lambda)/P(Toby)

where

P(lambda|Toby) = P(someone wants to join Philly lambda given they
follow Toby on twitter).
P(Toby) = 144 followers/1M twitter users = 0.000144,
P(lambda) = 75 members/1M twitter users = .000075  (assuming everyone
is on twitter),

and therefore:

P(lambda|Toby) = .000075*.6/.000144 = 0.3125.



If you go back to the other way, you could have taken a reasonable a
priori guess, e.g.

P(Toby) = 0.000144, which is of course really low.

Now you observe someone is in Philly lambda, and so you want to revise
P(Toby) accordingly.

P(Toby|lambda) = P(someone wants to follow Toby on twitter given they
are also in Philly lambda (and on twitter)).

P(Toby|lambda) = P(Toby)*P(lambda|Toby)/P(lambda) where

P(lambda|Toby) = *guess* 30 members/144 followers = 0.2083 (presumably
scaled up from a sampling).

So P(Toby|lambda) = 0.4.  Of course you could have just done 30/75.
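
Both directions of the calculation above can be checked with a few lines of Python. This is just a sketch: the helper name bayes is made up, and the inputs are the same rough figures used in this message (144 followers, 75 members, 1M twitter users, and the guessed 30 members among the followers).

```python
def bayes(prior_a, p_b_given_a, p_b):
    """Return P(A|B) = P(A) * P(B|A) / P(B)."""
    return prior_a * p_b_given_a / p_b


p_toby = 144 / 1_000_000    # P(Toby): 144 followers out of 1M twitter users
p_lambda = 75 / 1_000_000   # P(lambda): 75 members out of 1M twitter users

# First direction: P(lambda|Toby), using P(Toby|lambda) = 0.6
p_lambda_given_toby = bayes(p_lambda, 0.6, p_toby)

# Second direction: P(Toby|lambda), using the guess P(lambda|Toby) = 30/144
p_toby_given_lambda = bayes(p_toby, 30 / 144, p_lambda)

print(round(p_lambda_given_toby, 4))  # 0.3125
print(round(p_toby_given_lambda, 4))  # 0.4
```

Note that the two directions start from different inputs (0.6 versus the guessed 30/144), so they are not expected to be inverses of each other.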


The point here is that Bayes theorem isn't going to, applied in one
case by itself, tell you much that you didn't already know.  However,
there are at least a couple things you could do with it:

1) Calculate P(Toby|x) for an array of characteristics and see which
revise your original estimate of P(Toby) the greatest.  That will tell
you which factors seem to have the most impact in wanting to follow
Toby.

2) Incorporate all these factors into a Bayes classifier
(http://en.wikipedia.org/wiki/Naive_Bayes_classifier,
http://www.statsoft.com/textbook/stnaiveb.html).
I don't have much experience with this, but it seems that it will help
you incorporate all your factors into one final probability given the
observation of those factors through repeated calculations of Bayes'
theorem.
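
As a rough illustration of option 2, here is a small naive Bayes sketch in Python over the sample data from this thread. Everything about it is an assumption for illustration: the function names, the choice of state and user group as the only features, add-one (Laplace) smoothing, and treating Randy (who has no follow data) as not following Toby. With five training rows the number it produces means very little; the point is only to show how the factors get combined.

```python
from collections import Counter, defaultdict

# (features, follows_toby) for the respondents other than Toby himself.
# Randy's label is an assumption: he has no row in the follow data.
TRAINING = [
    ({"state": "PA", "group": "Philly Lambda"}, True),     # Steve
    ({"state": "PA", "group": "Philly Lambda"}, True),     # Kyle
    ({"state": "NJ", "group": "Philly Lambda"}, True),     # Aaron
    ({"state": "NJ", "group": "Philly .NET"}, False),      # George
    ({"state": "PA", "group": "Philly on Rails"}, False),  # Randy
]


def train(rows):
    """Count classes, per-class feature values, and distinct values per feature."""
    class_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)  # (label, feature) -> value counts
    vocab = defaultdict(set)             # feature -> distinct values seen
    for features, label in rows:
        for feature, value in features.items():
            value_counts[(label, feature)][value] += 1
            vocab[feature].add(value)
    return class_counts, value_counts, vocab


def posterior(model, features, alpha=1.0):
    """Return P(label | features) for each label, with add-alpha smoothing."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # class prior
        for feature, value in features.items():
            count = value_counts[(label, feature)][value]
            score *= (count + alpha) / (n + alpha * len(vocab[feature]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}


model = train(TRAINING)
jonathan = {"state": "NJ", "group": "Philly Lambda"}
result = posterior(model, jonathan)
print(round(result[True], 3))  # 0.8
```

Unlike multiplying P(Toby|attribute) values directly, this multiplies P(attribute|class) for each class, weights by the class prior, and then normalizes across the classes.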

The other approach would be frequentist.  You have a set of variables
and want to predict another given those variables--that's regression
analysis (http://en.wikipedia.org/wiki/Regression_analysis).  The
simplest method would be least squares linear regression.  Of course,
if the problem isn't linear or has other problems, e.g. outliers or
missing data, linear regression might not be right.  But it's a good
start, and you can branch out to other techniques from there.  Another
thing you might want to look at is model selection, which comprises
techniques to make sure your chosen variables are actually correct
inputs into regressions or other algorithms, e.g. aren't significantly
covariant with other variables or otherwise irrelevant.
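
For a feel of what the simplest regression looks like, here is a minimal one-variable least squares sketch in plain Python. The helper name fit_line is made up, and the data is a toy encoding of the sample from this thread: x is 1 if the respondent is in Philly Lambda, y is 1 if they follow Toby, with Randy assumed not to follow Toby since he has no follow data. On data this small and this clean the fit is trivially perfect, so treat it as a demonstration of the mechanics only.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Steve, Kyle, Aaron, George, Randy (Toby excluded as the target)
in_lambda = [1, 1, 1, 0, 0]
follows_toby = [1, 1, 1, 0, 0]  # Randy's 0 is an assumption

slope, intercept = fit_line(in_lambda, follows_toby)
predicted_for_jonathan = slope * 1 + intercept  # Jonathan is in Philly Lambda
print(round(slope, 3), round(predicted_for_jonathan, 3))
```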


On Jul 30, 9:24 pm, "Steve Eichert" <steve.eich...@gmail.com> wrote:
> Hey All,
>
> I recently read Collective Intelligence and it sparked a lot of interest for
> me in machine learning.  I'm having some trouble figuring out how to make
> the leap from what's discussed in the book to other real world examples.
> This is a contrived example but humor me :)  I'd love some help from those
> in the group in understanding the different methods discussed in CI since
> I'm not making out that well on my own.
>
> So onto my contrived example.  Lets say I have a list of people along with
> some attributes (city, state, UG affiliation) about the people.  A sample is
> below in CSV format
>
> Name, City, State, Primary User Group Affiliation
> Steve, Jenkintown, PA, Philly Lambda
> Kyle, Somewhere, PA, Philly Lambda
> George, Elsewhere, NJ, Philly .NET
> Randy, Landsdale, PA, Philly on Rails
> Aaron, Collingswood, NJ, Philly Lambda
> Toby, Topsecretville, PA, Philly Lambda
> Jonathan, PLPatternville, NJ, Philly Lambda
>
> I've asked all these people who they follow on Twitter.  I hear back from
> some people and not others.  The data I did receive is below:
>
> Person, Followers (pipe separated)
> Steve, Toby|Kyle|Aaron
> Kyle, Toby|Jonathan|Andrew
> George, Blah
> Aaron, Toby
>
> Again please forgive the contrived example.  What I would like to be able to
> do is figure out the probability that someone who didn't respond would
> follow a person followed by one of the people who did respond.  The theory
> is that by looking at the common attributes of the people who are following
> a particular person, you may be able to assume that someone else with the
> same, or similar, attributes would also follow that person.
>
> For example, in the example dataset, we see that Steve, Kyle, and Aaron all
> belong to Philly Lambda and they all follow Toby on Twitter.  Given this,
> how could we calculate the probability/likelihood that Jonathan follows Toby
> on twitter, given that he also listed Philly Lambda as his primary user
> group.  Taking this to the next step, given all the attributes that we have
> (city, state, ug) how can we figure out the overall probability given all
> the attributes.  And secondarily, how could we identify the best attribute
> for predicting whether or not someone would follow someone else on Twitter?
>
> I was originally experimenting with Bayes theorem (http://en.wikipedia.org/wiki/Bayes'_theorem), but after spending a little
> bit of time I'm either not smart enough to know how it could be applied
> (very likely), or it's not a good candidate.  How would you go about
> solving/figuring out this?
>
> With Bayes I was trying to take the following approach:
>
> formula: P(A|B) = P(B|A)*P(A)/P(B)
>
> What's the probability that Person X would identify Person Y, given that
> Person X is in the Philly Lambda user group?
>
> A = Person X will identify Person Y
> B = Person X is in the Philly Lambda user group
>
> However, in order to take this approach I believe I would need to know the
> probability that person X will identify person Y, which is what I'm trying
> to figure out.  I'm pretty clueless, and this whole experience has made me
> realize that whatever math skills I previously had have gone down the
> tubes.  It would be noble of you to help me get on the path towards getting
> them back!
>
> I realize this is pretty off topic so if people would prefer the discussion
> happen off list let me know.  I wanted to drop a note here because I figured
> out of the different groups which I have connections with, this would be the
> most likely to have someone who may be able to assist.  I also realize that
> this is probably pretty straight forward to some of you, however, for
> someone (me) who has had their brain melted by monotonous data collection
> applications it requires someone with more intellect to help :)
>
> Cheers,
> Steve