yegg on 31 Jul 2008 08:49:22 -0700
I'm interested in getting back into this stuff, so I messed around with this for a while, though I'm not sure how useful my results are... I'm a little confused by your terms, i.e. what "identify Person Y" means. It seems to mean wanting to follow Person Y on Twitter. I'm also a little confused about the universe here. It seems to be all Twitter users, e.g.:

P(Toby) = probability you want to follow Toby on Twitter, given you are a Twitter user.
P(lambda) = probability someone is in the Philly Lambda group, given they are on Twitter.

Bayes' theorem is just a relation between conditional probabilities. So it is helpful when you have an a priori belief (a default probability or hypothesis) and you want to use an observation to revise that probability, given that you know the other conditional probability. But in your case, you asked what you wanted to know directly, i.e. how many of the Twitter users in Philly Lambda are following Toby. And you got an answer, i.e. 3/5 = 0.6, which seems to be your best guess for whether a random member you don't know about would want to follow Toby (given no other factors), i.e. P(Toby|lambda).

So, given what you asked, you seem more primed to calculate the other conditional probability:

P(lambda|Toby) = P(lambda)*P(Toby|lambda)/P(Toby)

where P(lambda|Toby) = P(someone wants to join Philly Lambda, given they follow Toby on Twitter). With P(Toby) = 144 followers / 1M Twitter users = 0.000144 and P(lambda) = 75 members / 1M Twitter users = 0.000075 (assuming everyone is on Twitter), that gives:

P(lambda|Toby) = 0.000075 * 0.6 / 0.000144 = 0.3125

If you go back the other way, you could have taken a reasonable a priori guess, e.g. P(Toby) = 0.000144, which is of course really low. Now you observe that someone is in Philly Lambda, so you want to revise P(Toby) accordingly. P(Toby|lambda) = P(someone wants to follow Toby on Twitter, given they are also in Philly Lambda (and on Twitter)):

P(Toby|lambda) = P(Toby)*P(lambda|Toby)/P(lambda)

where P(lambda|Toby) = *guess* 30 members / 144 followers = 0.2083 (presumably scaled up from a sampling). So P(Toby|lambda) = 0.4. Of course you could have just done 30/75. (There's a quick numeric check of this arithmetic below.)

The point here is that Bayes' theorem, applied in one case by itself, isn't going to tell you much that you didn't already know. However, there are at least a couple of things you could do with it:

1) Calculate P(Toby|x) for an array of characteristics x and see which ones revise your original estimate of P(Toby) the most. That will tell you which factors seem to have the most impact on wanting to follow Toby.

2) Incorporate all these factors into a naive Bayes classifier (http://en.wikipedia.org/wiki/Naive_Bayes_classifier, http://www.statsoft.com/textbook/stnaiveb.html). I don't have much experience with this, but it seems it will help you incorporate all your factors into one final probability, given the observation of those factors, through repeated applications of Bayes' theorem. (There's a rough sketch of this below as well.)

The other approach would be frequentist. You have a set of variables and want to predict another given those variables--that's regression analysis (http://en.wikipedia.org/wiki/Regression_analysis). The simplest method would be least squares linear regression (also sketched below). Of course, if the problem isn't linear or has other problems, e.g. outliers or missing data, linear regression might not be right. But it's a good start, and you can branch out to other techniques from there.
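To make the arithmetic above easy to check, here's a quick numeric sketch in Python using only the figures from this thread (1M Twitter users, 144 followers of Toby, 75 Philly Lambda members, the observed 3/5, and the guessed 30/144):

    # Numeric check of the Bayes' theorem arithmetic above,
    # using the figures from this thread.
    total_users = 1000000.0
    p_toby = 144 / total_users            # prior: a random user follows Toby
    p_lambda = 75 / total_users           # prior: a random user is in Philly Lambda
    p_toby_given_lambda = 3 / 5.0         # observed among the members who answered

    # P(lambda|Toby) = P(lambda) * P(Toby|lambda) / P(Toby)
    p_lambda_given_toby = p_lambda * p_toby_given_lambda / p_toby
    print(p_lambda_given_toby)            # 0.3125

    # Going the other way with the guessed 30/144 figure:
    p_lambda_given_toby_guess = 30 / 144.0                           # ~0.2083
    p_toby_given_lambda_rev = p_toby * p_lambda_given_toby_guess / p_lambda
    print(p_toby_given_lambda_rev)        # ~0.4, same as just doing 30/75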
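And here's a minimal sketch of the naive Bayes idea in 2), trained on just the four people in Steve's example who responded, with state and primary user group as the attributes and "follows Toby" as the label. The add-one smoothing and the assumption of two possible values per attribute are simplifications I've picked for the sketch, not anything prescribed by the links above:

    from collections import defaultdict

    # Training rows: (attributes, follows_toby) for the four people who
    # responded in Steve's contrived example.
    training = [
        ({"state": "PA", "group": "Philly Lambda"}, True),   # Steve
        ({"state": "PA", "group": "Philly Lambda"}, True),   # Kyle
        ({"state": "NJ", "group": "Philly .NET"},   False),  # George
        ({"state": "NJ", "group": "Philly Lambda"}, True),   # Aaron
    ]

    def train(rows):
        label_counts = defaultdict(int)
        feature_counts = defaultdict(int)  # (label, attribute, value) -> count
        for attrs, label in rows:
            label_counts[label] += 1
            for attr, value in attrs.items():
                feature_counts[(label, attr, value)] += 1
        return label_counts, feature_counts

    def score(label_counts, feature_counts, attrs, label):
        # P(label) * product over attributes of P(value | label), with add-one
        # smoothing; the "+ 2.0" assumes two possible values per attribute,
        # which is just a simplification for this sketch.
        total = sum(label_counts.values())
        p = float(label_counts[label]) / total
        for attr, value in attrs.items():
            p *= (feature_counts[(label, attr, value)] + 1.0) / (label_counts[label] + 2.0)
        return p

    label_counts, feature_counts = train(training)
    jonathan = {"state": "NJ", "group": "Philly Lambda"}
    scores = {label: score(label_counts, feature_counts, jonathan, label)
              for label in (True, False)}
    norm = sum(scores.values())
    for label in scores:
        print(label, scores[label] / norm)  # normalized "probability" Jonathan follows Toby

With these four rows it comes out at roughly 0.8 for "Jonathan follows Toby," which is in the same ballpark as the 3/5 and 30/75 figures above.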
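For the frequentist/regression route, here's a least-squares sketch along the same lines. The dummy-coded 0/1 columns (intercept, is_PA, is_philly_lambda) and the use of numpy are my own choices for the sketch, and with only four rows the fit obviously isn't meaningful, but it shows the mechanics:

    import numpy as np

    # Rows correspond to Steve, Kyle, George, and Aaron from the post,
    # with the categorical attributes dummy-coded as 0/1 columns and
    # y = 1 if the person follows Toby, 0 otherwise.
    X = np.array([
        [1.0, 1.0, 1.0],   # intercept, is_PA, is_philly_lambda  (Steve)
        [1.0, 1.0, 1.0],   # Kyle
        [1.0, 0.0, 0.0],   # George (NJ, Philly .NET)
        [1.0, 0.0, 1.0],   # Aaron (NJ, Philly Lambda)
    ])
    y = np.array([1.0, 1.0, 0.0, 1.0])

    coef, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    print(coef)

    # Predicted value for Jonathan (NJ, Philly Lambda):
    print(np.dot([1.0, 0.0, 1.0], coef))

In a real dataset you'd dummy-code every category (city, state, each user group) and then run straight into exactly the model selection issues below.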
Another thing you might want to look at is model selection, which covers techniques for making sure your chosen variables are actually appropriate inputs to regressions or other algorithms, e.g. that they aren't significantly covariant with other variables or otherwise irrelevant.

On Jul 30, 9:24 pm, "Steve Eichert" <steve.eich...@gmail.com> wrote:
> Hey All,
>
> I recently read Collective Intelligence and it sparked a lot of interest for me in machine learning. I'm having some trouble figuring out how to make the leap from what's discussed in the book to other real-world examples. This is a contrived example, but humor me :) I'd love some help from those in the group in understanding the different methods discussed in CI, since I'm not making out that well on my own.
>
> So, onto my contrived example. Let's say I have a list of people along with some attributes (city, state, UG affiliation) about the people. A sample is below in CSV format:
>
> Name, City, State, Primary User Group Affiliation
> Steve, Jenkintown, PA, Philly Lambda
> Kyle, Somewhere, PA, Philly Lambda
> George, Elsewhere, NJ, Philly .NET
> Randy, Landsdale, PA, Philly on Rails
> Aaron, Collingswood, NJ, Philly Lambda
> Toby, Topsecretville, PA, Philly Lambda
> Jonathan, PLPatternville, NJ, Philly Lambda
>
> I've asked all these people who they follow on Twitter. I hear back from some people and not others. The data I did receive is below:
>
> Person, Followers (pipe separated)
> Steve, Toby|Kyle|Aaron
> Kyle, Toby|Jonathan|Andrew
> George, Blah
> Aaron, Toby
>
> Again, please forgive the contrived example. What I would like to be able to do is figure out the probability that someone who didn't respond would follow a person followed by one of the people who did respond. The theory is that by looking at the common attributes of the people who are following a particular person, you may be able to assume that someone else with the same, or similar, attributes would also follow that person.
>
> For example, in the example dataset, we see that Steve, Kyle, and Aaron all belong to Philly Lambda and they all follow Toby on Twitter. Given this, how could we calculate the probability/likelihood that Jonathan follows Toby on Twitter, given that he also listed Philly Lambda as his primary user group? Taking this to the next step, given all the attributes that we have (city, state, UG), how can we figure out the overall probability given all the attributes? And secondarily, how could we identify the best attribute for predicting whether or not someone would follow someone else on Twitter?
>
> I was originally experimenting with Bayes' theorem (http://en.wikipedia.org/wiki/Bayes'_theorem), but after spending a little bit of time I'm either not smart enough to know how it could be applied (very likely), or it's not a good candidate. How would you go about solving/figuring this out?
>
> With Bayes I was trying to take the following approach:
>
> formula: P(A|B) = P(B|A)*P(A)/P(B)
>
> What's the probability that Person X would identify Person Y, given that Person X is in the Philly Lambda user group?
>
> A = Person X will identify Person Y
> B = Person X is in the Philly Lambda user group
>
> However, in order to take this approach I believe I would need to know the probability that Person X will identify Person Y, which is what I'm trying to figure out. I'm pretty clueless, and this whole experience has made me realize that whatever math skills I previously had have gone down the tubes. It would be noble of you to help me get on the path towards getting them back!
>
> I realize this is pretty off topic, so if people would prefer the discussion happen off list, let me know. I wanted to drop a note here because I figured that, out of the different groups which I have connections with, this would be the most likely to have someone who may be able to assist. I also realize that this is probably pretty straightforward to some of you; however, for someone (me) who has had their brain melted by monotonous data collection applications, it requires someone with more intellect to help :)
>
> Cheers,
> Steve