Statistical Variability: January 2011

Monday, January 24, 2011

The Black Sheep

A Mechanical Engineer, a Physicist and a Mathematician are on a train going to a conference in Scotland. As they come out of a tunnel, just entering Scotland, the Engineer spots a sheep on a hill.

"Hey, look guys, a black sheep. I guess all sheep in Scotland are black," says the Engineer.

"No, silly, just that one sheep in Scotland is black," says the Physicist.

"You're both wrong, always generalizing," says the Mathematician. "We only know there is one side of one sheep in Scotland that is black."

Friday, January 21, 2011

Let's make a deal

Back in the 1970s there was a game show called Let's Make a Deal, hosted by Monty Hall, which created quite a stir when statistics got involved.

The game show went like this. There are three doors; behind one of the doors is a brand new car (or a similarly nice prize) and behind each of the other two is a goat. And no, you don't want a goat. The contestant is then asked to pick a door. Monty Hall would then open one of the two remaining doors to reveal one of the goats. This left two unopened doors.
[Figure: Monty opens a door]

The challenge now is: "Would you like to stay with the door you first chose, or switch to the other unopened door?" The statistics question is: what is the probability that the other unopened door has the car behind it?

Initially people thought that it was a simple 50:50 bet. There are two doors; one has a car, one does not. But this is why we have statistics: to go beyond common sense and discover strange underlying probabilities. The real controversy started with Marilyn vos Savant's "Ask Marilyn" column in Parade magazine in 1990. She stated that one should always switch. Numerous readers, including PhDs and respected statisticians, wrote in to say her "switch" tactic was wrong. Here's why she is right.

Let's say you pick door number one. The probability the car is behind it is then 1/3, since one door has a car and two have goats. Therefore the probability that one of the other two doors (2 or 3) hides the car is 2/3. When a goat is revealed to be behind door number 2, that probability does not change. There is still a 2/3 probability that the car is behind door number 2 or 3. Obviously you won't pick 2, so the probability of door number 3 having the car is 2/3.

Seen another way, all possible setups of the doors are (C = Car, G = Goat):

CGG

GCG

GGC

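If we pick the first door, then Monty would open these doors: in the first setup (CGG) either door 2 or door 3, since both hide goats; in the second (GCG) door 3; and in the third (GGC) door 2.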

So if we switch our door in the first setup, we lose. If we switch in the second we win, and in the third we win. Therefore we win 2/3 of the time. If you don't believe me, here's a link to an applet: http://math.ucsd.edu/~crypto/Monty/monty.html. You can play the game to your heart's content, and as you'll notice, you'll win 2/3 of the time if you switch your door. Ain't stats fun.
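If you'd rather check the argument yourself than trust an applet, here is a minimal sketch in Python. It assumes the standard rules described above: Monty always opens a goat door that is not your pick, choosing at random when he has two options. The function names (`play`, `win_rate`, `exact_switch_probability`) and the seed are my own choices for illustration, not part of the game show or the applet.

```python
import random
from fractions import Fraction


def play(switch, rng):
    """Play one round; return True if the contestant ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # Monty opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one door that is neither the first pick nor the opened one.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, rounds=100_000, seed=2011):
    """Estimate the winning proportion by simulation (seed is arbitrary)."""
    rng = random.Random(seed)
    wins = sum(play(switch, rng) for _ in range(rounds))
    return wins / rounds


def exact_switch_probability():
    """Count the setups CGG, GCG, GGC with the contestant on the first door."""
    setups = ["CGG", "GCG", "GGC"]
    # Switching wins exactly when the first door hides a goat.
    wins = sum(1 for s in setups if s[0] == "G")
    return Fraction(wins, len(setups))


if __name__ == "__main__":
    print("exact, switching:    ", exact_switch_probability())
    print("simulated, switching:", round(win_rate(True), 3))
    print("simulated, staying:  ", round(win_rate(False), 3))
```

The simulated figures wobble a little from run to run if you change the seed, which is rather the point of this blog: the estimate comes with its own variability.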

Tuesday, January 18, 2011

What is Probability

Probability is the likelihood that an event will happen, or has happened. In order to project into the future we pretend everything has happened and see what proportion were a success.

In everyday language we often express this as a percent, because most people have difficulty working with decimals. No other reason. Nine out of nine is 1; 100% is just another way of writing that same number. While we're on the topic, be careful with the conversion to percent: the multiplication by 100 and the % sign undo each other. Per cent means per 100, therefore (9/9)x100% is 1 times 100 per 100, which is 1. Try it on your calculator: 1 times 100, then %.

The way that we use percentages and probabilities also differs slightly in everyday language. We talk about percentages as the proportion of successes in the total, but a probability often refers to the proportion of times a single event will be a success. The meaning is just ever so slightly different; the result is often the same, but the path to get there is different.

In school we say that a test that got 13 out of 100 questions right got 13%. And could have been written by a monkey guessing randomly. But that's wrong... the bit about the monkey. No, the monkey couldn't have guessed randomly, and certainly neither could a student, but that's another story. The 13% is right though. They got 13% of the questions right. But that doesn't really tell us anything; it just clumps all the data into one point. In probability we would say that the probability of any particular question being right is 13%. Let's think about that. If I pick one question from the test, there is a 13% chance that it is right.

This shift in thinking is one of the first challenges faced by all students tackling their first statistics course. Fundamentally we want to know the behaviour of a given individual. But we can only determine this by figuring out the behaviour of the entire population and hoping that the individual will land near an "average" individual from the population. Once you get past this, many of the "calculations" of higher order probabilities are quite simple.

If I roll 2 dice, what is the probability that one is a 4 given that they add to 9? "Not a clue, no idea where to start." Well, think about one instance, and start with what we know. We know they add to 9, therefore the dice must be one of: 4,5; 5,4; 6,3; 3,6. That's 4 different pairs. Which of them have a 4? Two of them. Therefore the probability that one is a 4 given that they add to 9 is 2/4, or one half.
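The same counting can be handed to a computer. This is only a sketch: the helper name `conditional_probability` and its parameters are mine, and it assumes fair dice so that every ordered pair is equally likely.

```python
from fractions import Fraction
from itertools import product


def conditional_probability(event, given, faces=6, dice=2):
    """P(event | given), found by listing every equally likely roll."""
    rolls = list(product(range(1, faces + 1), repeat=dice))
    # Keep only the rolls consistent with what we know...
    possible = [r for r in rolls if given(r)]
    # ...and count how many of those are successes.
    successes = [r for r in possible if event(r)]
    return Fraction(len(successes), len(possible))


p = conditional_probability(
    event=lambda r: 4 in r,        # at least one die shows a 4
    given=lambda r: sum(r) == 9,   # the two dice add to 9
)
print(p)  # 1/2
```

Swapping in a different `event` or `given` answers other "what is the probability of X given Y" dice questions with the same successes-out-of-total idea.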

How does this relate? What we did differently here was to look at the individual cases in order to extrapolate. So just like the test, instead of looking at one question and trying to guess the probability it will be right, we looked at every one, determined the probability of a correct answer on average, and then the probability of a success for that individual should be the same.

And this applies to all of statistics. Every statistics problem is simply determining the number of successes and failures in every possible outcome in order to determine the probability of a single success on a single individual, and sometimes we've come up with nice formulae to make quick calculation shortcuts. It's all the same. Determine every single possibility, figure out how many of them are successes, and successes out of total equals probability.

And that's probability.

Monday, January 10, 2011

Standard Deviation Explained

One of the best things about statistics is that it's not really that old as a science. The language that is used is fairly recent, though with kind of an Old English stodginess to it. Most terms just have to be taken back to their strict dictionary definition to be understood. So standard deviation simply means, The Standard... Deviation.

You know both of these words; we just don't tend to use them a lot in everyday language. At least not together, or in this way. The American Oxford Dictionary defines standard as

Standard: used or accepted as normal or average : the standard rate of income tax | it is standard practice in museums to register objects as they are acquired.

Something that you expect to have happen, or what you would expect to see on average. Let's go with that: the average.

Now, we often hear deviation in one of its variants; deviate or deviant. "You deviated from what you were told," meaning not quite correct. Or, "they're a social deviant", meaning unacceptable. So deviation means something has changed from what was expected. Together they give us,

The Average Change from the Expected.

If you take the whole data set, and take the average difference between each point and what is expected, you get, roughly speaking, the standard deviation. (Strictly, the differences are squared before averaging and the square root is taken at the end, so that positive and negative differences don't cancel, but the idea is the same.) In statistics this is often reported as a single value and will have the same units as the expected value. For instance, men in Canada are 1.736 m tall on average with a standard deviation of 16 cm (I made this up; I still can't find it). That means that a Canadian male's height is, on average, about 16 cm away from 1.736 m. Yes, this tells you nothing about the individual person, or what you might expect to see in a small group, but it does tell you something about all Canadians in general.
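To see how the plain "average change from the expected" compares with the textbook standard deviation (which squares the differences, averages them, then takes the square root), here is a small sketch. The heights are made up by me purely for illustration, just like the 16 cm figure above, and the function names are my own.

```python
import math


def mean(values):
    return sum(values) / len(values)


def mean_absolute_deviation(values):
    """The literal 'average change from the expected'."""
    m = mean(values)
    return mean([abs(v - m) for v in values])


def standard_deviation(values):
    """Population standard deviation: root of the mean squared difference."""
    m = mean(values)
    return math.sqrt(mean([(v - m) ** 2 for v in values]))


# Made-up heights in metres (NOT real Canadian data).
heights_m = [1.52, 1.65, 1.70, 1.74, 1.78, 1.83, 1.93]

print("average height:         ", round(mean(heights_m), 3), "m")
print("mean absolute deviation:", round(mean_absolute_deviation(heights_m), 3), "m")
print("standard deviation:     ", round(standard_deviation(heights_m), 3), "m")
```

The two measures come out in the same units and tell the same story; the standard deviation is never smaller than the plain average difference, because squaring gives extra weight to the points farthest from the average.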

Friday, January 7, 2011

The statistics of un-reportable data

According to the article "Stockwell Day says unreported crimes are rising", Statistics Canada reported that only 34% of victims of crime reported to the police. You ever wonder how they know that? Like really? If the crime goes unreported, how could we make a statistic based on it?

Well yes, there is Big Brother, and of course eyewitness accounts, but those would probably be put under reported crime. One way is to have a blind survey where there is an assurance that the answers cannot be linked to the person surveyed, something like the Canadian Census. Oh wait, the old Canadian Census. The new one couldn't tell you... well, anything. Either way, people are asked to recount any crime they've been a part of, as victim or criminal, along with the date; then these statistics are matched up with current crime rate data, and any discrepancy is assumed to be the unreported portion. But that leaves one major question. What is the variance?

Now anybody can make up any statistic by simply using ANY data set. It's a perfectly valid statistical procedure. For instance, I could use the shrub in my front yard to estimate the average height of people in Canada. Nothing wrong with that. "I think people in Canada are 12ft tall on average." This is not wrong; it's an estimate. What we assume when we hear a statistic from a body like Statistics Canada, though, is that the variance, or general variability of the statistic, is small. My shrub estimator, for instance, probably has a variance of about 100ft. That is, people in Canada are on average 12ft tall, give or take 100ft. Woowee, that's really helpful: people are between 0 and 112ft tall (assuming nobody has a negative height).

And that's what you should ask yourself with the unreported crime data set. Yes, only 34% of crimes are reported, but plus or minus what? Unfortunately very few statistics in the news are reported with a standard deviation (the statistical term for variability), or even any sort of hint as to how accurate the measure is. I, for example, can't find the standard deviation of male heights in Canada to save my life (if you find one please comment; first one gets a free lolli).

Since the crime estimate is probably based on a survey, one would need to know the sample size and reliability of the survey to really get an idea of how good a number it is. People wouldn't necessarily put down every crime, so it may even be under-reported. Although negative statistics like crime are often inflated to compensate. Also the demographic coverage would be key. If you live in Northern Ontario you wouldn't expect the percentage of unreported crime to be the same as in Manhattan. Or would you? So in the end it all comes down to extrapolation and what the data is based on. How far can I stretch the claim about this data, and is their situation anything like mine?
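To get a feel for how much the sample size matters, here is a rough sketch of the usual "plus or minus" for a survey percentage. Big assumptions, all mine: a simple random sample, honest answers, and the normal approximation with z = 1.96 for a 95% interval. The sample sizes are hypothetical; I don't know how many people were actually surveyed.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p from n responses."""
    return z * math.sqrt(p * (1 - p) / n)


reported = 0.34  # the 34% figure quoted above

# Hypothetical sample sizes, for illustration only.
for n in (100, 1_000, 10_000):
    moe = margin_of_error(reported, n)
    print(f"n = {n:>6}: 34% plus or minus {100 * moe:.1f} percentage points")
```

This only covers the variability from sampling. It says nothing about people leaving crimes off the form or about which regions were covered, which is exactly the part you can't judge without knowing what the data is based on.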

Thursday, January 6, 2011

The Probability of High Dice

So my brother wanted to know: what is the probability of getting a particular value on a number of different dice? With one catch. He wanted to know how many dice would have to be thrown to be almost positive that the highest face value is a 6, or no bigger than a 5, or a 4, etc. In essence, when r dice are thrown, what is the probability of getting no number greater than k?

As he found, this is fairly easy to do by hand for small numbers of dice, about 3 or fewer. However, after three it's almost impossible, without destroying a small forest, to list all possible combinations for, say, 4, 5 or more dice. The catch is that at least one of the dice has to attain the desired value and the rest can be equal to or smaller than that. Initially I didn't realize this constraint and simply added probabilities of every possible combination of dice, including triples of that value. Although we did come up with a stack of other interesting setups, and a number of cool points, we finally got the following:

P(highest value is k) = (k^r - (k-1)^r) / 6^r

Once you have the setup right it isn't too hard. Essentially it is all possible combinations of the dice with nothing above k, minus the combinations that don't include k, the highest number. The interesting point, though, is that although you are more likely to get a 6, good for initiative addition or something in the game, the probability of getting the smaller numbers drops off dramatically as the number of dice increases, producing a rather cool graph. Which for the math nerds contains a saddle point if you consider the numbers on the real line (i.e. less than 1 on a die).

From left to right we have r, the number of dice, increasing from 1 to 20; from front to back is k, the maximum value on the dice, increasing from 1 to 6. The upshot, and a possible cool rule to add, is what we dubbed the Clouseau multiplier.
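For anyone who wants the numbers behind that surface, the counting argument (all combinations with nothing above k, minus the combinations with nothing above k-1) can be sketched in a few lines. Note that the formula in the code, (k^r - (k-1)^r) / 6^r, is my reading of the setup described in words above, and the function name `p_highest` is just for illustration.

```python
from fractions import Fraction


def p_highest(k, r, faces=6):
    """Probability that the highest value among r fair dice is exactly k."""
    # k**r rolls have every die at k or below; (k-1)**r of those contain no k.
    return Fraction(k**r - (k - 1) ** r, faces**r)


# One row per number of dice r, one column per highest value k.
print("r   " + "".join(f"k={k:<7}" for k in range(1, 7)))
for r in (1, 2, 3, 5, 10, 20):
    row = "".join(f"{float(p_highest(k, r)):<9.4f}" for k in range(1, 7))
    print(f"{r:<4}{row}")
```

Each row adds up to 1, since the highest value has to be something between 1 and 6, which is a handy sanity check on the formula.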

An advanced, experienced character with 10 dice to roll has a whopping 84% chance of rolling a 6 as the highest value and a 14% chance of a 5, so we can be almost assured that the character will score a 5 or a 6. Unlike a lowly newbie rolling one die, with only a 17% chance of getting a 6, or any number for that matter. However, there is almost 0 possibility of getting all ones if you roll 10 dice. So if a player does succeed in getting 1 as their highest value, make it a critical hit. "What does 'self destruct' mean?"

And for those that are awed by those rare occasions: let's say your character is super powerful and has managed to get to level 42, where you can roll a total of 21 six-sided dice. The probability that you get at least one 6 is only... wait for it... 98%. It's high, but that means that roughly once every 50 games you won't get a single 6. Which is probably every day!
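Which brings us back to the original question of how many dice it takes to be "almost positive" of seeing a 6. A minimal sketch, assuming fair dice and reading "almost positive" as whatever target probability you choose; the function names are mine.

```python
def p_at_least_one_six(r, faces=6):
    """Probability that at least one of r fair dice shows the top face."""
    return 1 - ((faces - 1) / faces) ** r


def dice_needed(target, faces=6):
    """Smallest number of dice whose chance of a top face reaches target."""
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    r = 1
    while p_at_least_one_six(r, faces) < target:
        r += 1
    return r


print("21 dice:", round(p_at_least_one_six(21), 3))
for target in (0.5, 0.9, 0.99):
    print(f"to be {target:.0%} sure of a 6 you need {dice_needed(target)} dice")
```

You can never be 100% sure, no matter how many dice you roll, which is why the target has to stay below 1.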