16 April 2010

Statistics II

Why don't statistics work backwards? There are actually many reasons, but three are key. First, statistics requires data and data requires theory. Second, theory is about causation and statistics are about correlation. Correlation does not imply causation. Third, it just can't work backwards.

1. You can't do statistics unless you have data, so data precedes statistics. But you can't collect data without some theory that tells you which data is relevant. There is lots of unstructured information available about the world. It can't be analyzed unless it's been sorted and classified, and since that necessarily must come before statistical analysis, it can only be sorted through theory. (Even with archival data, it only exists before it's relevant.) No one is ever going to regress corporate performance against CEO hair color because, even though the information is available, it makes no theoretical sense.

2. Theory is a proposed causal relationship between two constructs. Statistics depends upon the correlation between one or more independent variables (the possible cause(s)) and the dependent variable (the supposed effect). You can't get to theory from statistics because correlation does not imply causation, though causation requires correlation. There are lots of pairs of things that correlate, even very strongly, without being causally related. Sometimes that's because both are caused by some third construct. Consider, for example, this graph of the relationship between the number of lemons imported from Mexico and US highway fatalities:

(I stole this graph from Derek Lowe, who in turn stole it from a letter to the editor in the Journal of Chemical Information and Modeling (Johnson, 2008).)

"R2 = 0.97" means that Mexican lemons explain approximately 97% of the variability in US highway fatalities over time, an almost unheard of result in the social sciences and much better than lots of studies that have caused otherwise rational people to upset their way of life. Why shouldn't we, then, repeal the traffic laws and just start importing lemons willy-nilly? Because theory tells us that lemons don't cause a reduction in highway deaths, no matter what statistics tells us.

What's really going on in this graph is that we've gotten richer over time, mostly as a result of increased productivity resulting from advanced technology. Richer means that we're importing more luxuries, like Mexican lemons. Advanced technology means that cars can be safer and richer means that safer cars are affordable. (For all I know, it might also be that technology advances have made importing lemons from Mexico easier and cheaper, increasing demand.) In any event, imported lemons and highway fatalities only correlate with each other because they both have causal relationships with a missing factor or two. Only theory can tell us whether a factor is missing and what it might be. Even then, it usually doesn't.
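To make the missing-factor point concrete, here is a minimal sketch in Python. Every number in it is invented for illustration: `wealth` is a hypothetical hidden factor that rises over time, and `lemons` and `fatalities` are each generated from it plus a little noise. Neither series is computed from the other, yet they correlate strongly.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical hidden factor: "wealth" rising steadily over 20 periods.
wealth = [float(t) for t in range(20)]

# Both series depend only on wealth (made-up coefficients), not on each other.
lemons = [10 + 2 * w + rng.gauss(0, 1) for w in wealth]
fatalities = [100 - 3 * w + rng.gauss(0, 1) for w in wealth]

r = pearson_r(lemons, fatalities)
r_squared = r * r
print(f"r = {r:.3f}, R^2 = {r_squared:.3f}")
```

The correlation comes out strongly negative with a high R², and nothing in the output hints that `wealth` is doing all the work. Only knowing how the data was generated (the theory) tells you that.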

3. The third reason we can't work backward is because that's just not how it works. This has to do with our acceptance of false positives, and with null hypothesis testing, so it's a little more complicated.

Statistics works by trying to figure out, if our two data sets really were random rather than systematically different based on the independent variable of interest, how likely it would be that we would get that distribution. Let's say we suspected, for example, that altitude affects whether a coin lands heads or tails. To test our theory, we flip a coin at ground level and at the summit of Mt. Everest. On the ground, we get 48 heads and 52 tails. At altitude, we get 62 heads and 38 tails. How likely is it that our distribution could happen at random? As it happens, I can test that: the chance of getting that distribution randomly is 0.047, or just under 5%. If I had gotten 61 heads on Everest, the chance would have risen to 6.5%.
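The post doesn't say which test produced those numbers. One test that gives roughly the same figures is a two-sided two-proportion z-test, so the sketch below assumes that one; the function name is my own.

```python
import math


def two_proportion_p_value(heads_a, n_a, heads_b, n_b):
    """Two-sided p-value for the difference between two proportions,
    using the pooled normal approximation (an assumed choice of test)."""
    pooled = (heads_a + heads_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (heads_b / n_b - heads_a / n_a) / std_err
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


p_62 = two_proportion_p_value(48, 100, 62, 100)  # ground vs. Everest, 62 heads
p_61 = two_proportion_p_value(48, 100, 61, 100)  # same, with one fewer head

print(f"62 heads on Everest: p = {p_62:.3f}")
print(f"61 heads on Everest: p = {p_61:.3f}")
```

A single coin flip is all that separates a result just under the 5% line from one just over it.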

As most of you know, the convention in science is that if the probability of a particular distribution being random is less than 5%, we can claim that the two sets of numbers are significantly different. The first thing to note about this is that this is only a convention. It is entirely arbitrary; there is no theory that makes 0.05 better than 0.04 or 0.06. If it were higher, we'd have more false positives; if it were lower, we'd have more false negatives. Just like deciding whether the speed limit should be 55 or 65, in the end all you can do is pick one.

The second thing to note is that this is not the same as saying that the chance that my result is random (and thus wrong) is 0.047. Implicitly, statistical tests compare my actual hypothesis (altitude causes a change in coin flipping results) to an opposite "null" hypothesis (altitude doesn't cause a change in coin flipping results). Generically, the hypothesis being tested is always "these two sets of numbers are significantly different" and the null hypothesis is always "these two sets of numbers are not significantly different." So the 0.047 really means, "the chance that my results are a false positive is 4.7%, if the null hypothesis is true." Of course, if the null hypothesis is false, then my theory is true and my results are necessarily not a false positive. In the real world, we can't know whether my theory is true separate from statistics, but strong theory, well founded on prior results, should be true. In other words, if my theory is convincing, the total chance that my results are a false positive is much less than 5%. (In some fields, the chance that my results are a false positive will be further reduced by replication, but in management we don't do replication.)

Now, what if we try to work backwards? The problem is that, for every hundred random regressions I do, I'll get (by definition) an average of five "significant" but false results even if there's no actual relationship in my data. Once, simply doing the regressions was onerous and something of a brake on this kind of fishing, but now, if I have the data on my computer, I can easily do 100 regressions in an hour. If I do a thousand regressions, I'll have 50 that are significant, and 10 that look really significant (p < 0.01). (One of the annoying things about popular science reporting is the focus on really small values of p. For reasons not worth going into here, if the probability of the null hypothesis being true is less than 5%, it really doesn't matter how small it is.) It is much, much, much more likely (actually, all but certain) that I'll find significance when fishing around than when testing theory. If we work backwards, we have no way of knowing what the chance of a false positive is.

[Because I want to make sure this point is clear, I'm going to beat it over the head a bit. If I come up with a theory and then test it, I know that the chance that my results are random (a false positive) is no more than 5%. Because my theory is strong and based on prior research -- and to be published it has to be vetted by experienced scholars in the field -- I know that the real chance of a false positive is actually quite a bit less than 5%. But if I work backwards, I have no idea what the chance of a false positive is. The particular relationship I found is less than 5% likely to be random, but I know -- by definition, and thus with certainty -- that if I do 100 regressions on completely random data, some of those regressions will be sufficiently unlikely (p < 0.05). In other words, when I'm fishing, the chance of a false positive is 100%, even though it is still true that the chance of any particular significant result being false is less than 5%. If I take that result and then fit a theory to it, I have no idea what the real chance of that result being a false positive is. Inherent in the math behind statistics is the assumption that I am testing a relationship I theorized a priori. That assumption is necessarily violated if I find the relationship and then develop the theory.]
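A rough way to see the fishing problem is to simulate it. The sketch below is an illustration under stated assumptions, not the regressions described above: it runs 1,000 made-up coin-flip experiments in which the null hypothesis is true by construction (both coins are fair), tests each with an assumed two-proportion z-test, and counts how many come out "significant" anyway. It also computes the chance of at least one false positive in 100 independent tests at the 5% level.

```python
import math
import random


def p_value(heads_a, heads_b, n):
    """Two-sided two-proportion z-test p-value for two samples of n flips each."""
    pooled = (heads_a + heads_b) / (2 * n)
    if pooled in (0.0, 1.0):  # identical all-heads or all-tails samples
        return 1.0
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
    z = ((heads_b - heads_a) / n) / std_err
    return math.erfc(abs(z) / math.sqrt(2))


def count_heads(rng, n):
    """Number of heads in n flips of a fair coin."""
    return sum(rng.random() < 0.5 for _ in range(n))


rng = random.Random(2010)  # arbitrary fixed seed for repeatability
n_tests, n_flips = 1000, 100

# Every experiment compares two fair coins, so any "significant" result is false.
p_values = [
    p_value(count_heads(rng, n_flips), count_heads(rng, n_flips), n_flips)
    for _ in range(n_tests)
]
significant = sum(p < 0.05 for p in p_values)
very_significant = sum(p < 0.01 for p in p_values)

# Chance that 100 independent tests at the 5% level yield at least one false positive.
chance_of_at_least_one = 1 - 0.95 ** 100

print(f"{significant} of {n_tests} null experiments had p < 0.05")
print(f"{very_significant} of {n_tests} null experiments had p < 0.01")
print(f"P(at least one false positive in 100 tests) = {chance_of_at_least_one:.3f}")
```

Because coin counts are discrete, the share of "significant" runs will not land on exactly 5%, but it will be in that neighborhood. The last figure works out to about 0.994, which is the "all but certain" in the text.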

Why can't we just go find significance and then go see if we can develop strong theory? Because we can always find a theory to fit if we know the end point.


Bret said...

Clearly, eating Mexican lemons makes us better drivers, no?

Bret said...

Actually a nice post, but I couldn't help the snark. Sorry.

Eric Guinn said...

… you can't collect data without some theory that tells you which data is relevant. There is lots of unstructured information available about the world.

Yes, you can collect data absent theory. We can collect data on the number of fatal car crashes in a year simply as a matter of counting, because for us violent death matters even if we have absolutely no theory about those deaths. Similarly, we can count imported lemons for any number of theory-free reasons.

Which means there is a lot of unstructured data -- first order knowledge -- available about the world.

Obtaining information from data is where theory steps in.

Noting that both Mexican lemon imports and US Highway fatalities varied over time is itself data: the chart itself is just data, not information. Mexican lemons do not "explain" 97% of US highway fatalities until someone asserts there is a dependency between the two.

In other words, because there is no theory, those three data (lemons, deaths, and their trends over time) together provide no information.

Which is why I'm lost when you say statistics don't work backwards. Data doesn't necessarily require theory, nor does data about data.

Asserting dependencies, which are second order knowledge, does.

David said...

Bret: No apologies necessary.

Skipper: I'm confused by your confusion, since you explain it quite nicely. In the absence of theory, statistics provide no information.

Hey Skipper said...

This is where I fear being pedantic.

Data, in and of itself, is theory and information free, so I think you should have said Data without some theory conveys no information. There is lots of unstructured data available about the world.

Data doesn't require theory, information does.

Bret said...

You don't need theory to have information from data.

If I have the heights of people in the room, that's data but there's also information. I can calculate a median height and then know that half the people are taller and half are shorter.

You only need theory when trying to find relationships between data or when trying to extrapolate data to larger populations.

David said...

Right. In effect, statistics tells you nothing about causes. Only theory tells you about causes. Statistics tells you whether your data is consistent with your theory, which more or less will be the subject of Statistics III.

Hey Skipper said...


The median height is just more data.

For all populations greater than two, there exists a median height; that is just a brute fact. A median height, in and of itself, has no more information than an individual height.

Add a theory: the distribution of height is correlated with sex. That imposes a dependency between two variables which allows deriving information.

Given a median, it would be possible to calculate from that alone how many women are in a group with some degree of confidence that the calculation produces a result that has some likelihood of being within some amount of the actual number.

Or, given how many women are in a group, it would be possible to calculate a median.

Both results are information extracted from data.

The difference is very clear to me.

I think.

Bret said...

This is a semantic argument. In my opinion, y'all are using an unusual and particular definition for "information".

erp said...

Bret, I was reluctant to drag semantics into this discussion, but a definition of terms would be useful.

Information, data, theory, observation, patterns, etc. must have precise meanings vis a vis statistics that they don't have in casual conversation.

Bret said...

"Information" is not a formal statistical term.

Hey Skipper said...


Yes, I might be making a semantic argument. I hope not, though.

If I am not, it lies in my use of the concept of dependency.

All sets of data have a mean. But the mean has no more meaning than any individual measure.

I have a set of values ranging from 61 to 119. The mean is 79.

What's the information?

Bret said...

If you tell me the mean and standard deviation of a set of random numbers then there's no information because the numbers are random.

If you give me the mean and standard deviation of the heights of people in a room, it's information (at least by the definition of information that I use) because it's descriptive. When I walk into that room, I'd have a better idea of what I'd expect to see than if I didn't have that information. No theory is involved, it's just a description of something.

Just like saying, "Joe's eyes are blue". That's a single datum, no mean, no std, no theory, but still information. Having been told that, I'd expect that when I saw Joe's eyes, they'd be blue.