Why don't statistics work backwards? There are actually many reasons, but three are key. First, statistics requires data, and data requires theory. Second, theory is about causation and statistics is about correlation, and correlation does not imply causation. Third, it just can't work backwards.

1. You can't do statistics unless you have data, so data precedes statistics. But you can't collect data without some theory that tells you which data is relevant. There is lots of unstructured information available about the world. It can't be analyzed unless it's been sorted and classified, and since that necessarily must come before statistical analysis, it can only be sorted through theory. (Even with archival data, it only exists before it's relevant.) No one is ever going to regress corporate performance against CEO hair color because, even though the information is available, it makes no theoretical sense.

2. Theory is a proposed causal relationship between two constructs. Statistics depends upon the correlation between one or more independent variables (the possible cause(s)) and the dependent variable (the supposed effect). You can't get to theory from statistics because correlation does not imply causation, though causation requires correlation. There are lots of pairs of things that correlate, even very strongly, without being causally related. Sometimes that's because both are caused by some third construct. Consider, for example, this graph of the relationship between the number of lemons imported from Mexico and US highway fatalities:

(I stole this graph from Derek Lowe, who in turn stole it from a letter to the editor in the Journal of Chemical Information and Modeling (Johnson, 2008).)

"R² = 0.97" means that Mexican lemons explain approximately 97% of the variability in US highway fatalities over time, an almost unheard-of result in the social sciences and much better than lots of studies that have caused otherwise rational people to upset their way of life. Why shouldn't we, then, repeal the traffic laws and just start importing lemons willy-nilly? Because theory tells us that lemons don't cause a reduction in highway deaths, no matter what statistics tells us.

What's really going on in this graph is that we've gotten richer over time, mostly as a result of increased productivity resulting from advanced technology. Richer means that we're importing more luxuries, like Mexican lemons. Advanced technology means that cars can be safer and richer means that safer cars are affordable. (For all I know, it might also be that technology advances have made importing lemons from Mexico easier and cheaper, increasing demand.) In any event, imported lemons and highway fatalities only correlate with each other because they both have causal relationships with a missing factor or two. Only theory can tell us whether a factor is missing and what it might be. Even then, it usually doesn't.
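To make the missing-factor story concrete, here is a minimal simulation sketch. Everything in it is invented for illustration: the `wealth` index, the coefficients, and the noise levels are assumptions, not the real lemon or fatality figures. Two series are each driven by the same hidden factor and never by each other, yet they end up strongly correlated.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical "wealth" index that rises steadily over 20 years.
wealth = [float(year) for year in range(20)]

# Each series depends on wealth plus its own noise; neither depends on the other.
lemons = [100 + 20 * w + rng.gauss(0, 15) for w in wealth]
fatalities = [50 - 1.0 * w + rng.gauss(0, 1) for w in wealth]

r_squared = pearson_r(lemons, fatalities) ** 2
print(f"R^2 between lemons and fatalities: {r_squared:.2f}")
```

The high R² here comes entirely from the shared dependence on `wealth`. Nothing in the numbers themselves says so; you only know it because you wrote the data-generating story, which is the role theory plays.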

3. The third reason we can't work backward is simply that statistics doesn't work that way. This has to do with our acceptance of false positives, and with null hypothesis testing, so it's a little more complicated.

Statistics works by asking: if our two data sets really were random, rather than systematically different based on the independent variable of interest, how likely is it that we would get the distribution we observed? Let's say we suspected, for example, that altitude affects whether a coin lands heads or tails. To test our theory, we flip a coin 100 times at ground level and 100 times at the summit of Mt. Everest. On the ground, we get 48 heads and 52 tails. At altitude, we get 62 heads and 38 tails. How likely is it that our distribution could happen at random? As it happens, I can test that: the chance of getting that distribution randomly is 0.047, or just under 5%. If I had gotten 61 heads on Everest, the chance would have risen to 6.5%.
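The text doesn't say which test produced those figures. As a sketch, a Pearson chi-square test on the 2x2 table of heads and tails, with one degree of freedom and no continuity correction, gives values that round to the same numbers. The choice of test and the function name `chi_square_p_value` are assumptions made for this illustration.

```python
import math

def chi_square_p_value(heads_a, tails_a, heads_b, tails_b):
    """Pearson chi-square test (1 df, no continuity correction) on a 2x2 table."""
    observed = [[heads_a, tails_a], [heads_b, tails_b]]
    row_totals = [sum(row) for row in observed]
    col_totals = [heads_a + heads_b, tails_a + tails_b]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in each cell if altitude made no difference.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # For one degree of freedom, the chi-square upper tail is erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

p_62 = chi_square_p_value(48, 52, 62, 38)  # 62 heads on Everest
p_61 = chi_square_p_value(48, 52, 61, 39)  # 61 heads on Everest
print(f"62 heads on Everest: p = {p_62:.3f}")
print(f"61 heads on Everest: p = {p_61:.3f}")
```

A single extra head moves the result from one side of the 5% line to the other, which is worth keeping in mind when reading the next point about how arbitrary that line is.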

As most of you know, the convention in science is that if the probability of a particular distribution being random is less than 5%, we can claim that the two sets of numbers are significantly different. The first thing to note is that this is only a convention. It is entirely arbitrary; there is no theory that makes 0.05 better than 0.04 or 0.06. If it were higher, we'd have more false positives; if it were lower, we'd have more false negatives. Just like deciding whether the speed limit should be 55 or 65, in the end all you can do is pick one.

The second thing to note is that this is not the same as saying that the chance that my result is random (and thus wrong) is 0.047. Implicitly, statistical tests compare my actual hypothesis (altitude causes a change in coin flipping results) to an opposite "null" hypothesis (altitude doesn't cause a change in coin flipping results). Generically, the hypothesis being tested is always "these two sets of numbers are significantly different" and the null hypothesis is always "these two sets of numbers are not significantly different." So the 0.047 really means, "the chance of getting results like mine by luck alone (a false positive) is 4.7%, **if the null hypothesis is true**." Of course, if the null hypothesis is false, then my theory is true and my results are necessarily not a false positive. In the real world, we can't know whether my theory is true separate from statistics, but strong theory, well founded on prior results, *should* be true. In other words, if my theory is convincing, the total chance that my results are a false positive is much less than 5%. (In some fields, the chance that my results are a false positive will be further reduced by replication, but in management we don't do replication.)

Now, what if we try to work backwards? The problem is that, for every hundred random regressions I do, I'll get (by definition) an average of five "significant" but false results even if there's no actual relationship in my data. Once, simply doing the regressions was onerous and something of a brake on this kind of fishing, but now, if I have the data on my computer, I can easily do 100 regressions in an hour. If I do a thousand regressions, I'll have about 50 that are significant, and about 10 that look really significant (*p* < 0.01). (One of the annoying things about popular science reporting is the focus on really small values of *p*. For reasons not worth going into here, once *p* is below the 5% threshold, it really doesn't matter how small it is.) It is much, much, much more likely (actually, all but certain) that I'll find significance when fishing around than when testing theory. If we work backwards, we have no way of knowing what the chance of a false positive is.
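A quick simulation sketch shows what fishing looks like. To stay self-contained it reuses the coin-flip setup rather than regressions: 1,000 experiments, each comparing two fair coins flipped 100 times, so the null hypothesis is true in every one. The chi-square test, the seed, and the variable names are assumptions made for this illustration.

```python
import math
import random

def chi_square_p_value(heads_a, tails_a, heads_b, tails_b):
    """Pearson chi-square test (1 df, no continuity correction) on a 2x2 table."""
    observed = [[heads_a, tails_a], [heads_b, tails_b]]
    row_totals = [sum(row) for row in observed]
    col_totals = [heads_a + heads_b, tails_a + tails_b]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return math.erfc(math.sqrt(chi2 / 2))

rng = random.Random(42)  # fixed seed so the run is repeatable
n_experiments = 1000
flips = 100

p_values = []
for _ in range(n_experiments):
    # Both coins are fair, so any "significant" difference is a false positive.
    heads_a = sum(rng.random() < 0.5 for _ in range(flips))
    heads_b = sum(rng.random() < 0.5 for _ in range(flips))
    p_values.append(
        chi_square_p_value(heads_a, flips - heads_a, heads_b, flips - heads_b)
    )

significant = sum(p < 0.05 for p in p_values)
very_significant = sum(p < 0.01 for p in p_values)
print(f"p < 0.05 in {significant} of {n_experiments} null experiments")
print(f"p < 0.01 in {very_significant} of {n_experiments} null experiments")
```

The exact counts depend on the seed, but they should land near 50 and 10. Every one of those "findings" is pure noise, and nothing in an individual *p* value distinguishes it from a real effect.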

[Because I want to make sure this point is clear, I'm going to beat it over the head a bit. If I come up with a theory and then test it, I know that the chance that my results are random (a false positive) is no more than 5%. Because my theory is strong and based on prior research -- and to be published it has to be vetted by experienced scholars in the field -- I know that the real chance of a false positive is actually quite a bit less than 5%. But if I work backwards, I have no idea what the chance of a false positive is. The particular relationship I found is less than 5% likely to be random, but I know with near certainty that if I do 100 regressions on completely random data, some of those regressions will be sufficiently unlikely (*p* < 0.05). In other words, when I'm fishing, the chance of a false positive is essentially 100%, even though it is still true that the chance of any particular significant result being false is less than 5%. If I take that result and then fit a theory to it, I have no idea what the real chance of that result being a false positive is. Inherent in the math behind statistics is the assumption that I am testing a relationship I theorized *a priori*. That assumption is necessarily violated if I find the relationship and then develop the theory.]
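The arithmetic behind the fishing claim can be written out. As a sketch, assume the $m$ regressions are independent and each is tested at $\alpha = 0.05$ (independence is an assumption added here for the calculation, not something stated above). Then the chance of at least one false positive on completely random data is:

$$
P(\text{at least one false positive}) = 1 - (1 - \alpha)^{m} = 1 - 0.95^{100} \approx 0.994
$$

So for 100 regressions the figure is about 99.4%, which is certainty for all practical purposes, and it only gets closer to 100% as $m$ grows.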

Why can't we just go find significance and then go see if we can develop strong theory? Because we can always find a theory to fit if we know the end point.