07 April 2010

Statistics I

Since it's become clear that I'm not going to write one long post on statistics, I've decided to write an infinite series of short posts on statistics.

Don't say I didn't warn you.

In this post, I'll start with the purpose of statistics, which is to tell us whether two sets of numbers, separated by some characteristic that we suspect might be important, are really different. Although it can get very sophisticated (for example, we sometimes don't know the second set of numbers), that's really all statistics does or can do.
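
For the impatient, here is a rough sketch of what "really different" can mean in practice. It is one approach among many (a permutation test), not the only one, and the function name and the numbers are invented purely for illustration. The idea: if the characteristic doesn't matter, then shuffling the labels shouldn't matter either, so we shuffle many times and count how often chance alone produces a gap as large as the one we saw.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_shuffles=10_000, seed=0):
    """Return the share of random relabelings whose gap in means is at
    least as large as the gap actually observed between the two groups."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # pretend the group labels are meaningless
        gap = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if gap >= observed:
            hits += 1
    return hits / n_shuffles


# Hypothetical measurements, made up for this example only.
treated = [12.1, 11.8, 13.0, 12.6, 12.9, 13.3]
control = [11.2, 11.9, 11.5, 12.0, 11.1, 11.7]

share = permutation_test(treated, control)
print(f"share of shuffles with a gap this large: {share:.4f}")
```

A small share suggests the two sets differ by more than shuffling luck would explain. Note what the code does not do: it has no idea why we split the numbers into "treated" and "control" in the first place. That part has to come from us.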

In particular, statistics can't work backwards. We can't see that two sets of numbers are different, and then work our way back to figure out what the difference is. Theory must drive statistics, but statistics can't drive theory.

Next time, what alpha means, and what what alpha means means.


Brit said...

As usual, I don't follow you. Please provide copious examples.

Peter said...

"I'll start with the purpose of statistics"

Did we ask?

Are you trying to compete with AOG for the inscrutable post of the year award?

Harry Eagar said...

You must be defining statistics narrowly.

Don't most people consider John Snow's survey of the mortality bills a use of statistics, and didn't he draw conclusions from that?

(I have read Edward Tufte's discussion, which claims Snow overinterpreted his data, but Snow did turn out to be right.)

David said...

In reverse chronological order:

Harry: What Snow did was statistics, or at least used statistics, but I don't know why you think my definition doesn't include Snow's work. Snow showed, albeit non-rigorously, that the number of cases of cholera in people who used water from a particular pump differed significantly from the number of cases of cholera among people using other pumps.

Peter: Yes. Yes, you did.

Brit: Copious examples to come.

Brit said...

So, is your 'can't work backwards' point that without a theory about pumps causing cholera, figures about pump use and cholera death would just be a big indecipherable, incomprehensible mess, rather than stats?

David said...

I'm saying something even more dispiriting than that. I'm saying that, without a theory that causes us to look at whether pumps and cholera are connected, data that seems to show that they are connected is meaningless.

Peter said...


Are you sure? I do recall begging you to elaborate on your hypothesis that atheism is a Christian heresy, but my search for the purpose of statistics escapes my mind.

But I think you have got Harry all excited, so I'll just sit back politely and watch.

Harry Eagar said...

I guess I'm not getting something.

I had thought that the two, theory and anomalies, worked like clapping hands.

Surely the observation of the perturbation of the orbit of Mercury preceded and inspired a theory?

Or, at least, damaged the old theory?

I don't recall what the theory of the etiology of cholera was pre-Snow. Miasmas?

Are you suggesting he had the theory (water carries cholera) and went looking for a pattern to confirm that?

At some point, the fact that we are pattern-finding animals must come in. If we can find patterns when there really are none, and statistics can tell us when, surely sometimes we find patterns that are really there.

We are on the verge of one of my favorite pieces of advice, from W. Edwards Deming: You've got to have a theory; without a theory, how do you know when you are wrong?

David said...

Harry: Good points, but those are issues about science, not really issues about statistics.

When differences are clear, as they were with cholera, induction seems like a legitimate scientific method. But as we'll see, with most statistics induction won't work and, because we're pattern-seeing animals, it can be very dangerous.

Hey Skipper said...

"In particular, statistics can't work backwards. We can't see that two sets of numbers are different, and then work our way back to figure out what the difference is."

I lost you here.

It is true that there are different hair colors. It seems true that hair color prevalence varies geographically. Backward-looking statistics is the only way to demonstrate whether that intuitive observation -- which has no theoretical content at all -- is true.

One could go a step further and correlate that observation (provided it is true) with other factors, and then have a set of observations that are still theory-free.

By that I mean one is still working with first order knowledge.

So it seems to me (having taken only a few statistics classes, and none more recently than 20 years ago) that such an exercise uses statistics to work backwards, determines that an actual (as opposed to an apparent) difference exists, and then works towards a theoretical explanation.

erp said...

David, unlike Skipper, my last statistics class was 50+ years ago, so please humor me. Are you saying that because we're pattern-seeking animals, we should ignore our observations?

Isn't it our ability to discern patterns which led us to create theories and a system to help explain them?