23 April 2010

Statistics III

The third installment of our little series on statistics is perhaps the most counter-intuitive lesson of all. The lesson is two-fold:

1. Regardless of how large the sample is, it represents the population of interest as a whole only if every member of that population had the same chance of being sampled, which is what "random sample" means.

2. Given a truly random sample, the accuracy of population estimates based on the sample depends only on the size of the sample, not on the size of the population.

("Sample" is used here to mean a subset of the population available for study, and thus not the population itself.)
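A rough way to check the second point is to simulate it. The sketch below is only an illustration under assumptions of my own: the helper name `spread_of_estimates`, the 50% true share, the two population sizes and the 200 repetitions are all invented for the example. It draws repeated random samples of 1,000 from a population of 100,000 and from one of 10,000,000, and measures how much the sample estimates scatter in each case. The "population size doesn't matter" rule is an approximation that holds when the population is much larger than the sample.

```python
import random
import statistics

def spread_of_estimates(population_size, sample_size, trials=200, true_share=0.5, seed=0):
    """Standard deviation of the sample proportion across repeated random samples."""
    rng = random.Random(seed)
    # Hypothetical population: members 0 .. cutoff-1 hold the opinion of interest.
    cutoff = int(population_size * true_share)
    estimates = []
    for _ in range(trials):
        # random.sample draws without replacement; every member has the same chance.
        sample = rng.sample(range(population_size), sample_size)
        estimates.append(sum(1 for member in sample if member < cutoff) / sample_size)
    return statistics.stdev(estimates)

small = spread_of_estimates(population_size=100_000, sample_size=1_000)
large = spread_of_estimates(population_size=10_000_000, sample_size=1_000)
print(f"population 100,000:    spread of estimates {small:.4f}")
print(f"population 10,000,000: spread of estimates {large:.4f}")
```

Both spreads should land close to each other even though one population is a hundred times the size of the other. Changing `sample_size` is what moves them.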

8 comments:

Harry Eagar said...

OK. Let's take the Gallup Poll, which typically samples 1,500 people for its national political reports. That's about 1 per 2 counties.

I contend that knowing what Joe A. Blow in Flatgrass County thinks about candidate X is not a piece of information relevant to the political opinion of Flatgrass and its sister county, Sweetgrass.

Paulos, in his first book about statistics, argued that the case of a Florida woman who allegedly got AIDS from dental work was a sham because the likelihood that a person of her characteristics would get AIDS in her county was 0 sigma deviation.

However, I contend that is a false correlation, because there are 7 ways to get AIDS, and living in a county is not one of them.

With Gallup, at some point you would have such a large sample that it is effectively the election itself. Yet the sample needed must be much larger than 1,500, because Gallup's results are so far off. As argued by Michael Wheeler in 'Lies, Damn Lies and Statistics.'

So, other than backcasting, what statistical test tells you whether your sample is big enough?

David said...

Great questions, Harry.

First, let's note that polls are not (and are not intended to be) fractal. A poll that aims to predict the results you would get if you asked every American cannot tell you anything "scientifically" about any subdivision, whether local, county, or state.

The point of the post is that it takes as many respondents (chosen randomly) to predict the opinions of the city, county and state as it does to predict the opinion of the nation. The size of the population you're trying to predict doesn't matter. (In fact, the statistics program doesn't even know the size of the population.)

[Some vocabulary: the "sample" is the group of people, drawn from the population of interest, who you actually survey or experiment upon. The "population" is what you are trying to model and predict; the opinion of US citizens, for example. The sample only reflects the population if the members of the sample are chosen randomly. The larger the sample, the more accurate the prediction, but the size of the population doesn't matter.]

The next question is, what is "big enough." The accepted answer is that a sample is big enough if it can give you a range within which, with 95% certainty, the "real" answer falls. So, for the Gallup polls, they want a sample that is large enough that they can give a point estimate of US opinion as a whole, along with a standard error that shows how broad a range you need to get to 95% certainty. Any sample will give you a point estimate. The smaller the sample, the larger the standard error.
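As a minimal sketch of that calculation, assuming a simple random sample and the usual normal approximation for a proportion: the function name `proportion_interval` and the figure of 780 "yes" answers out of 1,500 are made up for illustration and are not Gallup numbers.

```python
import math

def proportion_interval(successes, sample_size, z=1.96):
    """Point estimate, standard error and approximate 95% interval for a proportion."""
    estimate = successes / sample_size
    # Standard error of a sample proportion under simple random sampling.
    standard_error = math.sqrt(estimate * (1 - estimate) / sample_size)
    # z = 1.96 standard errors on either side covers roughly 95%.
    margin = z * standard_error
    return estimate, standard_error, (estimate - margin, estimate + margin)

# Hypothetical poll: 780 of 1,500 respondents favour candidate X.
estimate, standard_error, (low, high) = proportion_interval(successes=780, sample_size=1500)
print(f"point estimate {estimate:.3f}, standard error {standard_error:.4f}")
print(f"approximate 95% interval: {low:.3f} to {high:.3f}")
```

Note that the population size is not an argument to the function at all; only the sample size and the observed share go in.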

On the other hand, there are diminishing returns. Past a certain point, it's really not worth increasing your sample because you don't really get a better estimate of the population. That point is generally thought to be at about 1350 members of your sample. If you want to estimate what Maui County thinks, ask 1350 randomly chosen people in Maui County. If you want to estimate what Hawaii thinks, ask 1350 randomly chosen people in Hawaii. If you want to know what likely US voters think, ask 1350 randomly chosen likely US voters.
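The diminishing returns can be made concrete with a short sketch. It assumes a simple random sample, a true share near 50% (the widest case) and the normal approximation; `worst_case_margin` and the list of sample sizes are names and values chosen for this example only.

```python
import math

def worst_case_margin(sample_size, z=1.96):
    """Approximate 95% margin of error for a proportion near 50%."""
    # p * (1 - p) is largest at p = 0.5, where it equals 0.25.
    return z * math.sqrt(0.25 / sample_size)

for sample_size in (100, 400, 1350, 5400):
    margin_in_points = 100 * worst_case_margin(sample_size)
    print(f"n = {sample_size:>5}: about +/- {margin_in_points:.1f} percentage points")
```

Because the margin shrinks with the square root of the sample size, halving it requires four times as many respondents, which is why going far beyond a sample in the low thousands buys little. Smaller samples are still usable; they just come with wider margins.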

Finally, note that this is a purely mathematical calculation, ignoring bias, social desirability effects, etc. If you had three hundred million marbles, some red and some blue and wanted to know the proportion of red to blue, you would randomly select 1350 marbles, count the red and blue marbles, and then you could calculate a point estimate for the population and a standard error for your estimate. You completely ignore the possibility that a red marble lies to you and claims that it's blue because you're an attractive young woman and the red marble assumes that you're a blue marble. Similarly, the errors announced by polling companies assume that people are marbles and don't lie, change their minds, etc.
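To see why the announced error says nothing about lying marbles, here is a toy simulation. Everything in it is an assumption made for illustration: half the marbles are red, one red marble in ten claims to be blue, the pile is treated as effectively infinite, and `reported_red_share` is a hypothetical helper.

```python
import random

def reported_red_share(sample_size, true_red_share=0.5, lie_rate=0.1, seed=1):
    """Share of sampled marbles that report being red when some red ones claim blue."""
    rng = random.Random(seed)
    reported_red = 0
    for _ in range(sample_size):
        is_red = rng.random() < true_red_share
        lies = rng.random() < lie_rate  # only matters when the marble is red
        if is_red and not lies:
            reported_red += 1
    return reported_red / sample_size

modest = reported_red_share(1_350)
huge = reported_red_share(135_000)
print(f"n = 1,350:   reported red share {modest:.3f}")
print(f"n = 135,000: reported red share {huge:.3f}")
```

Under these assumptions the reported share settles near 45% rather than the true 50%. A sample a hundred times larger gives a tighter estimate of the wrong number, because the standard error only measures sampling noise, not bias.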

The problem that you don't really know who will vote until the vote happens is similarly ignored.

Harry Eagar said...

Wheeler, writing back in 1976, also objected, re Gallup, that except for a few polls with sample size about double, the political questions were tacked on to commercial surveys that took about 90 minutes to complete, arguing that that was a kind of filter:

You didn't get political opinions, but political opinions of that part of the population that cared enough to spend 90 minutes answering questions about soap.

Wheeler much preferred (though he did not like polls at all) polls that did only politics. That was why the Iowa Poll was always called by journalists (though never by me) "the respected Iowa Poll."

I didn't respect it, though, and in 1980 I was in the newsroom of The Des Moines Register when the Iowa Poll boloed the congressional election and surprised its owners by wrongly estimating (or whatever verb you want to use) that a Democrat would win the Iowa Senate seat.

The Iowa Poll went into seclusion for some time, and it was rumored, but never announced, that the Register was considering shutting it down. That didn't happen.

But for that and other reasons, including the ones you listed, I don't believe opinion polls.

Brit said...

It is counter-intuitive but I can believe it as maths is full of such counter-intuitive stuff. But, (and I guess this is Harry's point), I wonder whether the word 'random' can ever apply to an opinion poll, since unlike marbles people are less susceptible to the simple laws of maths.

As well as the self-selection problem even in supposedly random samples, there are often obscure disincentives to honesty in answers, such as a feeling that one's real viewpoint is opposite to the morally correct one (e.g. it's well known that voter opinion polls in the UK favour Labour because Conservatives don't like to admit that's what they are).

Harry Eagar said...

Opinion polls are only a subset of statistics, but particularly interesting.

Not only is there the lying problem, but now that so many people don't have landlines and are not in phone books, you hit the Literary Digest problem in reverse: it becomes a random sample of old fogies who don't rely on cell phones.

The Maui News never reports polls as news. I have never spoken to the editor about his reasons, but I suppose they must be similar to mine: We do not think political opinion polls are reliable.

Hey Skipper said...

That point is generally thought to be at about 1350 members of your sample.

Does that mean you can't statistically treat populations of less than 1350?

Which is really a roundabout way of suggesting that unless the Briffa tree ring proxy study used at least 1350 cores, then it was fatally flawed from the outset.

Harry Eagar said...

I await the answer to that with bated breath.

If 1,350 is the minimum, then just about all cancer epidemiology is out the window.

I have seen numerous studies with a population of fewer than 10.

Ali said...

There is a company in Britain - YouGov - that pays an invited pool of members small amounts to take part in their surveys.

They tend to get decent predictability.