Significance Testing is not Magic

A ten-minute survey produces a lot of data. And if we’re trying to study the differences between demographic groups, we end up with a lot of comparisons. (A 20-question survey comparing men/women, three age ranges, and three income ranges yields 140 comparisons: each question has one gender pair, three age pairs, and three income pairs, and 20 × 7 = 140!)
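For anyone who wants to check the count, here is a minimal sketch of the arithmetic. It assumes every pair of groups within a demographic is compared on every question; the variable names are made up for illustration.

```python
from math import comb

questions = 20
# Number of groups in each demographic breakdown (taken from the example above).
group_sizes = {"gender": 2, "age": 3, "income": 3}

# Each breakdown with k groups produces C(k, 2) pairwise comparisons per question.
pairs_per_question = sum(comb(k, 2) for k in group_sizes.values())
total_comparisons = questions * pairs_per_question

print(pairs_per_question, total_comparisons)  # 7 140
```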

In a typical data table, these comparisons will be “tested” for significance – usually denoted by a little superscript next to the percentage in a cell.

But significance testing like this is misguided. Here’s why…

Every significance test implies a hypothesis was tested

Generating 140 t-statistics implies that we had 140 hypotheses that we wanted the data to help accept/reject. That’s simply never the case in corporate market research. Hypothesis development is a careful and painful process. It requires a theoretical understanding of the subject matter followed by a painstakingly detailed experimental design and execution.

Corporate market research is usually much more exploratory. Deadlines and budgets mean we can’t always design the study exactly how we want. The result is surveys that are typically a little longer than we wanted in the hope that something in there will have the answer we’re looking for.

5% of 140 is seven

Conducting mass hypothesis tests puts the researcher at risk of interpreting random differences as true differences. (https://xkcd.com/882/)
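A quick way to see this is to simulate 140 comparisons in which the two groups are truly identical and count how many still clear the usual 1.96 cutoff. The sketch below is only an illustration: the group size of 500, the 40% "yes" rate, and the function names are all assumptions, and it uses a pooled two-proportion z-test rather than whatever test a given data table applies.

```python
import math
import random


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-statistic for x1/n1 versus x2/n2."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return 0.0 if se == 0 else (x1 / n1 - x2 / n2) / se


def count_false_positives(n_tests=140, n_per_group=500, true_rate=0.4, seed=0):
    """Count 'significant' differences when both groups share the same true rate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        # Both groups are drawn from the same population, so any gap is noise.
        x1 = sum(rng.random() < true_rate for _ in range(n_per_group))
        x2 = sum(rng.random() < true_rate for _ in range(n_per_group))
        if abs(two_proportion_z(x1, n_per_group, x2, n_per_group)) > 1.96:
            hits += 1
    return hits


print(count_false_positives())
```

Every difference flagged by this run is a false positive by construction, because the simulated groups never differ. Changing `seed` changes how many turn up.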

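If you are testing at a 95% confidence level, then about 5% of the comparisons where no true difference exists will still come up as significant: a false positive (type I error). That’s really dangerous when you have 140 comparisons because, even if there were no real differences at all, you’d expect about seven of them to be flagged as significant.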
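The arithmetic behind the seven can be sketched in a few lines. The last figure, the chance of at least one false positive, assumes the 140 tests are independent of each other, which survey questions rarely are, so treat it as a rough guide rather than an exact number.

```python
alpha = 0.05    # false-positive rate per test at the 95% confidence level
n_tests = 140   # number of comparisons in the data table

# Expected number of false positives if no real differences exist.
expected_false_positives = alpha * n_tests

# Chance of at least one false positive, ASSUMING independent tests.
family_wise_error = 1 - (1 - alpha) ** n_tests

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"chance of at least one:   {family_wise_error:.4f}")
```

Under that independence assumption, at least one spurious "significant" difference is close to a certainty.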

It doesn’t make the results any more scientific

Testing 140 hypotheses at once is like throwing a bowl of spaghetti against the wall and seeing what sticks. Putting on a lab coat doesn’t make what you did an “experiment” any more than computing 140 t-statistics does.

Survey questions are not independent

Experiments are supposed to be independent. The results of hypothesis 1 should not have any impact on the results of hypothesis 2. For example, suppose the survey results show men are more likely than women to do X, and later in the same results you find men are less likely to do Y. By computing two separate t-statistics, you’re implying that these results are independent of each other. But the same respondents answered both questions, so their answers to X and Y are usually related.
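To illustrate why this matters, here is a rough simulation of two questions, X and Y, answered by the same respondents. Nothing in it comes from real survey data: the `overlap` parameter is a hypothetical way of making answers to Y track answers to X, and men and women are given identical true rates so that every significant result is a false positive.

```python
import math
import random


def z_stat(x1, x2, n):
    """Pooled two-proportion z-statistic for two groups of equal size n."""
    pooled = (x1 + x2) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return 0.0 if se == 0 else (x1 - x2) / n / se


def answer_counts(rng, n, rate, overlap):
    """Count 'yes' answers to X and Y among n respondents.

    With probability `overlap` a respondent repeats their X answer for Y;
    otherwise Y is a fresh independent draw (an assumed toy model).
    """
    x_yes = y_yes = 0
    for _ in range(n):
        x = rng.random() < rate
        y = x if rng.random() < overlap else rng.random() < rate
        x_yes += x
        y_yes += y
    return x_yes, y_yes


def joint_false_positive_rate(overlap, trials=1000, n=200, rate=0.5, seed=1):
    """Share of simulated surveys where BOTH questions look significant."""
    rng = random.Random(seed)
    both = 0
    for _ in range(trials):
        men_x, men_y = answer_counts(rng, n, rate, overlap)
        women_x, women_y = answer_counts(rng, n, rate, overlap)
        sig_x = abs(z_stat(men_x, women_x, n)) > 1.96
        sig_y = abs(z_stat(men_y, women_y, n)) > 1.96
        both += sig_x and sig_y
    return both / trials


print("unrelated questions:", joint_false_positive_rate(overlap=0.0))
print("related questions:  ", joint_false_positive_rate(overlap=0.9))
```

If the two tests really were independent, both would come up falsely significant in roughly 0.05 × 0.05 = 0.25% of surveys. When the questions are related, a fluke on X tends to drag Y along with it, so the two "findings" are really one piece of noise counted twice.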

Of course, there’s still value in the survey results. But a good researcher needs to recognize that a t-value of 1.96 or more doesn’t represent some magical boundary between reportable vs. non-reportable data.

Treat your data as a set, not an enormous combination of combinations.
