Sensible Significance Testing in Market Research Cross Tabulations
It is common practice in market research, and perhaps survey
research in general, to produce up to several hundred cross
tabulations .. at least one per question asked.
Typically there is a "standard break" - the horizontal axis
of the table - applied in each case. This "break" defines
groups of interest to the researcher e.g.
Aged under 35
Aged 36-55
Aged 56 and over
Males Under 18
Females Under 18
Clearly these groups are not mutually exclusive.
Given this large volume of output, it is obviously of
interest to the researcher to have some method of drawing
attention to "differences of importance". One way of doing
this is to do numerous "tests of significance" and to
display the results with some device e.g. +++ or --- or ***
under the cell percentage.
To clarify: say the cross tabulation (crosstab) has the
above-mentioned standard break across the page, and some
field (say, "Intention to Purchase") with a smallish number
of categories (say 5) down the page. The values in the
crosstab are vertical percentages, i.e. percentages of the
base in each column.
The significance test used for a given cell is a two tailed
t test of the null hypothesis that there is no difference
between the proportion falling into this row in this column
(p1) and p2, the proportion falling into the corresponding
row in the "balance" column (imputed from the total column).
If there are 5 categories in this column, then there will be
5 such tests (per column). If the null hypothesis is
rejected at some level of significance (say 5%), then the
cell is "marked" .. if p1 < p2 then the mark is ---, if
p1 > p2 then the mark is +++. There is no mark if the null
hypothesis is not rejected.
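For concreteness, here is one way the per-cell test could be
sketched (the counts, the function name, and the example numbers
are hypothetical, not from any real survey; stdlib only):

```python
from math import erfc, sqrt

def mark_cell(x1, n1, x_tot, n_tot, alpha=0.05):
    """Two-tailed z test of p1 (this row's proportion in this column)
    against p2, the proportion in the 'balance' column imputed from
    the total column. Returns '+++', '---', or '' (no mark)."""
    p1 = x1 / n1
    n2 = n_tot - n1                 # base of the balance column
    p2 = (x_tot - x1) / n2          # balance proportion for this row
    p = x_tot / n_tot               # pooled proportion under the null
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-tailed normal p value
    if p_value >= alpha:
        return ""
    return "+++" if p1 > p2 else "---"

# Hypothetical cell: 60 of 200 under-35s "definitely will buy",
# against 180 of 1000 in the total column.
print(mark_cell(60, 200, 180, 1000))  # -> +++
```

(A t test with the cell's degrees of freedom would differ only
slightly at these bases; the normal approximation keeps the sketch
short.)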
This use of significance testing is somewhat in the spirit
of Jones and Tukey's proposal, "A Sensible Formulation of
the Significance Test" (Jones & Tukey, 2000, Psychological
Methods), in that no statement is made if the sample data
do not support a difference in means.
A brief extract from the Jones and Tukey paper:
"As a consequence, we should not set forth a null hypothesis
because to do so is unrealistic and misleading. Instead, we
should assess the sample data and entertain one of three
conclusions:
(1) act as if µA - µB > 0;
(2) act as if µA - µB < 0;
(3) act as if the sign of µA - µB is
indefinite, i.e., is not (yet) determined.
This specification is similar to "the three alternative
decisions" proposed by Tukey (1960, p. 425).
With this formulation, a conclusion is in error only when it
is "a reversal," when it asserts one direction while the
(unknown) truth is the other direction. Asserting that the
direction is not yet established may constitute a wasted
opportunity, but is not an error. We want to control the
rate of error, the reversal rate, while minimizing wasted
opportunity, i.e., while minimizing indefinite results."
..The adoption of this proposed three-alternative conclusion
procedure lends itself to reporting the effect size as the
estimated value of µA - µB (either standardized or not).
Regardless of the size of that estimate, and regardless of
whether or not the calculated value of t falls in the
rejection region, it seems appropriate to report the p value
as the area of the t distribution more positive or more
negative (but not both) than the value of t obtained from
(yA - yB)/sd. (The limiting values of p then are 0, as
the absolute value of t becomes indefinitely large, and 1/2,
as the value of t approaches zero.)
For any specified positive or negative population mean
difference, there may be found in the usual way the
probability of a Type II error, of withholding judgment when
the parametric difference is as specified. For each
specified difference, the probability of a Type II error is
smaller than that for the conventional two-tailed test of
significance. Thus, the proposed procedure is uniformly
more powerful than the conventional procedure.
Hodges & Lehmann (1954) proposed a modification of the
traditional Student test, converting it from a two-sided
test of the null hypothesis to two one-sided tests. Kaiser
(1960) proposed combining the two one-sided tests into a
single test, but one with two directional alternative
hypotheses, µA < µB and µA > µB. (For further discussion,
see Bohrer, 1979, Bulmer, 1957, and Harris, 1997.) However,
the unrealistic null hypothesis of zero mean difference in
the population is included in these proposals, in contrast
to the formulation above. By acknowledging the fiction of
the null hypothesis, and following the implications from
"every null hypothesis is false," our formulation yields,
for any sample size and any value of alpha, a test with greater
reported sensitivity to detect the direction of a difference
between two population means.
Note that those accustomed to make "tests of hypotheses" at
.05 would, using the procedures set forth here, do the same
arithmetic but would describe their results as a "test of
significance" at one-half of .05, i.e., at .025.
Alternatively, to maintain at .05 the probability of acting
as if the parametric difference is in one direction when, in
fact, it is in the other, the investigator would employ the
.10 tabled value of alpha.
In this context, where the objective is to draw attention to
differences of importance, it might make sense to use two
one-tailed t tests where each test is against a difference of
a particular size (a threshold). This thresholding would
presumably reduce the number of false positives (Type I
errors), but increase the number of "no calls" (false
negatives, i.e. Type II errors).
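The thresholded, three-way (Jones-Tukey style) version could be
sketched like this (the threshold, the bases, and the function name
are all illustrative choices; an unpooled standard error is used
because the null difference here is nonzero):

```python
from math import erfc, sqrt

def directional_call(p1, n1, p2, n2, threshold=0.05, alpha=0.05):
    """Two one-tailed z tests of the observed proportions against a
    minimum difference of interest (the threshold). Returns '+++',
    '---', or '' (withhold judgment)."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_hi = ((p1 - p2) - threshold) / se   # evidence that p1 - p2 > threshold
    z_lo = ((p2 - p1) - threshold) / se   # evidence that p2 - p1 > threshold
    p_hi = erfc(z_hi / sqrt(2)) / 2       # one-tailed p = P(Z >= z)
    p_lo = erfc(z_lo / sqrt(2)) / 2
    if p_hi < alpha:
        return "+++"
    if p_lo < alpha:
        return "---"
    return ""

# Hypothetical: 30% vs 15% on bases of 200 and 800, with a
# 5-percentage-point threshold of importance.
print(directional_call(0.30, 200, 0.15, 800))  # -> +++
```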
So far, so reasonable. But clearly we are conducting
multiple significance tests.
With 5 tests in a single column and alpha at .05, the
probability of finding at least one significant difference
when there are none is 1 - .95^5 = 0.23; but this is only a
rough calculation because the 5 tests are not independent.
Setting alpha at .01 is a partial solution, perhaps ..
1 - .99^5 = 0.05 .. but again at the expense of the power of
the test .. we will get more false negatives.
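The arithmetic, plus the Sidak-style per-test alpha that would cap
the familywise rate at .05 if the 5 tests really were independent
(only an approximation here, since they are not):

```python
# Familywise error rate for k independent tests at per-test alpha,
# and the per-test alpha that caps the familywise rate at .05.
k = 5
for alpha in (0.05, 0.01):
    fwe = 1 - (1 - alpha) ** k
    print(f"alpha={alpha}: P(at least one false positive) = {fwe:.3f}")

alpha_per_test = 1 - (1 - 0.05) ** (1 / k)   # Sidak adjustment
print(f"per-test alpha for a .05 familywise rate: {alpha_per_test:.4f}")
```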
Is there some way we can make a sensible tradeoff here?
(Since the sample sizes are fixed, the only factors
affecting power are the presumed/desired detectable effect
size and alpha.)
I don't know how to calculate the power of a set of
independent tests ... can anyone help me here?
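As a rough sketch of the kind of calculation involved: the power of
a single two-proportion z test can be approximated with the normal
distribution, and for k genuinely independent tests the chance that
at least one rejects (when each faces a real difference) is
1 - (1 - power)^k. All numbers below are hypothetical:

```python
from math import erfc, sqrt

def z_crit(alpha):
    """Two-tailed normal critical value, by bisection on the stdlib
    erfc (avoids a scipy dependency)."""
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if erfc(mid / sqrt(2)) / 2 > alpha / 2:  # P(Z >= mid) too big
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def power_two_prop(p1, p2, n1, n2, alpha=0.05):
    """Approximate power of the two-tailed two-proportion z test when
    the true proportions are p1 and p2 (normal approximation; the
    far tail is ignored as negligible)."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    shift = abs(p1 - p2) / se
    return erfc((z_crit(alpha) - shift) / sqrt(2)) / 2

# Hypothetical: a true 25% vs 15% split on bases of 200 and 800.
pw = power_two_prop(0.25, 0.15, 200, 800)
# For k INDEPENDENT tests, each with power pw against its own real
# difference, the chance that at least one of them rejects:
k = 5
print(round(pw, 3), round(1 - (1 - pw) ** k, 4))
```

The independence assumption is doing real work there; since the 5
cell tests in a column are negatively dependent (the proportions sum
to 1), this can only be a bound, not the exact answer.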
Still on the "single column, multiple tests" theme, other
possibilities:
a) the Bonferroni method .. divide alpha
by 5 (or should it be 4?)
b) a preliminary chi-square test at alpha = 0.05, with no
differences to be reported unless the chi-square test
suggests that there are indeed differences. If I do the
preliminary chi-square, should I then also do the Bonferroni
adjustment?
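Option (b) could be sketched as follows (the counts are hypothetical,
and the chi-square survival function is hard-coded for the 4 degrees
of freedom of a 2 x 5 table, which has a closed form):

```python
from math import exp

def chi2_sf_df4(x):
    """Survival function of chi-square with 4 df (closed form,
    available for even degrees of freedom)."""
    return exp(-x / 2) * (1 + x / 2)

def gatekeeper(col, balance, alpha=0.05):
    """Overall chi-square test of the 2 x 5 table of column counts
    vs balance-column counts. Returns True if the per-cell tests
    should proceed at all."""
    n1, n2 = sum(col), sum(balance)
    stat = 0.0
    for a, b in zip(col, balance):
        tot = a + b
        e1 = tot * n1 / (n1 + n2)   # expected count, this column
        e2 = tot * n2 / (n1 + n2)   # expected count, balance column
        stat += (a - e1) ** 2 / e1 + (b - e2) ** 2 / e2
    return chi2_sf_df4(stat) < alpha   # df = (2-1)*(5-1) = 4

col = [60, 50, 40, 30, 20]        # hypothetical column counts (base 200)
bal = [120, 200, 200, 160, 120]   # hypothetical balance counts (base 800)
print(gatekeeper(col, bal))  # -> True
```

If the gate is passed, one could then run the five per-cell tests at
alpha/5, i.e. combine (a) and (b) in a protected-testing scheme; but
whether both protections are warranted is exactly the question being
asked here.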
So far, I have considered only one column. These tests are
to be applied to all columns .. so, 50 such tests. The
columns most certainly do not represent independent
samples.. clearly, from the descriptions, they overlap.
I am semi-convinced that, having solved the one-column
problem, no further adjustments to the testing procedure are
needed. The simple argument is a reductio ad absurdum .. if
I added a column (L) which was identical to an existing
column (K), would I expect the "significance results" in
column K to be affected by this action? No. What if L was
not identical to K? Same answer. So, the one-column-at-a-time
idea seems OK. Maybe.
Now, consider another table .. same "break" but this time
the field "down" the page is "Frequency of Purchase".
Another set of 50 tests.
Obviously if we keep on doing this (adding tables) we are
going to find "significant results". But I am not sure that
the answer to the problem is to apply some mass (across
tables/across variates) Bonferroni adjustment.
Can anyone give me some technical/conceptual help here?