Sensible Significance Testing in Market Research Cross Tabulations

It is common practice in market research, and perhaps survey

research in general, to produce up to several hundred cross

tabulations .. at least one per question asked.

Typically there is a "standard break" - the horizontal axis

of the table - applied in each case. This "break" defines

groups of interest to the researcher e.g.

Total

Males

Females

Aged under 35

Aged 36-55

Aged 56 and over

Males Under 18

Females Under 18

Heavy Users

City 1

City 2

Clearly these groups are not mutually exclusive.

Given this large volume of output, it is obviously of

interest to the researcher to have some method of drawing

attention to "differences of importance". One way of doing

this is to do numerous "tests of significance" and to

display the results with some device e.g. +++ or --- or ***

under the cell percentage.

To clarify: say the cross tabulation (crosstab) comprises the above-mentioned standard break across the page, and some field (say, "Intention to Purchase") with a smallish number of categories (say 5) down the page, the values in the crosstab being vertical percentages, i.e. percentages of the base in each column.

The significance test used for a given cell is a two-tailed t test of the null hypothesis that there is no difference between the proportion falling into this row in this column (p1) and p2, the proportion falling into the corresponding row in the "balance" column (imputed from the total column). If there are 5 categories in this column, then there will be 5 such tests (per column). If the null hypothesis is rejected at some level of significance (say 5%), then the cell is "marked" .. if p1 < p2 the mark is ---, if p1 > p2 the mark is +++. There is no mark if the null hypothesis is not rejected.
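To make the marking rule concrete, here is a minimal Python sketch. The function name and the use of the normal approximation (rather than an exact t test, which is what crosstab packages effectively do at typical survey base sizes) are my own assumptions, not from the original post:

```python
from math import sqrt
from statistics import NormalDist

def mark_cell(x1, n1, x_total, n_total, alpha=0.05):
    """Two-sided test of a cell proportion against the 'balance':
    everyone in the total column who is not in this column.
    Returns '+++', '---', or '' (no mark)."""
    x2, n2 = x_total - x1, n_total - n1      # balance, imputed from totals
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)           # pooled proportion under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    if p_value >= alpha:
        return ''
    return '+++' if p1 > p2 else '---'
```

For example, a cell count of 60 out of a column base of 100, against a total of 100 out of 300, gives p1 = 0.60 versus a balance of p2 = 0.20 and is marked '+++'.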

This use of significance testing is somewhat in the spirit of Jones and Tukey's proposal, "A Sensible Formulation of the Significance Test" (http://forrest.psych.unc.edu/jones-tukey112399.html), in that no statement is made if the sample data do not support a difference in means.

A brief extract from the Jones and Tukey paper:

<extract>

"As a consequence, we should not set forth a null hypothesis

because to do so is unrealistic and misleading. Instead, we

should assess the sample data and entertain one of three

conclusions:

(1) act as if µA - µB > 0;

(2) act as if µA - µB < 0;

or

(3) act as if the sign of µA - µB is

indefinite, i.e., is not (yet) determined.

This specification is similar to "the three alternative

decisions" proposed by Tukey (1960, p. 425).

With this formulation, a conclusion is in error only when it

is "a reversal," when it asserts one direction while the

(unknown) truth is the other direction. Asserting that the

direction is not yet established may constitute a wasted

opportunity, but is not an error. We want to control the

rate of error, the reversal rate, while minimizing wasted

opportunity, i.e., while minimizing indefinite results."

[snip]

..The adoption of this proposed three-alternative conclusion

procedure lends itself to reporting the effect size as the

estimated value of µA - µB (either standardized or not).

Regardless of the size of that estimate, and regardless of

whether or not the calculated value of t falls in the

rejection region, it seems appropriate to report the p value

as the area of the t distribution more positive or more

negative (but not both) than the value of t obtained from

(yA - yB)/sd. (The limiting values of p then are 0, as

the absolute value of t becomes indefinitely large, and 1/2,

as the value of t approaches zero.)

For any specified positive or negative population mean

difference, there may be found in the usual way the

probability of a Type II error, of withholding judgment when

the parametric difference is as specified. For each

specified difference, the probability of a Type II error is

smaller than that for the conventional two-tailed test of

significance. Thus, the proposed procedure is uniformly

more powerful than the conventional procedure.

Hodges & Lehmann (1954) proposed a modification of the

traditional Student test, converting it from a two-sided

test of the null hypothesis to two one-sided tests. Kaiser

(1960) proposed combining the two one-sided tests into a

single test, but one with two directional alternative

hypotheses, µA < µB and µA > µB. (For further discussion,

see Bohrer, 1979, Bulmer, 1957, and Harris, 1997.) However,

the unrealistic null hypothesis of zero mean difference in

the population is included in these proposals, in contrast

to the formulation above. By acknowledging the fiction of

the null hypothesis, and following the implications from

"every null hypothesis is false," our formulation yields,

for any sample size and any value of α, a test with greater

reported sensitivity to detect the direction of a difference

between two population means.

Note that those accustomed to make "tests of hypotheses" at

.05 would, using the procedures set forth here, do the same

arithmetic but would describe their results as a "test of

significance" at one-half of .05, i.e., at .025.

Alternatively, to maintain at .05 the probability of acting

as if the parametric difference is in one direction when, in

fact, it is in the other, the investigator would employ the

.10 tabled value of alpha.

</extract>

In this context, where the objective is to draw attention to differences of importance, it might make sense to use two one-tailed t tests where the test is against a difference of a particular size (a threshold). This thresholding would presumably reduce the number of false positives (Type I errors), but increase the number of "no calls" (false negatives, Type II errors).
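One hedged sketch of such a threshold rule, as two one-sided z tests against ±delta (the unpooled standard error and all names are my choices, not from the post):

```python
from math import sqrt
from statistics import NormalDist

def mark_with_threshold(p1, n1, p2, n2, delta=0.05, alpha=0.05):
    """Two one-sided z-tests against a minimum difference delta.
    '+++' if p1 - p2 is significantly above +delta,
    '---' if it is significantly below -delta, '' otherwise."""
    # unpooled SE: there is no single null proportion to pool under
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    diff = p1 - p2
    if (diff - delta) / se > z_crit:
        return '+++'
    if (diff + delta) / se < -z_crit:
        return '---'
    return ''
```

A 2-point difference on bases of 400 and 800 gets no mark with delta = 5 points, while a 20-point difference is marked; small but "significant" differences are filtered out by construction.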

So far, so reasonable. But clearly we are conducting multiple significance tests.

With 5 tests in a single column and alpha at .05, the probability of finding at least one significant difference when there are none is 1 - .95^5 ≈ 0.23: but this is only a rough calculation because the 5 tests are not independent. Setting alpha at .01 is a partial solution, perhaps .. 1 - .99^5 ≈ .05 .. but again at the expense of the power of the test .. we will get more false negatives.
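As a check on the arithmetic, here is a small Python sketch (function names mine) that computes this family-wise error rate and inverts it Šidák-style to find the per-test alpha holding the family-wise rate at a target, under the admittedly false assumption of independence:

```python
def fwer(alpha, k):
    """Probability of at least one false positive in k independent tests."""
    return 1 - (1 - alpha) ** k

def sidak_alpha(target_fwer, k):
    """Per-test alpha giving exactly target_fwer under independence."""
    return 1 - (1 - target_fwer) ** (1 / k)

# round(fwer(0.05, 5), 3) -> 0.226 ; round(fwer(0.01, 5), 3) -> 0.049
# sidak_alpha(0.05, 5)    -> about 0.0102, barely above the .01 guess
```

So the .01 per-test level is very close to the exact Šidák answer for 5 tests; the remaining question is what it costs in power.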

Is there some way we can make a sensible tradeoff here?

(Since the sample sizes are fixed, the only factors

affecting power are the presumed/desired detectable effect

size and alpha.)

I don't know how to calculate the power of a set of

independent tests ... can anyone help me here?
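For what it is worth, under independence the power of a set of tests composes the same way as the error rate. A hedged sketch (all names mine; normal-approximation power for the two-proportion test, so only approximate):

```python
from math import sqrt
from statistics import NormalDist

def power_two_prop(p1, p2, n1, n2, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test
    (normal approximation, unpooled SE under the alternative)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_effect = abs(p1 - p2) / se
    return 1 - nd.cdf(z_crit - z_effect) + nd.cdf(-z_crit - z_effect)

def power_at_least_one(powers):
    """If the tests were independent: P(at least one detection)."""
    prod = 1.0
    for pw in powers:
        prod *= (1 - pw)
    return 1 - prod
```

For dependent tests this product formula does not hold, which is exactly why permutation-based methods (below in the thread) estimate the joint distribution instead.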

Still on the "single column, multiple tests" theme, other possibilities:

a) the Bonferroni method .. divide alpha by 5 (or should it be 4, since the five row proportions in a column sum to 100% and so only four are free?)

b) a preliminary chi-square test at alpha = 0.05, with no differences to be reported unless the chi-square test suggests that there are indeed differences. If I do the preliminary chi-square, should I then also do the Bonferroni adjustment?
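A sketch of option (b), assuming the preliminary test is a chi-square test of homogeneity between the column and its balance (a 2 x r table). The closed-form tail probability below works only for even degrees of freedom, which covers the 5-category case (dof = 4); all names are mine:

```python
from math import exp

def chi2_sf_even_dof(x, dof):
    """Survival function of the chi-square distribution, even dof only
    (closed form via the Poisson tail identity)."""
    assert dof % 2 == 0 and dof > 0
    term, total = 1.0, 1.0
    for i in range(1, dof // 2):
        term *= (x / 2) / i
        total += term
    return exp(-x / 2) * total

def gatekeeper_chi2(col_counts, balance_counts, alpha=0.05):
    """Preliminary chi-square test between a column and its balance.
    Returns (chi2, p, proceed): only run cell tests if proceed is True."""
    n1, n2 = sum(col_counts), sum(balance_counts)
    chi2 = 0.0
    for a, b in zip(col_counts, balance_counts):
        row = a + b
        e1, e2 = row * n1 / (n1 + n2), row * n2 / (n1 + n2)
        chi2 += (a - e1) ** 2 / e1 + (b - e2) ** 2 / e2
    dof = len(col_counts) - 1        # (r - 1) * (2 - 1)
    p = chi2_sf_even_dof(chi2, dof)
    return chi2, p, p < alpha
```

In practice one would use a library routine (e.g. scipy's chi2 distribution) rather than the even-dof closed form, but the gatekeeping logic is the same.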

So far, I have considered only one column. These tests are

to be applied to all columns .. so, 50 such tests. The

columns most certainly do not represent independent

samples.. clearly, from the descriptions, they overlap.

I am semi-convinced that, having solved the one-column problem, no further adjustments to the testing procedure are needed. The simple argument is a reductio ad absurdum .. if I added a column (L) which was identical to an existing column (K), would I expect the "significance results" in column K to be affected by this action? No. What if L was not identical to K? Same answer. So, the one-column-at-a-time idea seems OK. Maybe.

Now, consider another table .. same "break" but this time

the field "down" the page is "Frequency of Purchase".

Another set of 50 tests.

Obviously if we keep on doing this (adding tables) we are

going to find "significant results". But I am not sure that

the answer to the problem is to apply some mass (across

tables/across variates) Bonferroni adjustment.

Can anyone give me some technical/conceptual help here?



The analogy isn't exact, because in your case you don't quite have replicates, but the similarity is striking. In your case, it would tell you: expect n significant results at random. It wouldn't identify them for you, but it would give you an idea of how much error you have, and by raising your cutoff level you could tune the sensitivity/specificity tradeoff.

Anyway, my experience with multiple testing is that a simple Bonferroni adjustment doesn't work well in most cases. Have you looked at a Westfall step-down procedure?
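For illustration, here is a rough Python sketch of the simpler single-step maxT idea (mt.maxT in multtest implements the step-down refinement); the permutation details and all names are my own assumptions, not the package's API:

```python
import random
from math import sqrt

def tstat(xs, ys):
    """Welch-type t statistic for two samples."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    return (mx - my) / sqrt(vx / nx + vy / ny)

def maxT_adjusted_p(group_a, group_b, B=2000, seed=0):
    """Single-step maxT (Westfall & Young style) adjusted p-values.
    group_a, group_b: lists of per-subject tuples, one value per variable.
    The permutation distribution of max|t| respects the correlation
    between variables, unlike Bonferroni."""
    rng = random.Random(seed)
    k, na = len(group_a[0]), len(group_a)
    pooled = group_a + group_b
    obs = [abs(tstat([r[j] for r in group_a], [r[j] for r in group_b]))
           for j in range(k)]
    exceed = [0] * k
    for _ in range(B):
        rng.shuffle(pooled)                 # relabel subjects at random
        pa, pb = pooled[:na], pooled[na:]
        max_t = max(abs(tstat([r[j] for r in pa], [r[j] for r in pb]))
                    for j in range(k))
        for j in range(k):
            if max_t >= obs[j]:
                exceed[j] += 1
    return [(e + 1) / (B + 1) for e in exceed]
```

Because the permutations preserve the dependence between variables, this is exactly the kind of adjustment that suits overlapping crosstab columns better than Bonferroni.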

Thank you. I was unaware of the Westfall procedure, but following your suggestion I have done a little research and can see its potential applicability.

Thank you also for the suggestion of parallels with genomics research and the analysis of microarrays : this gives me some more food for thought and most probably a productive line of enquiry.

btw, the Westfall procedure is implemented in R as procedure mt.maxT in package multtest, so I can readily do some experimentation.

thank you again

-Tats

Contact me at John @ Data Sciences Research . com (remove all the spaces) and I can keep you up to date with my progress.

The genomics problem set may well be unique in statistics, in that the number of dimensions routinely measured per individual is >>> the number of individuals, and with an unknown correlation structure, or one that has to be estimated from the data at hand. I cannot help thinking that there must be analogous situations where a large ratio of attributes measured to cases measured upon is the natural order of things .. perhaps in physics, or high-dimensional time series .. well, just idle speculation.

There remains a substantive problem in my little corner of the world .. how to sensibly draw attention to differences that are worth following up, understanding that there is a cost in the follow up and a cost of missing out on some gems.

I shall pursue this. Thank you for informing my thinking.

rgds


IMO, you should make up a small numerical example, ask the question(s) in simple terms, and hope someone can answer it. Then extend this solution to your problem.

My guess is that we're talking about analysis of variance (or maybe analysis of covariance, which I know basically nothing about).