Solved

Multiple significance tests

Posted on 2003-11-11
Last Modified: 2013-11-13
Sensible Significance Testing in Market Research Cross Tabulations

It is common practice in market research, and perhaps in survey research generally, to produce up to several hundred cross tabulations: at least one per question asked.

Typically there is a "standard break" (the horizontal axis of the table) applied in each case. This break defines groups of interest to the researcher, e.g.
Total
Males
Females
Aged under 35
Aged 36-55
Aged 56 and over
Males Under 18
Females Under 18
Heavy Users
City 1
City 2

Clearly these groups are not mutually exclusive.

Given this large volume of output, the researcher obviously needs some method of drawing attention to "differences of importance". One way of doing this is to run numerous tests of significance and to flag the results with some device, e.g. +++ or --- or *** under the cell percentage.

To clarify: say the cross tabulation (crosstab) comprises the standard break mentioned above across the page, and some field (say, "Intention to Purchase") with a smallish number of categories (say 5) down the page. The values in the crosstab are vertical percentages, i.e. percentages of the base in each column.

The significance test used for a given cell is a two-tailed t test of the null hypothesis that there is no difference between p1, the proportion falling into this row in this column, and p2, the proportion falling into the corresponding row in the "balance" column (imputed from the total column). If there are 5 categories in the field, there will be 5 such tests per column. If the null hypothesis is rejected at some level of significance (say 5%), the cell is marked: --- if p1 < p2, +++ if p1 > p2. There is no mark if the null hypothesis is not rejected.
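
To make this concrete, here is a minimal sketch of the per-cell test in R. It assumes each break column is a subset of the total column, so the balance counts can be imputed by subtraction; the function name, the large-sample z approximation to the t, and the example counts are all my own illustrative choices:

cell_test <- function(x1, n1, x_tot, n_tot, alpha = 0.05) {
  ## x1, n1: row count and column base in the break column of interest
  ## x_tot, n_tot: corresponding row count and base in the total column
  x2 <- x_tot - x1                     # balance column: total minus this column
  n2 <- n_tot - n1
  p1 <- x1 / n1
  p2 <- x2 / n2
  p  <- (x1 + x2) / (n1 + n2)          # pooled proportion under the null
  se <- sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
  z  <- (p1 - p2) / se
  pval <- 2 * pnorm(-abs(z))           # two-tailed p value
  mark <- if (pval >= alpha) "" else if (p1 > p2) "+++" else "---"
  list(p1 = p1, p2 = p2, statistic = z, p.value = pval, mark = mark)
}

cell_test(x1 = 40, n1 = 100, x_tot = 150, n_tot = 500)   # illustrative counts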

This use of significance testing is somewhat in the spirit of Jones and Tukey's proposal, "A Sensible Formulation of the Significance Test" (http://forrest.psych.unc.edu/jones-tukey112399.html), in that no statement is made if the sample data do not support a difference in means.

A brief extract from the Jones and Tukey paper
<extract>

"As a consequence, we should not set forth a null hypothesis
because to do so is unrealistic and misleading.  Instead, we
should assess the sample data and entertain one of three
conclusions:

(1) act as if µA - µB > 0;

(2) act as if µA - µB < 0;

or

(3) act as if the sign of µA - µB is indefinite, i.e., is not (yet) determined.

This specification is similar to  "the three alternative
decisions" proposed by Tukey (1960, p. 425).

With this formulation, a conclusion is in error only when it
is "a reversal," when it asserts one direction while the
(unknown) truth is the other direction.  Asserting that the
direction is not yet established may constitute a wasted
opportunity, but is not an error.  We want to control the
rate of error, the reversal rate,  while minimizing wasted
opportunity, i.e., while minimizing indefinite results."

<snip>

..The adoption of this proposed three-alternative conclusion
procedure lends itself to reporting the effect size as the
estimated value of  µA - µB  (either standardized or not).
Regardless of the size of that estimate, and regardless of
whether or not the calculated value of t falls in the
rejection region, it seems appropriate to report the p value
as the area of the t distribution more positive or more
negative (but not both) than the value of t obtained from
(yA - yB)/sd.  (The limiting values of p then are 0, as
the absolute value of t becomes indefinitely large, and 1/2,
as the value of t approaches zero.)

For any specified positive or negative population mean
difference, there may be found in the usual way the
probability of a Type II error, of withholding judgment when
the parametric difference is as specified.  For each
specified difference, the probability of a Type II error is
smaller than that for the conventional two-tailed test of
significance.  Thus, the proposed procedure is uniformly
more powerful than the conventional procedure.

Hodges & Lehmann (1954) proposed a modification of the
traditional Student test, converting it from a two-sided
test of the null hypothesis to two one-sided tests.  Kaiser
(1960) proposed combining the two one-sided tests into a
single test, but one with two directional alternative
hypotheses, µA < µB and µA  > µB.  (For further discussion,
see Bohrer, 1979, Bulmer, 1957, and Harris, 1997.)  However,
the unrealistic null hypothesis of zero mean difference in
the population is included in these proposals, in contrast
to the formulation above.   By acknowledging the fiction of
the null hypothesis, and following the implications from
"every null hypothesis is false," our formulation yields,
for any sample size and any value of alpha, a test with greater
reported sensitivity to detect the direction of a difference
between two population means.

Note that those accustomed to make "tests of hypotheses" at
.05 would, using the procedures set forth here, do the same
arithmetic but would describe their results as a "test of
significance" at one-half of .05, i.e., at .025.
Alternatively, to maintain at .05 the probability of acting
as if the parametric difference is in one direction when, in
fact, it is in the other, the investigator would employ the
.10 tabled value of  alpha.

</extract>
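
The arithmetic in the extract's last paragraph is easy to verify in R (standard normal case, for simplicity):

z <- qnorm(1 - 0.10 / 2)       # critical value of a two-tailed test at alpha = .10
pnorm(z, lower.tail = FALSE)   # mass in one tail: 0.05, the reversal rate under the null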

In this context, where the objective is to draw attention to differences of importance, it might make sense to use two one-tailed t tests, each against a difference of a particular size (a threshold). This thresholding would presumably reduce the number of false positives (Type I errors) but increase the number of "no calls" (Type II errors, i.e. false negatives).
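
Something like the following, reusing the count layout from the earlier sketch; the threshold delta, the unpooled standard error, and the function name are my own assumptions about how one might implement it:

cell_test_threshold <- function(x1, n1, x_tot, n_tot, delta = 0.05, alpha = 0.05) {
  x2 <- x_tot - x1
  n2 <- n_tot - n1
  p1 <- x1 / n1
  p2 <- x2 / n2
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)   # unpooled standard error
  z_hi <- (p1 - p2 - delta) / se      # one-sided test of p1 - p2 > delta
  z_lo <- (p1 - p2 + delta) / se      # one-sided test of p1 - p2 < -delta
  if (pnorm(z_hi, lower.tail = FALSE) < alpha) "+++"
  else if (pnorm(z_lo) < alpha) "---"
  else ""                             # "no call": direction not established
}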

So far, so reasonable. But clearly we are conducting multiple significance tests.

With 5 tests in a single column and alpha at .05, the probability of finding at least one significant difference when there are none is 1 - .95^5, about 0.23; but this is only a rough calculation, because the 5 tests are not independent.


Setting alpha at .01 is perhaps a partial solution (1 - .99^5 is about .05), but again at the expense of the power of the test: we will get more false negatives.
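
As one-liners in R:

1 - 0.95^5   # 0.2262: P(at least one rejection among 5 independent tests at alpha = .05)
1 - 0.99^5   # 0.0490: the same at alpha = .01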

Is there some way we can make a sensible tradeoff  here?
(Since the sample sizes are fixed, the only factors
affecting power are the presumed/desired detectable effect
size and alpha.)

I don't know how to calculate the power of a set of
independent tests ... can anyone help me here?
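
If the k tests really were independent, a rough calculation would run like this in R (the n, p1, p2 values below are purely illustrative):

single <- power.prop.test(n = 200, p1 = 0.30, p2 = 0.40, sig.level = 0.05)$power
k <- 5
1 - (1 - single)^k   # P(at least one of the k true differences is detected)
single^k             # P(all k true differences are detected)

Since the row tests within a column are not independent, this can only bracket the truth; simulation seems the honest way to get the joint power.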

Still on the "single column, multiple tests" question, other possibilities:

a)      the Bonferroni method: divide alpha by 5, i.e. test each cell at .01 (or should the divisor be 4, since the 5 proportions in a column sum to 1 and only 4 are free to vary?)

b)      a preliminary chi-square test at alpha = 0.05: no differences to be reported unless the chi-square test suggests that there are indeed differences. If I do the preliminary chi-square, should I then also do the Bonferroni adjustment? (A sketch of both options follows.)
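
In R, with purely hypothetical counts for one column and its balance (the choice of prop.test for the per-cell comparison is my own):

col_counts <- c(20, 35, 25, 12, 8)     # hypothetical break-column counts for the 5 rows
bal_counts <- c(90, 110, 95, 60, 45)   # hypothetical balance counts (total minus column)

## (a) Bonferroni: adjust the raw two-proportion p values
p_raw <- sapply(1:5, function(r)
  prop.test(c(col_counts[r], bal_counts[r]),
            c(sum(col_counts), sum(bal_counts)))$p.value)
p.adjust(p_raw, method = "bonferroni")

## (b) preliminary chi-square gate on the whole 5 x 2 table
gate <- chisq.test(cbind(col_counts, bal_counts))
if (gate$p.value < 0.05) which(p_raw < 0.05 / 5) else integer(0)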


So far, I have considered only one column. These tests are to be applied to all columns: so, 50 such tests. The columns most certainly do not represent independent samples; clearly, from the descriptions above, they overlap.

I am semi-convinced that, having solved the one-column problem, no further adjustments to the testing procedure are needed. The simple argument is a reductio ad absurdum: if I added a column (L) that was identical to an existing column (K), would I expect the "significance results" in column K to be affected by this action? No. What if L were not identical to K? Same answer. So the one-column-at-a-time idea seems OK. Maybe.

Now, consider another table: same break, but this time the field down the page is "Frequency of Purchase". Another set of 50 tests.

Obviously, if we keep on doing this (adding tables), we are going to find "significant" results. But I am not sure that the answer to the problem is to apply some mass (across-tables, across-variates) Bonferroni adjustment.

Can anyone give me some technical/conceptual help here?


Question by: Mutley2003

5 Comments
Expert Comment by: rfr1tz (ID: 9731963)
This is one hell of a long question.

IMO, you should make up a small numerical example, ask the question(s) in simple terms and hope someone can answer it. Then extend this solution to your problem.

My guess is that we're talking about analysis of variance (or maybe analysis of covariance, which I know basically nothing about).
 
Accepted Solution by: Itatsumaki (earned 500 total points, ID: 9742465)
I followed everything you wrote. This type of multiple-testing problem has become common in genomics research in the last couple of years. Basically the problem is that the number of replicates is far smaller than the number of tests; this occurs very commonly with DNA microarrays. One approach you should consider is a permutation approach: a modified version of the t statistic is used, and overall control is of the False Discovery Rate (FDR) through a single parameter. To estimate the underlying null distribution, the labels (in your case, the rows) are repeatedly permuted. This approach is called SAM (Significance Analysis of Microarrays, from the Tibshirani group at Stanford).

The analogy isn't exact, because in your case you don't quite have replicates, but the similarity is striking. In your case it would tell you: expect n significant results at random. It wouldn't identify them for you, but it would give you an idea of how much error you have, and by varying your cutoff level you could tune sensitivity against specificity (a variance/bias sort of tradeoff).
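
To make the idea concrete in your setting, something like this (all data simulated; the per-row prop.test against the balance, and permuting the column-membership labels, are my guesses at the right mapping):

set.seed(1)
n_tot  <- 500
in_col <- sample(c(TRUE, FALSE), n_tot, replace = TRUE, prob = c(0.3, 0.7))  # column membership
answer <- sample(1:5, n_tot, replace = TRUE)    # e.g. "Intention to Purchase", 5 categories

row_pvals <- function(in_col, answer) {
  sapply(1:5, function(r) {
    x <- c(sum(answer[in_col] == r), sum(answer[!in_col] == r))
    n <- c(sum(in_col), sum(!in_col))
    prop.test(x, n)$p.value                     # column vs. balance, row r
  })
}

observed  <- sum(row_pvals(in_col, answer) < 0.05)
null_hits <- replicate(1000, sum(row_pvals(sample(in_col), answer) < 0.05))
c(observed = observed, expected_by_chance = mean(null_hits))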

Anyways, my experience with multiple testing is that a simple Bonferroni adjustment doesn't work well in most cases. Did you look at a Westfall step-down procedure?
 

Author Comment by: Mutley2003 (ID: 9762441)
Itatsumaki

Thank you. I was unaware of the Westfall procedure, but following your suggestion I have done a little research and can see its potential applicability.

Thank you also for the suggestion of parallels with genomics research and the analysis of microarrays : this gives me some more food for thought and most probably a productive line of enquiry.

btw, the Westfall procedure (the Westfall & Young step-down maxT) is implemented in R as mt.maxT in the multtest package, so I can readily do some experimentation.
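
For reference, the kind of call I have in mind (a sketch only: the 0/1 indicator data standing in for the crosstab rows are made up, and mt.maxT expects rows = variables, columns = samples):

library(multtest)                                # Bioconductor package
set.seed(1)
X <- matrix(rbinom(5 * 100, 1, 0.3), nrow = 5)   # 5 indicator "rows", 100 respondents
classlabel <- rep(0:1, each = 50)                # 0 = balance, 1 = break column
res <- mt.maxT(X, classlabel, B = 1000)          # step-down maxT adjusted p values
res[, c("index", "teststat", "rawp", "adjp")]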

thank you again

 
Expert Comment by: Itatsumaki (ID: 9774413)
If you get a chance, post back and let me know how things turn out: I'm curious if the genomics parallel works well.  In genomics we frequently say that our problem is unique in statistics, so if the analogy between the two holds it will be a surprise to a lot of people.

-Tats
 

Author Comment by: Mutley2003 (ID: 9785371)
Tats, OK, will do. I think it will take a while for me to do the experimentation, so maybe we should take this offline: if you email me at John @ Data Sciences Research . com (remove all the spaces), I can keep you up to date with my progress.

The genomics problem set may well be unique in statistics, in that the number of dimensions routinely measured per individual is >>> the number of individuals, and with a correlation structure that is unknown, or that has to be estimated from the data at hand. I cannot help thinking that there must be analogous situations where a large ratio of attributes measured to cases measured upon is the natural order of things, perhaps in physics or high-dimensional time series. Well, just idle speculation.

There remains a substantive problem in my little corner of the world: how to sensibly draw attention to differences that are worth following up, understanding that there is a cost to the follow-up and a cost to missing out on some gems.

I shall pursue this. Thank you for informing my thinking.

rgds



