should I do chi-analysis with cross-tabulation?

Posted on 2006-04-24
Last Modified: 2012-05-05
Two questions in my survey:
One question asks companies whether they will join a portal.
Another question asks companies what size they are in terms of staff strength.

From a cross-tabulation, I find that small companies and accelerating-growth companies are willing to join.
Initially, I thought only small companies would join.

Question: should I continue with a chi-test (chi-analysis)?
              Or is the cross-tab enough to prove my hypothesis?

How can I do a chi-test when I'm not sure what the expected values are?
Question by:spiral
    LVL 26

    Accepted Solution

    Look at

    The Chi Square test would be appropriate if you could break your respondents in two groups,
    large and small companies for example, or perhaps growing or stable companies.

    But this is a problem, because company size is a continuous (analog) value.
    You could pick some threshold so that  [greater than ==> BIG]  and  [less than ==> SMALL].
    But this arbitrarily and unnecessarily smears your data.

    A quantitative and more convincing approach, if you have enough data, would be to plot the probability of
    a company joining a portal versus company size.  You might be able to do a linear regression and get a good
    correlation coefficient.

    Author Comment

    OK, but if I ask them

    whether they are just startups, moderate growth, declining... (which are a few groups, NOT two)

    it should be OK to do a chi-test, right?

    I am trying to standardise the method to use for doing analysis on my survey questions... I've got 60+, so plotting a graph for that 1 question will be very tedious, and it will take too long for analysing just 1 question.
    LVL 1

    Expert Comment

    There are two reasons to perform a chi-squared test.

    1. To show, with a stated level of uncertainty, that your data isn't just random static, i.e. that there's a correlation between company size and willingness to join (test for independence).

    2. To show, with a stated level of uncertainty, that a hypothetical formula for the covariation of company size and willingness to join fits the data.

    I can only tell you about the first. It's fairly simple.

    Here the expected value for any given cell in your crosstab equals the product of that cell's row and column marginal counts divided by the total count.
    You'll have to put business size into as many groupings as you deem appropriate.

    Once you've got the expected values you can calculate the chi-squared value for each cell, (observed - expected)^2 / expected, and their sum is the chi-squared value of the crosstab.

    Basically, if you don't do a chi-squared test for independence, you can't show that your apparently convincing data isn't just the result of an unlucky (and/or too small) draw.

    To be even more statistically correct, you should start with a hypothesis saying that there's no correlation. Once you show that the chance of getting data like yours under that hypothesis is below 5% (or 2.5% or 1% or whatever), you can accept the alternative hypothesis, which is that small and accelerated-growth companies will be willing to join.

    It would be statistically incorrect, AFAIK, to take the correlation coefficient between a nominal and a scale variable.

    So, to sum it all up, yes, you should do a Chi-squared test for independence at the very least.
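
The recipe in this comment (expected count = row total * column total / grand total, then sum the per-cell chi-squared values) can be sketched in a few lines of plain Python. This is only an illustration: the function names and the 2x2 counts are made up for the example and are not survey data.

```python
def expected_counts(table):
    """Expected count per cell under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_squared(table):
    """Sum of (observed - expected)^2 / expected over all cells.

    Assumes no row or column is completely empty; otherwise an expected
    count is 0 and the division fails.
    """
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical counts, NOT survey data: rows = two size groups, columns = yes / no.
toy_table = [[30, 10],
             [20, 40]]

toy_expected = expected_counts(toy_table)   # expected counts per cell
toy_chi2 = chi_squared(toy_table)           # chi-squared value of the crosstab
```

The resulting value is then compared with a tabulated critical value for (rows - 1) * (columns - 1) degrees of freedom.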


    Author Comment

    Some of the cells in my cross-tab are 0, and 3 of them are 1s. Can I still do a chi-test? I got 63 responses.

    Can someone point me to a good resource that explains regression analysis and linear regression? I'm trying to decide which is the best approach here.
    LVL 1

    Expert Comment

    How many companies did you ask?

    IIRC then yes, you can do a chi-test with those numbers, but you'll have to place the uncertainty limit pretty high in order to take your alternative hypothesis.

    One might argue that _because_ your numbers are so small, it would be necessary to prove that your results aren't just random.

    I don't know much about regression, maybe you should try this site:
    LVL 1

    Expert Comment

    You could also just create larger groups with regard to business size; that would put more companies in each cell.

    Author Comment

    I got 63 responses... but when asked whether the company is declining (not making business), few answered, resulting in some cells being 0. What do you mean by larger groups with regards to business size? Can you give an example?
    LVL 1

    Expert Comment

    Say you have a crosstab of "size" by "willingness to join".

    You create a crosstab with 2 columns (will join yes/no)
    and 8 rows, for 8 different size categories for the companies.

    If you were to group the 8 groups into 4, you would have more responses in each cell.
    You should group them where responses are thin. For example, if there are few responses in size groups 1-3 and 7-8 you should group those; that would give you 5 groups: 1-3, 4, 5, 6 and 7-8.

    If this doesn't make much sense, could you please explain your data, goal and problems in more detail.
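
The grouping idea above could look roughly like this in Python. The helper `pool_rows` and the counts for the 8 size categories are hypothetical, chosen only so that groups 1-3 and 7-8 are thin as in the example.

```python
def pool_rows(table, groups):
    """Merge rows of a cross-tab; `groups` lists the row indices that form each new row."""
    n_cols = len(table[0])
    return [
        [sum(table[i][j] for i in group) for j in range(n_cols)]
        for group in groups
    ]


# Hypothetical counts for 8 size categories (columns: will join yes / no).
by_size = [[1, 0], [2, 1], [0, 2], [9, 6], [8, 7], [7, 9], [1, 1], [0, 2]]

# Size groups 1-3 and 7-8 are thin, so pool them; groups 4, 5 and 6 stay as they are.
pooled = pool_rows(by_size, [[0, 1, 2], [3], [4], [5], [6, 7]])
```

Pooling keeps the total number of responses unchanged; it only trades detail in the size variable for larger counts per cell.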

    Author Comment

    OK, my companies are grouped according to
    Accelerating Growth,
    Moderate Growth,

    Whether they would like to join the website:

    So I got 1 cross-tab:

                       yes     no      dunno
     Growth stage      10      6       4
                       4       3       4
                       13      6       1
                       7       3       1
        ..declining    0       0       1
                       34      18      11

    Author Comment

    I'd like to prove that start-ups are the most likely to join... but from the cross-tab, that seems only half-correct.
    LVL 1

    Expert Comment

    Right so that gives us:

                         Yes     No      Marginal
    Accelerated          10      6       16
                         4       3       7
                         13      6       19
                         7       3       10
    Declining            0       0       0
    Marginal             34      18      52

    The chi-value is 0,047235736

    Which means that there is a dependence between growth rate and willingness to join.

    I'll get back to you later
    LVL 1

    Expert Comment

    I miscalculated before, here's the result:

    Observed counts:
                             Yes      No       Total
    Startup                  10       6        16
    Accelerating growth      4        3        7
    Moderate growth          13       6        19
    Mature                   7        3        10
    Declining                0        0        0
    Total                    34       18       52

    Expected counts:
                             Yes      No       Total
    Startup                  10,46    5,54     16
    Accelerating growth      4,58     2,42     7
    Moderate growth          12,42    6,58     19
    Mature                   6,54     3,46     10
    Declining                0        0        0
    Total                    34       18       52

    Chi-squared value per cell:
                             Yes      No
    Startup                  0,02     0,04
    Accelerating growth      0,07     0,14
    Moderate growth          0,03     0,05
    Mature                   0,03     0,06

    Sum:      0,44   <------ This is chi-squared
    For a 5% significance level the critical value with 3 degrees of freedom (I excluded the declining growth) is  7,81
    Thus your data is significant and not just random.

    As you said, the data doesn't meet up to your expectations.

    Theory: Maybe accelerated growth companies already have many offers like yours and are thus not so ready to join, while startup companies and moderate growth companies will take anything they can get?

    Author Comment

    Question: how do you get the critical value of 7.81? Do you round 0.44 up to the 5% significance level?

    According to the guide, chi-square has some requirements, and my data has zeros and 1s, so it doesn't fulfil the requirements. But is it OK if I leave out that data and do a chi-test (like what you did)?

    The sample must be randomly drawn from the population.
    Data must be reported in raw frequencies (not percentages);
    Measured variables must be independent;
    Values/categories on independent and dependent variables must be mutually exclusive and exhaustive;
    Observed frequencies cannot be too small.
    1) As with any test of statistical significance, your data must be from a random sample of the population to which you wish to generalize your claims.

    2) You should only use chi square when your data are in the form of raw frequency counts of things in two or more mutually exclusive and exhaustive categories. As discussed above, converting raw frequencies into percentages standardizes cell frequencies as if there were 100 subjects/observations in each category of the independent variable for comparability. Part of the chi square mathematical procedure accomplishes this standardizing, so computing the chi square of percentages would amount to standardizing an already standardized measurement.

    3) Any observation must fall into only one category or value on each variable. In our footwear example, our data are counts of male versus female undergraduates expressing a preference for five different categories of footwear. Each observation/subject is counted only once, as either male or female (an exhaustive typology of biological sex) and as preferring sandals, sneakers, leather shoes, boots, or other kinds of footwear. For some variables, no 'other' category may be needed, but often 'other' ensures that the variable has been exhaustively categorized. (For some kinds of analysis, you may need to include an "uncodable" category.) In any case, you must include the results for the whole sample.

    4) Furthermore, you should use chi square only when observations are independent: i.e., no category or response is dependent upon or influenced by another. (In linguistics, often this rule is fudged a bit. For example, if we have one dependent variable/column for linguistic feature X and another column for number of words spoken or written (where the rows correspond to individual speakers/texts or groups of speakers/texts which are being compared), there is clearly some relation between the frequency of feature X in a text and the number of words in a text, but it is a distant, not immediate dependency.)

    5) Chi-square is an approximate test of the probability of getting the frequencies you've actually observed if the null hypothesis were true. It's based on the expectation that within any category, sample frequencies are normally distributed about the expected population value. Since (logically) frequencies cannot be negative, the distribution cannot be normal when expected population values are close to zero--since the sample frequencies cannot be much below the expected frequency while they can be much above it (an asymmetric/non-normal distribution). So, when expected frequencies are large, there is no problem with the assumption of normal distribution, but the smaller the expected frequencies, the less valid are the results of the chi-square test. We'll discuss expected frequencies in greater detail later, but for now remember that expected frequencies are derived from observed frequencies. Therefore, if you have cells in your bivariate table which show very low raw observed frequencies (5 or below), your expected frequencies may also be too low for chi square to be appropriately used. In addition, because some of the mathematical formulas used in chi square use division, no cell in your table can have an observed raw frequency of 0.
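
Requirement 5 can be checked mechanically. The sketch below lists the cells of the yes/no table from this thread whose expected count falls under a threshold. The threshold of 5 is a common rule of thumb taken here as an assumption, and the helper names are illustrative.

```python
def expected_counts(table):
    """Expected count per cell under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def small_cells(table, threshold=5.0):
    """Return (row index, column index, expected count) for cells expected below `threshold`."""
    return [
        (i, j, e)
        for i, exp_row in enumerate(expected_counts(table))
        for j, e in enumerate(exp_row)
        if e < threshold
    ]


# Yes / No counts from the thread; the Declining row is left out because it has no yes/no answers.
# Row order: Startup, Accelerating growth, Moderate growth, Mature.
observed = [[10, 6], [4, 3], [13, 6], [7, 3]]

flagged = small_cells(observed)   # cells where the chi-squared approximation is weakest
```

If many cells are flagged, pooling categories or collecting more responses are the usual remedies.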


    Author Comment

    d-glitch: are we wrong here? please discuss this issue.

    Author Comment

    please discuss this issue so i can award the points to whoever can enlighten me. Thanks. It is urgent.
    LVL 1

    Expert Comment

    Basically the only one that's potentially problematic is the last one, as the expected frequencies are as low as 2,42, 3,46 and 4,58.
    If you feel this is significant, you should not make the test.

    Actually, I think I goofed up again :(

    You see, what we are testing is our hypothesis that they're independent. The 5% significance level is a decision rule: if there's a 5% or greater probability of getting a value like ours while that hypothesis is true, we keep the hypothesis, otherwise we don't.
    That is, if our number is larger than the critical value, the two variables are dependent.
    If our number is smaller than the critical value we have to assume that they're independent.

    With a chi-square value of only 0,44 we have to assume that the two variables are independent.
    Contrary to the example in the guide our value is less than the critical:

    >Table 1's chi square value of 14.026, with 4 degrees of freedom, handily clears the related
    >critical value of 9.49, so we can reject the null hypothesis and affirm the claim that male
    >and female undergraduates at University of X differ in their (self-reported) footwear

    This unfortunately means that you can't conclude anything much from your data.

    Correct me if I'm wrong d_glitch
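
The comparison described in this comment can be reproduced with a short Python sketch using only the standard library. The function names are illustrative. The p-value formula is the closed form for a chi-squared variable with exactly 3 degrees of freedom and does not apply to other table sizes.

```python
import math


def chi_squared(table):
    """Chi-squared statistic: sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed_count - expected) ** 2 / expected
    return stat


def p_value_df3(x):
    """Upper-tail probability of a chi-squared variable with exactly 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Yes / No counts from the thread (Startup, Accelerating, Moderate, Mature); Declining excluded.
observed = [[10, 6], [4, 3], [13, 6], [7, 3]]

chi2 = chi_squared(observed)
dof = (len(observed) - 1) * (len(observed[0]) - 1)   # (rows - 1) * (columns - 1)
p = p_value_df3(chi2)
reject_independence = chi2 > 7.81                     # 5% critical value for 3 degrees of freedom
```

A statistic far below the critical value (equivalently, a p-value far above 0.05) means the table gives no evidence against independence; it does not show that the variables are independent.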

    Author Comment

    so i should or should not stick to chi-test?
    LVL 1

    Expert Comment

    Well... it depends.

    The chi-test tells you that you can't draw conclusions from your survey, because the results might as well be random.
    Of course, there's no real way to tell whether this is due to the actual results or the low numbers.

    I would say that you should do a chi-test, and make sure you have a larger sample next time.
    Also, if you include the "dunno" category in the chi-test, the result might be different.

    But yes, it is proper research practice to test for the significance of your findings.

    Do you know how to do a confidence interval?
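
To see what including the "dunno" column does, the same statistic can be computed for the full 5x3 table. This is a sketch under assumptions: the function name is made up, and dropping empty rows or columns before the calculation is one possible way to avoid expected counts of 0. The tabulated 5% critical value for 8 degrees of freedom is about 15.51, but with several expected counts well below 5 the approximation is shaky here.

```python
def chi_squared_and_dof(table):
    """Chi-squared statistic and degrees of freedom for an r x c table of counts."""
    # Rows or columns that sum to zero have expected counts of 0, so drop them first.
    rows = [row for row in table if sum(row) > 0]
    cols = [j for j in range(len(table[0])) if sum(row[j] for row in rows) > 0]
    rows = [[row[j] for j in cols] for row in rows]

    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(rows):
        for j, observed_count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed_count - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(cols) - 1)
    return stat, dof


# Yes / No / Don't know counts from the thread, all five growth stages.
full_table = [[10, 6, 4], [4, 3, 4], [13, 6, 1], [7, 3, 1], [0, 0, 1]]
chi2_full, dof_full = chi_squared_and_dof(full_table)

# Yes / No only: the empty Declining row is dropped automatically.
chi2_yes_no, dof_yes_no = chi_squared_and_dof([row[:2] for row in full_table])
```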

    Author Comment

    by:spiral
    I'm not very sure about the confidence interval.
    I did some calculation, but if I include the dunno column I get negative results.
    the rule say I cannot include cells with how?

    Also, I'm not very sure about the linear regression that d_glitch talked about. I'd like to finish the analysis of my survey even though my data is not 100% perfect. So can more people join in this discussion?


    Author Comment

    JayJay : are u there?
    LVL 1

    Expert Comment

    Yes I'm still here :)

    I still don't know how to do a linear regression.

    Confidence intervals tell you the range within which the mean is located, but I'm betting the interval will span from mature to startup.
    Show me your calculations and I'll take a look; there are a lot of ways you could do a confidence interval with that table.

    Most likely, with the result of the chi-analysis, any statistical testing you do will give you bad results (again I don't know about the linear regression though).

    I suggest sticking to describing the data instead of doing all sorts of statistics that would discredit your results.

    Author Comment

                              Use Dynamic Portal?
    Stage of Development      Yes     No      Don't know      N
    Startup                   10      6       4               20
    Accelerating Growth       4       3       4               11
    Moderate Growth           13      6       1               20
    Mature Growth             7       3       1               11
    Declining                 0       0       1               1
    Total                     34      18      11              63

    hmm whats the use of finding the mean?
    LVL 1

    Assisted Solution

    If you give each category a value from 1 to 5, then find the mean with regard to the companies that answered yes, you can see who's most likely to buy the service/product.

    Startup = 10*1 = 10
    Acc = 4*2 = 8
    Moderate = 13*3 = 39
    Mature = 7*4 = 28
    Decl = 0*5 = 0

    sum: 85
    85/34 = 2,50
    That is, the typical company that will buy the product sits between groups 2 and 3 (or companies are equally spaced around 2,50).

    You could also do this with regards to the yes/no/dunno axis:

    Yes: 34*1 = 34
    No: 18*2 = 36
    Dunno: 11*3 = 33

    sum: 103
    103/63 = 1,63

    So most people will answer somewhere between 1 (yes) and 2 (no), but leaning towards 2.

    However, if you do a confidence interval the picture isn't that clear.

    [mean - critical value*sqrt(standard deviation/sample size) ; mean + critical value*sqrt(standard deviation/sample size)]

    For the first mean of 2,50 with a variance of 195 we get an interval of 1,24 to 3,76
    Companies most likely to buy your product will be in the upper 3 groups of your table.

    For the second mean of 1,63 and variance 1403 we get an interval of [0,12 ; 3,15]

    So basically that tells us that the true average is somewhere between 0,12 (below yes) and 3,15 (above dunno), so it doesn't really tell us anything.
    I used a significance level of 5% and a critical value of 1,96.

    I don't know if this helped or not :s
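
For comparison, here is a sketch of the textbook normal-approximation interval, mean +/- z * s / sqrt(n), where s is the sample standard deviation. It recomputes the means and variances directly from the counts in the table, so its output need not match the figures quoted in this comment. Treating the categories as equally spaced scores (1, 2, 3, ...) is the assumption made in the comment itself, and the function name is illustrative.

```python
import math


def mean_and_interval(scores_with_counts, z=1.96):
    """Mean of coded scores and a normal-approximation interval: mean +/- z * s / sqrt(n)."""
    n = sum(count for _, count in scores_with_counts)
    mean = sum(score * count for score, count in scores_with_counts) / n
    # Sample variance of the individual scores (n - 1 in the denominator).
    variance = sum(count * (score - mean) ** 2 for score, count in scores_with_counts) / (n - 1)
    half_width = z * math.sqrt(variance / n)
    return mean, mean - half_width, mean + half_width


# (score, count) pairs: growth stage 1-5 among the 34 companies that answered yes.
stage_yes = [(1, 10), (2, 4), (3, 13), (4, 7), (5, 0)]
# (score, count) pairs: yes = 1, no = 2, dunno = 3 among all 63 responses.
answers = [(1, 34), (2, 18), (3, 11)]

stage_mean, stage_low, stage_high = mean_and_interval(stage_yes)
answer_mean, answer_low, answer_high = mean_and_interval(answers)
```

Whether a mean of category codes is meaningful at all depends on how evenly spaced the categories really are, which the survey itself cannot tell you.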

    Author Comment

    by:spiral
    The confidence level isn't of much help?

    LVL 1

    Expert Comment

    I doubt any statistics will do anything but deface your results.
