Pre-Linear Regression Tests userTester used Ask the Experts™
on
Before doing a linear regression on a data-set one needs to find out a few things about the data:

• is there is a linear relation between the dependent and independent variables
• are the variables normally distributed (if not, do a Data Transformation)

Questions:

1. Book1.xlsx
2. In Excel, do I simply use the CORREL() function to determine a linear relation between the dependent and independent variables?
3. How do I interpret these results (i.e. low value = low correlation, high value = high correlation)?
4. To test normality, can I simply use the QQ Plot test or should I use other tests, like the K-S Test as well?
5. How do I interpret the K-S Test results?
6. How do I go about doing a "Data Transformation" and how do I know whether I should do it?
7. Please do not confuse things with only Statistics talk when giving an answer because I might not understand what you're saying. I've uploaded a sample of a data-set.

Thank you!
Comment
Watch Question

Do more with EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
Principal
Commented:
In Excel, do I simply use the CORREL() function to determine a linear relation between the dependent and independent variables?
I would not use CORREL because, as it appears you have found, then what?  It gives you an "indicator" but not something you can *use*.,So it can be used for interpreting the data but perhaps not operating on it further.
There are many data sets that will have *some* correlation.
But, this may be exactly what you need.

It seems to me that it's better to remove any straight line component from the data if what you're interested in is the variability alone.
Or, equivalently, one can calculate the differences between the data set and a least squares straight-line fit to the data.
This isn't a comprehensive treatment but it has relationship to classical analysis methods.
How do I interpret these results (i.e. low value = low correlation, high value = high correlation)?
The first thing you would look at is the *sign* of the correlation coefficient.
If it's positive then there's a positive correlation to some degree - meaning that the two variables track "positively".  When x gets bigger, y gets bigger.
If it's negative then the opposite.  When x goes positive, y goes negative.
The value can be anywhere between -1 and +1.  "1" means perfectly correlated - they track 100% and the sign tells you in which direction.
Anything greater than -1 and less than +1 is a measure of "how much?".  Zero means "not at all".
So, looking at your data and calculations, the number of cylinders matters not much at all.
But, the displacement matters a lot and the gears matter a fair amount (both negatively).  i.e. the more displacement the worse fuel economy with almost 100% correlation.  And,  the more gears the worse fuel economy but not so greatly.
To test normality, can I simply use the QQ Plot test or should I use other tests, like the K-S Test as well?
This all depends on your objectives and degree of comfort with the methods.
How do I go about doing a "Data Transformation" and how do I know whether I should do it?
I might call this a "mapping" where you put the data through a formula (often one that can be reversed).  I think this is more an "art form" and experience would probably mean a lot.  I would read about it and seek examples to gain some insight.  It's not always necessary or desirable.
An example would be to plot the data as a function of the reciprocal of the "fuel economy" (which is actually the distance per unit volume and not "economy") into unit volume per distance.

All that said, it's quite important to have a statement of the objectives or questions that you are trying to address.  None of that is presented in this question or on the spreadsheet.  Here are some examples:

1) How does fuel rate of consumption vary with the number of cylinders?  Is there a relationship?  How much?
2) How does fuel rate of consumption vary with the displacement?  Is there a relationship?  How much?
3) How does fuel rate of consumption vary with the number of cylinders but with the same displacement?
Your objectives and questions may be different of course.  But it's this starting point that affects the answers to the questions you have asked.

Commented:
Thanks, this helps.

Do more with Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.