Pre-Linear Regression Tests

Before doing a linear regression on a data-set one needs to find out a few things about the data:

  • is there is a linear relation between the dependent and independent variables
  • are the variables normally distributed (if not, do a Data Transformation)

Questions:

  1. Book1.xlsx
  2. In Excel, do I simply use the CORREL() function to determine a linear relation between the dependent and independent variables?
  3. How do I interpret these results (i.e. low value = low correlation, high value = high correlation)?
  4. To test normality, can I simply use the QQ Plot test or should I use other tests, like the K-S Test as well?
  5. How do I interpret the K-S Test results?
  6. How do I go about doing a "Data Transformation" and how do I know whether I should do it?
  7. Please do not confuse things with only Statistics talk when giving an answer because I might not understand what you're saying. I've uploaded a sample of a data-set.

Thank you!
userTesterAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Fred MarshallPrincipalCommented:
In Excel, do I simply use the CORREL() function to determine a linear relation between the dependent and independent variables?
I would not use CORREL because, as it appears you have found, then what?  It gives you an "indicator" but not something you can *use*.,So it can be used for interpreting the data but perhaps not operating on it further.
There are many data sets that will have *some* correlation.
But, this may be exactly what you need.

It seems to me that it's better to remove any straight line component from the data if what you're interested in is the variability alone.
Or, equivalently, one can calculate the differences between the data set and a least squares straight-line fit to the data.
This isn't a comprehensive treatment but it has relationship to classical analysis methods.
How do I interpret these results (i.e. low value = low correlation, high value = high correlation)?
The first thing you would look at is the *sign* of the correlation coefficient.  
If it's positive then there's a positive correlation to some degree - meaning that the two variables track "positively".  When x gets bigger, y gets bigger.  
If it's negative then the opposite.  When x goes positive, y goes negative.
The value can be anywhere between -1 and +1.  "1" means perfectly correlated - they track 100% and the sign tells you in which direction.
Anything greater than -1 and less than +1 is a measure of "how much?".  Zero means "not at all".
So, looking at your data and calculations, the number of cylinders matters not much at all.
But, the displacement matters a lot and the gears matter a fair amount (both negatively).  i.e. the more displacement the worse fuel economy with almost 100% correlation.  And,  the more gears the worse fuel economy but not so greatly.
To test normality, can I simply use the QQ Plot test or should I use other tests, like the K-S Test as well?
This all depends on your objectives and degree of comfort with the methods.
How do I go about doing a "Data Transformation" and how do I know whether I should do it?
I might call this a "mapping" where you put the data through a formula (often one that can be reversed).  I think this is more an "art form" and experience would probably mean a lot.  I would read about it and seek examples to gain some insight.  It's not always necessary or desirable.
An example would be to plot the data as a function of the reciprocal of the "fuel economy" (which is actually the distance per unit volume and not "economy") into unit volume per distance.

All that said, it's quite important to have a statement of the objectives or questions that you are trying to address.  None of that is presented in this question or on the spreadsheet.  Here are some examples:

1) How does fuel rate of consumption vary with the number of cylinders?  Is there a relationship?  How much?
2) How does fuel rate of consumption vary with the displacement?  Is there a relationship?  How much?
3) How does fuel rate of consumption vary with the number of cylinders but with the same displacement?
Your objectives and questions may be different of course.  But it's this starting point that affects the answers to the questions you have asked.

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
userTesterAuthor Commented:
Thanks, this helps.
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Math / Science

From novice to tech pro — start learning today.