Before doing a linear regression on a data-set one needs to find out a few things about the data:

- is there a linear relation between the dependent and independent variables
- are the variables normally distributed (if not, do a Data Transformation)

Questions:

- In Excel, do I simply use the CORREL() function to determine a linear relation between the dependent and independent variables?
- How do I interpret these results (i.e. low value = low correlation, high value = high correlation)?
- To test normality, can I simply use the QQ Plot test or should I use other tests, like the K-S Test as well?
- How do I interpret the K-S Test results?
- How do I go about doing a "Data Transformation" and how do I know whether I should do it?
- Please do not confuse things with only Statistics talk when giving an answer because I might not understand what you're saying. I've uploaded a sample of a data-set.

Thank you!

There are many data sets that will have *some* correlation.

But, this may be exactly what you need.

It seems to me that it's better to remove any straight line component from the data if what you're interested in is the variability alone.

Or, equivalently, one can calculate the differences between the data set and a least squares straight-line fit to the data.
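To make the "differences from a least squares straight-line fit" idea concrete, here is a minimal sketch in plain Python. The function names (`linear_fit`, `residuals`) and the numbers are made up for illustration and are not taken from the uploaded spreadsheet. In Excel, `SLOPE()` and `INTERCEPT()` give the same fitted line.

```python
def linear_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(xs, ys):
    """Return the data with the fitted straight-line component removed."""
    slope, intercept = linear_fit(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Hypothetical values, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

res = residuals(xs, ys)
print(res)  # what is left over is the variability around the line
```

The residuals of a least-squares fit with an intercept always sum to (approximately) zero, so what remains is only the scatter around the trend.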

This isn't a comprehensive treatment, but it relates to classical analysis methods.

The first thing you would look at is the *sign* of the correlation coefficient.

If it's positive then there's a positive correlation to some degree - meaning that the two variables track "positively". When x gets bigger, y gets bigger.

If it's negative then the opposite holds: when x gets bigger, y tends to get smaller.

The value can be anywhere between -1 and +1. A magnitude of 1 means perfectly correlated: the points fall exactly on a straight line, and the sign tells you in which direction.

Anything strictly between -1 and +1 is a measure of "how much?". Zero means no linear relationship at all.

So, looking at your data and calculations, the number of cylinders matters not much at all.

But, the displacement matters a lot and the gears matter a fair amount (both negatively). i.e. the more displacement the worse fuel economy with almost 100% correlation. And, the more gears the worse fuel economy but not so greatly.
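For reference, the coefficient that Excel's `CORREL()` returns is the Pearson correlation coefficient. Here is a rough sketch of that calculation in plain Python. The function name `pearson_r` and the displacement and fuel economy figures are invented for illustration; they are not the values from the spreadsheet.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical numbers: larger displacement goes with lower fuel economy.
displacement = [1.6, 2.0, 2.5, 3.0, 4.0]
fuel_economy = [34.0, 30.0, 27.0, 23.0, 18.0]

r = pearson_r(displacement, fuel_economy)
print(r)  # negative and close to -1 for this made-up data
```

Read the sign first (direction), then the magnitude (how strongly the two variables track each other along a straight line).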

This all depends on your objectives and degree of comfort with the methods.

I might call this a "mapping" where you put the data through a formula (often one that can be reversed). I think this is more an "art form" and experience would probably mean a lot. I would read about it and seek examples to gain some insight. It's not always necessary or desirable.

An example would be to plot the data as a function of the reciprocal of the "fuel economy" (which is actually the distance per unit volume and not "economy"), that is, converted into unit volume per distance.

All that said, it's quite important to have a statement of the objectives or questions that you are trying to address. None of that is presented in this question or on the spreadsheet. Here are some examples:

1) How does fuel rate of consumption vary with the number of cylinders? Is there a relationship? How much?

2) How does fuel rate of consumption vary with the displacement? Is there a relationship? How much?

3) How does fuel rate of consumption vary with the number of cylinders but with the same displacement?

Your objectives and questions may be different of course. But it's this starting point that affects the answers to the questions you have asked.
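As a sketch of the reciprocal mapping mentioned earlier (distance per unit volume turned into volume per unit distance), something like the following would do. The function names, the `scale` parameter and the sample values are assumptions for illustration only; a scale of 100 gives "volume per 100 distance units", which is one common convention.

```python
def to_consumption(distance_per_volume, scale=100.0):
    """Map distance-per-volume values (e.g. miles per gallon)
    to volume per `scale` distance units (e.g. gallons per 100 miles)."""
    return [scale / v for v in distance_per_volume]


def to_economy(volume_per_distance, scale=100.0):
    """Reverse the mapping: the reciprocal transform is its own inverse."""
    return [scale / v for v in volume_per_distance]


# Hypothetical fuel economy values, for illustration only.
mpg = [18.0, 25.0, 40.0]
consumption = to_consumption(mpg)
print(consumption)
print(to_economy(consumption))  # recovers the original values, up to rounding
```

Whether the transformed variable is the better one to regress on depends on the objective: if the question is about fuel used, volume per distance is the more direct quantity.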