# Determining most influential independent variables on the dependent variable

I have about twenty independent variables and one dependent variable.  How do I determine the most "influential" independent variables on the dependent variable?  Eventually I'd like to run a regression analysis on the most influential independent variables vs. the dependent variable to come up with an equation relating them, but I first need to narrow this down.

Any websites, tools, etc. would be great.  I have all my data in Excel too.

Professor

Commented:
You actually narrow it down with regression itself - although it's multiple regression, specifically.  You can compare beta weights (standardized coefficients) to determine the most important predictors in your model by entering all of your IVs simultaneously (you might call this a fully saturated model).  Whichever beta has the largest absolute value is the best predictor in your dataset.  But be aware that if you have substantial intercorrelations amongst your predictors, this may be misleading, as one predictor may drown out the beta of another it is correlated with.  This approach also capitalizes heavily on chance, but without any underlying theory to support which predictors you're choosing, it's basically the only way to proceed.
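For anyone who wants to see the idea outside SPSS, here is a minimal Python sketch of comparing beta weights with all predictors entered at once. It assumes NumPy is available; the function name `standardized_betas` and the synthetic data are invented for illustration and are not from the asker's dataset.

```python
import numpy as np

def standardized_betas(X, y):
    """Return beta weights from a regression with all predictors entered together.

    X: (n_samples, n_predictors) array, y: (n_samples,) array.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # z-score every variable so the coefficients share a common scale
    Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    yz = (y - y.mean()) / y.std(ddof=1)
    # No intercept column is needed: every standardized variable has mean zero
    betas, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
    return betas

# Synthetic data, made up purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=500)

betas = standardized_betas(X, y)
ranking = np.argsort(-np.abs(betas))  # predictor indices, largest |beta| first
print("beta weights:", np.round(betas, 3))
print("ranking by |beta|:", ranking)
```

The ranking uses absolute values because a strongly negative beta is just as influential as a strongly positive one. The caveat about intercorrelated predictors still applies to whatever this prints.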

This SPSS-based regression guide might help: http://www.ats.ucla.edu/stat/SPSS/webbooks/reg/chapter1/spssreg1.htm
Professor

Commented:
It also occurred to me you might be looking for free stats software.  OpenStat is pretty good (it mimics SPSS to some degree) and is reasonably user-friendly: http://www.statpages.org/miller/openstat/

R is much more powerful and also free, but the cost is that a statistics non-expert is very unlikely to understand how it works: http://www.r-project.org/
Senior Risk Manager
Commented:
I would recommend using a stepwise regression analysis.  Any stats software should allow for forward, backwards, and stepwise regressions.

http://en.wikipedia.org/wiki/Stepwise_regression

R is a great package, but even I had trouble first getting used to it in my master's program for statistics.  If you have access to SPSS, SAS, Matlab, JMP (on mac), or some other stats package, it may be easier to use.

I will say there is a GUI add-in for R that makes most basic statistical analyses fairly simple, just as in any other GUI based software package.  I believe it is called "R Commander".  It also will show you the syntax for the procedures you select, similar to what I know you can also do in SPSS.

WC
Professor
Commented:
Stepwise multiple regression will give you the same answer as my approach if you are only interested in extracting the single most influential predictor (although technically speaking, I suppose you could just look at a correlation matrix for that anyway).  The major difference is that with stepwise regression, multicollinearity (the predictor intercorrelations that I mentioned earlier) can lead you to a more misleading answer.

The reason is that stepwise regression will identify the predictor with the highest correlation with your criterion, hold it aside, and see which of the remaining predictors adds the most value.  This in turn means that any remaining predictors that are highly correlated with that first predictor will be unlikely to be picked, even though they might have practical value for whatever it is that you're doing.  This process continues iteratively until little value is added by any more predictors.  For an example of the problem:

correlation a vs. outcome = .70
correlation b vs. outcome = .68
correlation a vs. b = .80

Stepwise regression of your outcome on a and b will identify predictor a as a valuable predictor and predictor b as one to be discarded, even though the reality of the situation (depending on what a and b really are, how much they cost, or any other number of contextual factors) might specify a different answer, depending on your objectives.
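To put numbers on this example, here is a small worked sketch in Python. It uses the standard closed-form result for a two-predictor regression on standardized variables, where the betas and R^2 follow directly from the three correlations. The function name `two_predictor_betas` is just illustrative.

```python
def two_predictor_betas(r_ay, r_by, r_ab):
    """Standardized betas for regressing y on predictors a and b,
    computed from the three pairwise correlations."""
    denom = 1.0 - r_ab ** 2
    beta_a = (r_ay - r_by * r_ab) / denom
    beta_b = (r_by - r_ay * r_ab) / denom
    return beta_a, beta_b

# Correlations from the example above
r_ay, r_by, r_ab = 0.70, 0.68, 0.80
beta_a, beta_b = two_predictor_betas(r_ay, r_by, r_ab)

r2_a_only = r_ay ** 2                        # variance explained by a alone
r2_b_only = r_by ** 2                        # variance explained by b alone
r2_both = beta_a * r_ay + beta_b * r_by      # variance explained by a and b together
gain_from_b = r2_both - r2_a_only            # what b adds once a is in the model

print(round(beta_a, 3), round(beta_b, 3))
print(round(r2_a_only, 4), round(r2_b_only, 4), round(r2_both, 4), round(gain_from_b, 4))
```

With these inputs the betas come out at roughly 0.43 for a and 0.33 for b. Predictor a alone explains 0.49 of the variance and b alone explains about 0.46, yet adding b to a model that already contains a raises R^2 only from 0.49 to 0.53. That small increment is why a stepwise procedure may leave b out even though b on its own is nearly as good as a.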

It's tempting to look for a single analytic strategy that will provide you with "the answer," but this can be dangerous.  I apply statistics to business settings, where such misinterpretation and overinterpretation are common - for example, if a cost 10x more than b, it could well be more profitable to use b, despite the slightly better fit of a in the model.  If the setting where you are applying this is less critical, then this consideration may be less important.
Senior Risk Manager

Commented:
Rich,

Good comments, but as with any analysis, business knowledge will weigh into any decision, and is usually more important.  The question here seems to ask for a group of the most predictive independent variables.  Obviously, if business knowledge says certain variables "need" to be in the model, then the analyst should hand pick which variables to add in or not.  Since it seems that the author isn't sure which variables should be used, I was suggesting stepwise as a valid way of determining a model.  Your points are valid.  I don't think it's necessarily a misleading answer, and it isn't just a way to the single most influential predictor.  If the top predictors aren't highly correlated, they will still come into the model.  Your example, while valid, isn't necessarily what will happen with the top predictors.  Also, if there are a number of highly correlated variables, the model becomes overly complicated with redundant information.  Simplicity and predictability usually come at a trade-off and the "art" part of building models definitely comes into play.

Anyways, I was just suggesting an option for the author.  If they really just want to see the predictive power of each independent variable against the dependent variable, regardless of interaction of the independent variables, then yes, a fully-saturated regression model will give the p-values for all variables and then the analyst can perhaps "cherry-pick" variables based on that info and their business knowledge.

WC

Commented:
Hello Rich and WarCrimes,

Thank you for the depth of information.  I am looking for a group of predictors, and with my lack of knowledge of regression analysis I have been taking the r-squared value of each individual predictor against the response to find which predictors were "substantial", but I see there are obviously much better and more accurate ways to determine influential predictors and collinearity.  Thanks, I'm going to sift through the info.
Professor

Commented:
You're certainly correct - I just took a different interpretation on the asker wanting to "know a group of most predictive independent variables."  The key is if he wants to know the variables most predictive in his sample or in some unspecified population that sample represents.  If the former, stepwise regression alone with little attention to interpretation would be sufficient.  If the latter, some finessing may be required, and this is where the risk of interpretive error increases.  Frankly, we don't even know if the poster is trying to specify a model at all.  In my experience with amateur statisticians, if you hand them an analysis and say "use this," they will rarely pay much attention to the details.

Also, for the record, I didn't say stepwise was only a way to the single most influential predictor - just that the highest beta weight from a simultaneously entered set of predictors and the first variable a forward-stepping stepwise regression pulled out would be the same variable, based on the way each is calculated.
Professor

Commented:
bs329 - That sounds like a good plan.  :)  Looking at the R^2s does give you a good indicator of how well each predictor is related to your criterion individually, and if there is no predictor intercorrelation, that's a fine approach - but if your predictors intercorrelate, that's when the contingencies we both described begin to apply.
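A quick way to check both things at once is to compute each predictor's individual R^2 against the criterion alongside the correlations among the predictors themselves. The sketch below uses only the Python standard library. The column names, the toy numbers, and the 0.7 flagging threshold are all made up for illustration.

```python
from itertools import combinations
from math import sqrt

def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    ss_u = sum((a - mean_u) ** 2 for a in u)
    ss_v = sum((b - mean_v) ** 2 for b in v)
    return cov / sqrt(ss_u * ss_v)

# Toy data, invented for illustration only
predictors = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 1, 4, 3, 6, 5],
    "x3": [5, 3, 6, 1, 2, 4],
}
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]

# R^2 of each predictor taken on its own against the criterion
individual_r2 = {name: pearson(col, y) ** 2 for name, col in predictors.items()}

# Correlation between every pair of predictors
intercorrelations = {
    (a, b): pearson(predictors[a], predictors[b])
    for a, b in combinations(predictors, 2)
}

# Arbitrary cut-off for "substantially intercorrelated" (an assumption)
flagged = [pair for pair, r in intercorrelations.items() if abs(r) > 0.7]

print(individual_r2)
print(intercorrelations)
print("pairs to look at more closely:", flagged)
```

If the flagged list is empty, ranking by individual R^2 is reasonable. If it is not, the individual R^2 values of the flagged predictors overlap and should not be read as independent contributions.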

Let us know what you find!
Senior Risk Manager

Commented:
richdiesal said:
In my experience with amateur statisticians, if you hand them an analysis and say "use this," they will rarely pay much attention to the details.

--------

Isn't that the truth.  ;)

Commented:
Hello guys, I read through the material and some other references I found.  I am currently working on a large set of data (thousands of rows, tens of regressors, and one response variable) in Excel weekly.  Is there a tool out there where I can take all this data and run it through stepwise regression?  I'd like to find an add-in where I can automate all this given some alpha cut-off.  Then, with the model that results from the stepwise regression, I want to compare the response variable to the model value (by plugging the predictors into the model for each response).

thanks,
Professor

Commented:
There is not any fully automated tool to do so that I am aware of.  You could potentially program your own scripts to do what you want (SPSS for example can run Python scripts), but that might take longer than just running it yourself.
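If you did go the route of writing your own script, a rough sketch of forward selection in Python might look like the following. It assumes NumPy and is not what SPSS runs internally. Note one substitution: instead of an alpha cut-off on p-values it uses an F-to-enter threshold, which avoids needing an F-distribution routine. A threshold near 4 corresponds roughly to alpha = .05 when there are many rows. The names `rss` and `forward_select` and the synthetic data are hypothetical.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an ordinary least squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def forward_select(X, y, f_to_enter=4.0):
    """Greedy forward selection; returns chosen column indices in entry order."""
    n, p = X.shape
    selected = []
    remaining = list(range(p))
    current_rss = float(((y - y.mean()) ** 2).sum())  # intercept-only model
    while remaining:
        best = None
        for j in remaining:
            cols = selected + [j]
            new_rss = rss(X[:, cols], y)
            df_resid = n - len(cols) - 1
            # Partial F statistic for adding predictor j to the current model
            f_stat = (current_rss - new_rss) / (new_rss / df_resid)
            if best is None or f_stat > best[1]:
                best = (j, f_stat, new_rss)
        if best[1] < f_to_enter:
            break  # no remaining predictor adds enough; stop
        selected.append(best[0])
        remaining.remove(best[0])
        current_rss = best[2]
    return selected

# Synthetic data, invented for illustration: only columns 2 and 4 matter
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 2] + 1.5 * X[:, 4] + rng.normal(size=300)

selected = forward_select(X, y)
print("entry order:", selected)
```

This only adds predictors and never removes one, so it is forward selection rather than full stepwise. It also inherits every caveat raised earlier in the thread about intercorrelated predictors and capitalizing on chance, so a pure-noise column can occasionally slip in.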

The middle ground is using scripting within the stats software itself.  If you were able to conduct your stepwise regression in SPSS, for example, you can just hit the "Paste" button to automatically create code (called syntax, in SPSS-speak) that will recreate your analysis, which you can reuse later without having to go through the interface, which I believe meets your "add-in where I can automate all this given some alpha" criterion.

I'm not 100% on what you mean by "compare the response variable to the model value" but if you're trying to compute residuals to measure the accuracy of your final regression for each case, that can be done automatically in SPSS as well (and in syntax).
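For reference, the residual calculation itself is simple enough to sketch in Python with NumPy. This is an illustration under assumed data, not SPSS output. The function name `fit_and_residuals` and the five-row dataset are made up.

```python
import numpy as np

def fit_and_residuals(X, y):
    """Fit OLS with an intercept; return coefficients, model values, residuals."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    predicted = A @ coef       # the "model value" for each row
    residuals = y - predicted  # observed response minus model value
    return coef, predicted, residuals

# Tiny invented dataset: two predictors, five cases
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
y = np.array([9.1, 7.9, 19.2, 17.8, 26.0])

coef, predicted, residuals = fit_and_residuals(X, y)
rmse = float(np.sqrt(np.mean(residuals ** 2)))

print("coefficients (intercept first):", np.round(coef, 3))
print("residuals:", np.round(residuals, 3))
print("RMSE:", round(rmse, 3))
```

Each residual is the gap between the observed response and the value the equation predicts for that row, so a per-row comparison and an overall accuracy figure such as RMSE both come from the same array.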

Commented:
Sorry for the late grade.  Thanks for the help.  I'm going to implement stepwise regression and use SPSS to compute residuals.
