Understanding sales data

DColin used Ask the Experts™
Hi Experts,

Please look at the attached data file. Column 1 is if a customer purchased extended warranty with their purchase 1=yes 0 = no.  Column 2 is the amount of money the customer spent on their purchase. I would like to know based upon how much a customer spends what is the percentage chance of them purchasing extended warranty?  data.txt
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
Hi Colin - this is a fairly straightforward application of linear regression, which summarizes the relationship in terms of the formula for a line (y = bx + a).  Since you want to predict whether or not they purchased a warranty, that should be your criterion (DV), while money spent will be your predictor (IV).  


Hi richdiesal:

I should have mentioned that I know nothing about statistics so you will need to dumb this down for me. One thing I notice about your reply is you suggest the answer is linear. It would seem to me, that the answer can not be linear because when a customer spends 0 money there is a 0% chance of them purchasing a warranty, but there can never be a 100% chance they will purchase a warranty.
Ahh, that's a good point.  You probably want logistic regression to model the outcome as a probability.

Unfortunately for your understanding of statistics, linear regression is the "easier" of these two - logistic regression is much more complicated than linear regression.

The basic premise behind logistic regression is that you are trying to predict a probabilistic outcome (0% to 100%) from a continuous predictor.  It creates graphs like the one here: http://en.wikipedia.org/wiki/Logistic_regression - which allows you to predict the probability of your outcome at any given level of your predictor.

You'd need a statistical analysis program to conduct this kind of analysis - SPSS, SAS, R, Minitab, for example.
Rowby Goren Makes an Impact on Screen and Online

Learn about longtime user Rowby Goren and his great contributions to the site. We explore his method for posing questions that are likely to yield a solution, and take a look at how his career transformed from a Hollywood writer to a website entrepreneur.

I do not see a linear relationship between anything
You can group you purcahse price into groups (as few as possible) ie 0 to 50 is group one 50 to 100 is group two etc.
for each group get the percentage of warrenty purchases.
I would suggest $50 or $100 intervals
(note you already have the $1 interval and it is not too useful. neither is a $500 interval)
@aburr - logistic regression is not linear.  The sort of grouping you are suggesting drops a lot of information relevant to his question, although it is certainly a simpler approach.


I would prefer the most accurate solution over the simplest.
you could use least squares fit, the formula would only make sense over the range you have data for,

I make least squares formula to be

0.461222605 + 0.000063416 * price

so higher price gives slightly higher probability

@Deighton - that is linear regression, which he has already stated he did not want to use.

Although to be fair, the relationship being modeled is very weak - there is not much relationship between the two variables to begin with.


I have carried out a logistic regression of the data and charted the results with excel. The formula created is 0.4371 + 0.0002 * price which seems to be different from deighton's findings. Is what I have done accurate? data.xlsx
I am confused.
"that is linear regression"
I cannot see how a linear regression can possible give any useful information.
exactly what are the two variables which are used in the regression?
buy warranty on vertical axis and price on horizontal?
The y variable is not continuous hence the basic requirements for a least squares regression are not met.
It seems to me that the logistic regression at least has possibilities but I do not know if it has been done correctly because the steps used have not been given and the file does not load on my computer at least.
I think my grouping suggestion has merit It has been suggested that it drops information, which is not true.
The answers it does give depends on the details of how the groups are formed. Thus any output information it gives depends on the exact method but that is comparable with the reproducability of the input information (which is not great).
Colin - Deighton ran a linear regression.  A logistic regression will produce a curve.  So the coefficients that each produce will be different.

The formula to produce a logistic curve is no longer y = bx + a.  If you conducted a logistic regression, you'll need to create your predicted line differently.  

Here is some detail on graphing logistic regression results: http://faculty.vassar.edu/lowry/logreg1.html


I ran a logistic regression on my data. This created the logistic output data in the attached txt file.

Col 1 Money spent.
Col 2 If a warranty was purchased.
Col 3 Logistic probability.

When these probabilities are plotted in the attached xls file.

Y Axis Logistic probability
X Axis Money spent

The result seems to be very linear. Am I doing something wrong?

 data-output-logistic-acc.txt data.xls
If those are odds in the second column of your output, you are doing it correctly.  Note that logistic regression typically outputs log odds rather than actual odds - you need to figure out what that column is in order to graph is correctly.  If they are log odds, you'll need to convert them back to odds to graph it informatively.

But again, the relationship between the two variables is very weak.  That means you're not going to see a lot of curvature (for logistic) or slant (for linear) regardless.


I have realised that the relationship between the vars is weak.

How do I discover if my odds column are log odds or odds. Would it be possible to enter contrived data to see. If so what would be an example of the data I would need to input?
You should be able to check the output or documentation from whatever program you used - usually you have data labels at the top of your columns.  I am not sure why they are missing from yours.


I am using this website http://statpages.org/logistic.html if you click the solve button it will show all the column headings.


I have looked at all the column header data and it might as well be in a foreign language, because I have no idea what it all means.
Based on the examples they give, it seems these are actual probabilities.  


If they are actual probabilities then plotting x as money spent against Y as probability is appropriate. It seems to give a very linear plot. I understand that the relationship between amount spent and warranty purchased is weak but do you think what I have done is accurate, or could it be done better?
Your graph is as accurate an estimate of the relationship as is possible, given your data.  The relationship is quite weak - only about 2% of the variability in warranty purchases can be explained by the variability in money spent.



What does your low variability mean with ref. to my findings?

I have tried aburr's data bin method and I get near identical results as I do with the logistic method, see attached.
It simply means that sales do not predict warranty purchases strongly.

The two analytic approaches produce similar results because the relationship is so weak.  Virtually any visualization approach for this data will produce a nearly flat line.

I am concerned that you are trying to find an analytic approach for your data that will show there is a relationship; this is clearly not the case.  People who spend more money are only slightly more probable to purchase a warranty.

This does not seem at odds with academic research in this area.  There are many other influences on extended warranty purchase that are likely to have larger effects: http://www.tepper.cmu.edu/facultyAdmin/upload/ppaper_67861958760495_ESC.pdf
I know reasons were given not to use linear regression, but my point is that the linear regression model is valid over the range 0-500.  Buying a warranty is either 0 or 1,

you can see from by line of fit to the data

prob = 0.461222605 + 0.000063416 * price

that the basic probability is about .46

The effect of price is minimal, or even nothing considering sampling error.

at 500 the prob is still .46


Hi deighton,

I notice that for a $250 purchase the logistic method predicts a 48.71% chance of warranty purchase, the data bin method predicts a 48.54% chance and your method predicts a 47.71% chance. The first two methods seem very close but your method not so. Would you say that your method is more accurate than the first two?

Hi richdiesal,

I understand the relationship between the two data sets is weak but that is only one store. I have the figures for 21 stores In every case there is a greater % purchase of warranty v price. In some stores the difference between % chance of a warranty purchase is over ten% from low and high prices. We chose this set of data because it is the store where we have the most general sales information available. What is concerning me a the moment is that deighton's method disagrees with my two attempts and he clearly knows more about this subject than I. Are my methods flawed?
They are simply different analytic techniques, based on differently underlying assumptions about the data.  Logistic regression assumes you want to model the outcome as log odds (a linear relationship with log odds produces a curve when converted back to probabilities), while linear regression assumes you want to model the outcome more straightforwardly.  It does not mean either is correct or incorrect, and there is no way to mathematically prove either is "more" correct.  It is entirely based on reasoning through the various contextual issues related to the data's creation and current structure.

You should not expect statistical conclusions to be quite as accurate as I think you are assuming - even the analyses we've discussed are based on a pile of assumptions we haven't even talked about, e.g. homogeneity of variance, normally distributed data, etc.  We haven't even gotten into issues of reliability and validity, which add a new layer of complexity on top of this.


Hi richdiesal,

Why does aburr's suggested linear method disagree with deighton's. And why does the logistic method agree with aburr's.

Or put another way.

If someone was to ask you what are the odds of a person purchasing a warranty on a $250 item. And then give you the data from 5000 $250 purchases to test your answer what would you say?

My attempt at logistic says 48.71
My attempt at data bins says 48.54
deighton's method says 47.71
"My attempt at logistic says 48.71
My attempt at data bins says 48.54
deighton's method says 47.71"
"what would you say?"

I would say that, within the reliability of the data and the use to which the data would be placed,
They are all the same
Agreed with aburr.  If you'd like to determine the accuracy level of each method, you could look at the standard errors.  But I'd be surprised if there is less than a 5% band of error on any of these approaches.

Personally, if I were asked, I'd say 48.71, because I believe the assumptions underlying logistic regression best reflect the data.  But another statistician might disagree (and apparently, several do).


Thanks to all for being patient with someone with little to no knowledge of statistics.
from a body of raw statistics, the 3 values were pretty similar, basically we have all shown that the probability is a around 50%, or a little under 50%.  I doubt if any of the figures should be quoted to 1 decimal place, and not to 2 decimal places.

There doesn't seem to be much relation between price and likelihood of buying a warranty, I don't think it is meaningful to talk of anyting like 'the probability increases from 40.2 to 40.5' over any prive increase

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial