Solved

Understanding sales data

Posted on 2011-09-04
30
502 Views
Last Modified: 2013-11-13
Hi Experts,

Please look at the attached data file. Column 1 is if a customer purchased extended warranty with their purchase 1=yes 0 = no.  Column 2 is the amount of money the customer spent on their purchase. I would like to know based upon how much a customer spends what is the percentage chance of them purchasing extended warranty?  data.txt
0
Comment
Question by:DColin
  • 12
  • 12
  • 3
  • +1
30 Comments
 
LVL 9

Expert Comment

by:richdiesal
ID: 36481604
Hi Colin - this is a fairly straightforward application of linear regression, which summarizes the relationship in terms of the formula for a line (y = bx + a).  Since you want to predict whether or not they purchased a warranty, that should be your criterion (DV), while money spent will be your predictor (IV).  
0
 

Author Comment

by:DColin
ID: 36481670
Hi richdiesal:

I should have mentioned that I know nothing about statistics so you will need to dumb this down for me. One thing I notice about your reply is you suggest the answer is linear. It would seem to me, that the answer can not be linear because when a customer spends 0 money there is a 0% chance of them purchasing a warranty, but there can never be a 100% chance they will purchase a warranty.
0
 
LVL 9

Accepted Solution

by:
richdiesal earned 250 total points
ID: 36481734
Ahh, that's a good point.  You probably want logistic regression to model the outcome as a probability.

Unfortunately for your understanding of statistics, linear regression is the "easier" of these two - logistic regression is much more complicated than linear regression.

The basic premise behind logistic regression is that you are trying to predict a probabilistic outcome (0% to 100%) from a continuous predictor.  It creates graphs like the one here: http://en.wikipedia.org/wiki/Logistic_regression - which allows you to predict the probability of your outcome at any given level of your predictor.

You'd need a statistical analysis program to conduct this kind of analysis - SPSS, SAS, R, Minitab, for example.
0
 
LVL 27

Assisted Solution

by:aburr
aburr earned 150 total points
ID: 36481756
I do not see a linear relationship between anything
You can group you purcahse price into groups (as few as possible) ie 0 to 50 is group one 50 to 100 is group two etc.
for each group get the percentage of warrenty purchases.
I would suggest $50 or $100 intervals
(note you already have the $1 interval and it is not too useful. neither is a $500 interval)
0
 
LVL 9

Expert Comment

by:richdiesal
ID: 36481792
@aburr - logistic regression is not linear.  The sort of grouping you are suggesting drops a lot of information relevant to his question, although it is certainly a simpler approach.
0
 

Author Comment

by:DColin
ID: 36481872
I would prefer the most accurate solution over the simplest.
0
 
LVL 18

Assisted Solution

by:deighton
deighton earned 100 total points
ID: 36484567
you could use least squares fit, the formula would only make sense over the range you have data for,

I make least squares formula to be

0.461222605 + 0.000063416 * price

so higher price gives slightly higher probability


data.xls
0
 
LVL 9

Expert Comment

by:richdiesal
ID: 36484593
@Deighton - that is linear regression, which he has already stated he did not want to use.

Although to be fair, the relationship being modeled is very weak - there is not much relationship between the two variables to begin with.
0
 

Author Comment

by:DColin
ID: 36485092
I have carried out a logistic regression of the data and charted the results with excel. The formula created is 0.4371 + 0.0002 * price which seems to be different from deighton's findings. Is what I have done accurate? data.xlsx
0
 
LVL 27

Expert Comment

by:aburr
ID: 36485199
I am confused.
"that is linear regression"
I cannot see how a linear regression can possible give any useful information.
exactly what are the two variables which are used in the regression?
buy warranty on vertical axis and price on horizontal?
The y variable is not continuous hence the basic requirements for a least squares regression are not met.
It seems to me that the logistic regression at least has possibilities but I do not know if it has been done correctly because the steps used have not been given and the file does not load on my computer at least.
I think my grouping suggestion has merit It has been suggested that it drops information, which is not true.
The answers it does give depends on the details of how the groups are formed. Thus any output information it gives depends on the exact method but that is comparable with the reproducability of the input information (which is not great).
0
 
LVL 9

Expert Comment

by:richdiesal
ID: 36485200
Colin - Deighton ran a linear regression.  A logistic regression will produce a curve.  So the coefficients that each produce will be different.

The formula to produce a logistic curve is no longer y = bx + a.  If you conducted a logistic regression, you'll need to create your predicted line differently.  

Here is some detail on graphing logistic regression results: http://faculty.vassar.edu/lowry/logreg1.html
0
 

Author Comment

by:DColin
ID: 36485257
I ran a logistic regression on my data. This created the logistic output data in the attached txt file.

Col 1 Money spent.
Col 2 If a warranty was purchased.
Col 3 Logistic probability.

When these probabilities are plotted in the attached xls file.

Y Axis Logistic probability
X Axis Money spent

The result seems to be very linear. Am I doing something wrong?

 data-output-logistic-acc.txt data.xls
0
 
LVL 9

Expert Comment

by:richdiesal
ID: 36485297
If those are odds in the second column of your output, you are doing it correctly.  Note that logistic regression typically outputs log odds rather than actual odds - you need to figure out what that column is in order to graph is correctly.  If they are log odds, you'll need to convert them back to odds to graph it informatively.

But again, the relationship between the two variables is very weak.  That means you're not going to see a lot of curvature (for logistic) or slant (for linear) regardless.
0
 

Author Comment

by:DColin
ID: 36485348
I have realised that the relationship between the vars is weak.

How do I discover if my odds column are log odds or odds. Would it be possible to enter contrived data to see. If so what would be an example of the data I would need to input?
0
 
LVL 9

Expert Comment

by:richdiesal
ID: 36485394
You should be able to check the output or documentation from whatever program you used - usually you have data labels at the top of your columns.  I am not sure why they are missing from yours.
0
Why You Should Analyze Threat Actor TTPs

After years of analyzing threat actor behavior, it’s become clear that at any given time there are specific tactics, techniques, and procedures (TTPs) that are particularly prevalent. By analyzing and understanding these TTPs, you can dramatically enhance your security program.

 

Author Comment

by:DColin
ID: 36485550
I am using this website http://statpages.org/logistic.html if you click the solve button it will show all the column headings.
0
 

Author Comment

by:DColin
ID: 36485581
I have looked at all the column header data and it might as well be in a foreign language, because I have no idea what it all means.
0
 
LVL 9

Expert Comment

by:richdiesal
ID: 36485626
Based on the examples they give, it seems these are actual probabilities.  
0
 

Author Comment

by:DColin
ID: 36485701
If they are actual probabilities then plotting x as money spent against Y as probability is appropriate. It seems to give a very linear plot. I understand that the relationship between amount spent and warranty purchased is weak but do you think what I have done is accurate, or could it be done better?
0
 
LVL 9

Expert Comment

by:richdiesal
ID: 36485808
Your graph is as accurate an estimate of the relationship as is possible, given your data.  The relationship is quite weak - only about 2% of the variability in warranty purchases can be explained by the variability in money spent.
0
 

Author Comment

by:DColin
ID: 36486253
richdiesal,

What does your low variability mean with ref. to my findings?

I have tried aburr's data bin method and I get near identical results as I do with the logistic method, see attached.
data2.xls
0
 
LVL 9

Expert Comment

by:richdiesal
ID: 36486372
It simply means that sales do not predict warranty purchases strongly.

The two analytic approaches produce similar results because the relationship is so weak.  Virtually any visualization approach for this data will produce a nearly flat line.

I am concerned that you are trying to find an analytic approach for your data that will show there is a relationship; this is clearly not the case.  People who spend more money are only slightly more probable to purchase a warranty.

This does not seem at odds with academic research in this area.  There are many other influences on extended warranty purchase that are likely to have larger effects: http://www.tepper.cmu.edu/facultyAdmin/upload/ppaper_67861958760495_ESC.pdf
0
 
LVL 18

Expert Comment

by:deighton
ID: 36487137
I know reasons were given not to use linear regression, but my point is that the linear regression model is valid over the range 0-500.  Buying a warranty is either 0 or 1,

you can see from by line of fit to the data

prob = 0.461222605 + 0.000063416 * price

that the basic probability is about .46

The effect of price is minimal, or even nothing considering sampling error.

at 500 the prob is still .46



0
 

Author Comment

by:DColin
ID: 36488344
Hi deighton,

I notice that for a $250 purchase the logistic method predicts a 48.71% chance of warranty purchase, the data bin method predicts a 48.54% chance and your method predicts a 47.71% chance. The first two methods seem very close but your method not so. Would you say that your method is more accurate than the first two?

Hi richdiesal,

I understand the relationship between the two data sets is weak but that is only one store. I have the figures for 21 stores In every case there is a greater % purchase of warranty v price. In some stores the difference between % chance of a warranty purchase is over ten% from low and high prices. We chose this set of data because it is the store where we have the most general sales information available. What is concerning me a the moment is that deighton's method disagrees with my two attempts and he clearly knows more about this subject than I. Are my methods flawed?
0
 
LVL 9

Expert Comment

by:richdiesal
ID: 36489515
They are simply different analytic techniques, based on differently underlying assumptions about the data.  Logistic regression assumes you want to model the outcome as log odds (a linear relationship with log odds produces a curve when converted back to probabilities), while linear regression assumes you want to model the outcome more straightforwardly.  It does not mean either is correct or incorrect, and there is no way to mathematically prove either is "more" correct.  It is entirely based on reasoning through the various contextual issues related to the data's creation and current structure.

You should not expect statistical conclusions to be quite as accurate as I think you are assuming - even the analyses we've discussed are based on a pile of assumptions we haven't even talked about, e.g. homogeneity of variance, normally distributed data, etc.  We haven't even gotten into issues of reliability and validity, which add a new layer of complexity on top of this.
0
 

Author Comment

by:DColin
ID: 36489724
Hi richdiesal,

Why does aburr's suggested linear method disagree with deighton's. And why does the logistic method agree with aburr's.

Or put another way.

If someone was to ask you what are the odds of a person purchasing a warranty on a $250 item. And then give you the data from 5000 $250 purchases to test your answer what would you say?

My attempt at logistic says 48.71
My attempt at data bins says 48.54
deighton's method says 47.71
0
 
LVL 27

Expert Comment

by:aburr
ID: 36490356
"My attempt at logistic says 48.71
My attempt at data bins says 48.54
deighton's method says 47.71"
"what would you say?"

I would say that, within the reliability of the data and the use to which the data would be placed,
They are all the same
0
 
LVL 9

Expert Comment

by:richdiesal
ID: 36492298
Agreed with aburr.  If you'd like to determine the accuracy level of each method, you could look at the standard errors.  But I'd be surprised if there is less than a 5% band of error on any of these approaches.

Personally, if I were asked, I'd say 48.71, because I believe the assumptions underlying logistic regression best reflect the data.  But another statistician might disagree (and apparently, several do).
0
 

Author Closing Comment

by:DColin
ID: 36494043
Thanks to all for being patient with someone with little to no knowledge of statistics.
0
 
LVL 18

Expert Comment

by:deighton
ID: 36494264
from a body of raw statistics, the 3 values were pretty similar, basically we have all shown that the probability is a around 50%, or a little under 50%.  I doubt if any of the figures should be quoted to 1 decimal place, and not to 2 decimal places.

There doesn't seem to be much relation between price and likelihood of buying a warranty, I don't think it is meaningful to talk of anyting like 'the probability increases from 40.2 to 40.5' over any prive increase
0

Featured Post

Highfive Gives IT Their Time Back

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

Join & Write a Comment

Before You Read The Article Please make sure you understand these two concepts: Variable Scope (http://www.php.net/manual/en/language.variables.scope.php) and Property Visibility (http://www.php.net/manual/en/language.oop5.visibility.php).  And to …
The CRUD Functions CRUD, meaning "Create, Read, Update, Delete (http://en.wikipedia.org/wiki/Create,_read,_update_and_delete)" is a common term to data base developers.  It describes the essential functions of data base table maintenance.  This art…
This video discusses moving either the default database or any database to a new volume.
This video demonstrates how to create an example email signature rule for a department in a company using CodeTwo Exchange Rules. The signature will be inserted beneath users' latest emails in conversations and will be displayed in users' Sent Items…

758 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

21 Experts available now in Live!

Get 1:1 Help Now