On a scale of 1 to 10, how would you rate our Product?
Many of us have answered that question time and time again. But only a few of us have had the pleasure of receiving a stack of the filled out surveys and being asked to do something with them. What could we try?
Certainly there are statistical methods to treat this sort of data, right? Perhaps a correlation study in order to find which question has the most influence on the overall satisfaction score? The sales manager has noticed that the overall score is 7.89, closest to that of satisfaction with his own department, scoring 7.97. That seems meaningful: doesn't it prove that sales is the most important factor in client satisfaction?
When searching for answers on the Internet, there doesn't seem to be any good introduction to methodology. Naturally, dozens of sophisticated methods are described at length and in the most painful detail, but it's never clear when and how to use them. And some words used on those pages are almost scary.
This article will show that the search for statistical methods is often misguided. One shouldn't look for parametric tests, but look for techniques of descriptive statistics and -- perhaps -- non-parametric tests. Furthermore, any useful interpretation will require more psychology and good sense than statistical expertise.
The first part describes what most people are likely to try: parametric statistics with little or no results. The second part shows that you don't need any special training or any sophisticated methods: the data will be interpreted using high-school arithmetic and good sense. The last part shows what a statistician can be asked to do with survey data -- surprisingly it tells more about the questions and the survey methodology than it does about the customers...
It is possible to do all of the calculations using a simple spreadsheet application, with the help of some on-line matrix calculator when needed. A specialised statistical tool is preferable, but not required. However, the methods themselves are not the topic of this article: it isn't a tutorial. It focuses instead on concepts and scientific methodology.
All links in the text are to the relevant Wikipedia pages, which describe the methods, perhaps in too much detail. The contents aren't always ideal as a first introduction. At least the underlining signals the first use of any technical expression.
Finally, this article is rather long. Take your time, or skim through it first before digging into it.
We have obtained the first stack of surveys. Each contains six questions, and there is room for suggestions and comments. These will be treated separately; we are only looking at the "scores". Since we will be talking a lot about these numbers, let's give them abbreviations. Each question starts with the familiar "on a scale of 1 to 10, how would you rate...".
PROD ... our Product?
TECH ... the Technical Support?
SALE ... our Sales Department?
PRIX ... the Price of our Product [compared to ...]?
P.R. ... our Problem Resolution Process?
RECO ... would you recommend our Product?
If you like, please take some time to import it into your favourite spreadsheet software, database, or statistical package, and play with the numbers a little. Unless you have some special training or experience in the field, they will not start talking until you find the right point of view. But you can "read" the surveys individually, and try to imagine the frame of mind of the customer who chose any given set of answers.
1. Parametric Statistics
what we might have learned in school
Exploring the numbers
The first thing we are likely to do is to calculate some aggregate numbers, to get an overview of the data. The table below is obtained through "Descriptive Statistics" from the "Analysis ToolPak" add-in in Excel, and is typical for the sort of "first approach" calculations everybody tries.
The first rows confirm the nature of the data: scores are numbers from 1 to 10. The answers 1 and 10 themselves have been used at least once for each question. Many answers are missing, customers apparently didn't always have something to say about the sales department or the price.
The "largest / smallest 300" is obtained by sorting the surveys by a score, and looking at the sheet at the 300th position, from the top and the bottom of the pile. For example, 300 customers scored PRIX at 6 or below, and 300 others at 8 and above. The median is similar: it shows the score on the middle sheet, in a sorted pile. Similar statistics would be quartiles: the values on sheets around positions 1/4, 1/2 (the median), and 3/4.
The mode is simply the most frequent answer. It is a very useful statistic, in that it doesn't even require sorted categories. In the figure below, the true mode is "no answer", because that is the most frequent category, and eight is the mode among the answers. For a continuous distribution, the mode is the highest point of the curve: the orange curve has a mode around seven.
The next row indicates the mean (what an ill-chosen name!) which is "meaningful" only when the distribution follows a normal distribution, or at the very least a symmetrical distribution. The number 7.72 is meant here as an estimate of the true mean of the underlying distribution, and is thus accompanied by a standard error.
We have no such hypothesis, so the older name average (or technically the arithmetic mean) should be used -- if only to avoid even looking at the standard deviation and similar estimators. Here, the average is the sum divided by the count, which we already have. When the average, median, and mode appear in that increasing order, it indicates an asymmetrical distribution, skewed to the left (negative skewness): the bulk of the answers sits at the high end, with a long tail toward the low scores.
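As a small illustration of these first-approach numbers, here is a sketch in Python using only the standard library. The list of answers is made up for the example (it is not the survey data), and None is my own convention for "no answer".

```python
from collections import Counter
from statistics import mean, median

# Hypothetical answers to one question; None stands for "no answer".
answers = [8, 7, 10, None, 8, 5, 3, 8, 9, None,
           7, 1, 8, 6, 10, 5, 7, 8, None, 4]

# Keep only the real scores, and count what is missing.
scores = [a for a in answers if a is not None]
count = len(scores)
missing = len(answers) - count

average = mean(scores)        # the sum divided by the count
middle = median(scores)       # the score on the middle sheet of a sorted pile
most_frequent, occurrences = Counter(scores).most_common(1)[0]  # the mode

print(f"answers: {count}, missing: {missing}")
print(f"average: {average:.2f}, median: {middle}, mode: {most_frequent}")

# average < median < mode: the long tail lies on the low-score side.
long_tail_on_low_scores = average < middle < most_frequent
print("long tail on the low scores:", long_tail_on_low_scores)
```

The point of the sketch is only that the three "centres" need not agree, and that their order already says something about the shape of the distribution.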
The figure shows the distribution of PRIX ("how would you rate the price?"). The mode, 8, is immediately visible; the median is lower, 7, and the average lower still, 6.90. The overlaid bell-shaped curve is the normal distribution, showing how the customers would have answered if they all "meant" to answer 6.90 (some "objective" or consensual rating of the price), compounded by individual differences in opinion and other errors ("deviations" from the standard) worth 2.30 points. This idea is clearly absurd.
The calculated values below the mean all have a very precise definition and meaning, but only if we have some hypothesis on the underlying distribution function. Since we don't, the numbers are mostly meaningless, including the popular mean and standard deviation.
The fundamental concept of parametric statistics is that the numbers under study follow certain laws, perhaps with some "errors" or the effect of additional unknown factors, and that these laws are mathematical expressions using a small number of parameters. For example, if the scores were normally distributed, the parameters of the normal law would be interesting: the mean and the standard deviation. In that case, we could use the average (the sample mean) as the best estimator of the true mean. Without that assumption, the mean is a rather uninteresting and quite possibly misleading number.
Looking for Correlations
For the same reason, the lack of an assumed or observed distribution function in the answers and thus of distribution parameters, all parametric tests will be meaningless, along with many promising numbers that we can calculate.
Let's chart TECH against RECO on an X/Y scatter plot, in an attempt to find and ideally to measure the correlation between the factors. The trivial chart is disappointing: it shows one dot for every existing combination of answers, but not how many there are. We will solve that below. For the moment, there is the intriguing option to add a linear trend-line, with a measure of the correlation. Such a line is the representation of a linear regression. It computes the best linear equation to predict RECO knowing TECH (technically the equation for which the sum of the squared errors of the prediction is minimised):
RECO = 0.58 × TECH + 3.10 ± 2.21 (r²: .190)
For example, if a customer rated TECH at 6, we should expect a RECO between 4.4 and 8.8 in about two thirds of the surveys. That doesn't seem very precise (or useful).
A coefficient of determination (r²) of 1 means perfect correlation, while 0 means absolutely no correlation. So this one is rather loose. The precise number is in fact totally irrelevant, because we haven't formulated the hypothesis that our data should follow a linear model, let alone tested for it.
We can also reverse the idea, and look for the best estimate of TECH knowing RECO:
TECH = 0.32 × RECO + 5.76 ± 1.65 (r²: .190)
If a customer answered 6 to RECO, we could expect a TECH score between 6.1 and 9.4 in the majority of the surveys...
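To make the two regressions concrete, here is a minimal least-squares sketch. The helper least_squares and the ten (TECH, RECO) pairs are hypothetical, chosen for illustration only; the last lines simply redo the arithmetic of the first equation above for TECH = 6.

```python
from statistics import mean

def least_squares(xs, ys):
    """Return (slope, intercept, r_squared) of the line predicting ys from xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx                    # minimises the squared prediction errors
    intercept = my - slope * mx          # the line passes through the averages
    r_squared = sxy ** 2 / (sxx * syy)   # symmetrical in xs and ys
    return slope, intercept, r_squared

# Hypothetical (TECH, RECO) pairs, not the survey data.
tech = [9, 8, 10, 7, 8, 9, 5, 10, 8, 6]
reco = [10, 8, 10, 5, 9, 1, 5, 8, 7, 8]

a1, b1, r2_forward = least_squares(tech, reco)   # RECO knowing TECH
a2, b2, r2_reverse = least_squares(reco, tech)   # TECH knowing RECO
print(f"RECO = {a1:.2f} x TECH + {b1:.2f} (r2: {r2_forward:.3f})")
print(f"TECH = {a2:.2f} x RECO + {b2:.2f} (r2: {r2_reverse:.3f})")

# The article's first equation, applied to TECH = 6.
predicted = 0.58 * 6 + 3.10
low, high = predicted - 2.21, predicted + 2.21
print(f"expected RECO: {predicted:.2f}, between {low:.1f} and {high:.1f}")
```

Both directions give the same r², and the product of the two slopes equals that r². This is why the two lines only coincide when the correlation is perfect.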
The entire idea is in fact flawed for several reasons. Even if there were a more significant correlation, the equations above assume that the variables are measures (a dubious simplification) and that there is a causal relation between them -- in one direction or the other... A non-causal correlation shouldn't be estimated by a linear regression.
The figure shows a bubble plot, the surface of each bubble representing the number of surveys having a particular TECH/RECO combination. This is one way to visualise a table of occurrences. The two lines are the two linear regressions, predicting RECO from TECH and the reverse, crossing at the centre of gravity <8.3, 8.0> (the averages of both distributions). The lines should really be represented as broad stripes (± 2.21 and ± 1.65). The non-causal correlation line (the axis of greatest variation) lies somewhere between them.
For the moment, correlations don't seem very informative, especially if we can't use the scary looking numbers like r². But we can try one more thing: using five variables to predict the sixth. What is the best linear equation to predict RECO knowing the five other scores?
For example, the survey <1, 1, 1, 1, 1, ?> should have a RECO below 1.6, <5, 5, 5, 5, 5, ?> between 3.0 and 6.5, and <10, 10, 10, 10, 10, ?> above 9.1. This looks much better: it's closer to the diagonal (and to our intuition).
What's more, we can see that PROD influences RECO the most, followed by TECH and P.R., while SALE and PRIX have a comparatively weak impact. For example, if only one question scored high, we see that <10, 1, 1, 1, 1, ?> expects a RECO between 1.5 and 5.0, while <1, 1, 10, 1, 1, ?> expects one below 3.1. Is this meaningful at all? Do we even have surveys like these?
There is one survey <1, 4, 10, 4, 1, 1> (expected 2.6), and another <1, 4, 9, 3, 1, 1> (expected 2.3). It seems indeed that SALE has little weight. On the other hand, we find surveys like <10, 8, 1, 3, 6, 2> (not 6.7), <10, 10, 1, 1, 8, 1> (not 7.4), or <10, 10, 1, 8, 9, 10> (in range: above 7.0). Our model isn't any better for PROD than it is for SALE.
The figure shows the estimated RECO for all surveys, based on the five other values, on the X-axis. The Y-axis is the actual score for the same survey. There appears to be some correlation, at least for some actual RECO values, but the score 'one' for instance seems to be highly independent from any linear prediction.
What could we use to examine correlations if we do not imply causality? There must be something better than linear regressions.
The Principal Component Analysis
A competent statistician would notice immediately that the scores are not measures. If he doesn't (and if you don't tell him), he might try the principal component analysis (PCA). It is after all "the simplest of the true eigenvector-based multivariate analyses" (ouch!).
Don't worry, we are not going to perform the analysis. But we need to understand roughly what it does and how we can (perhaps) use the results.
The idea, to someone really familiar with multi-dimensional geometry, is quite simple. Each variable is a coordinate. We have six scores, so we are dealing with a vector space in six dimensions. Nobody can visualise six dimensions, but we can accept that it is a valid mathematical space, and also that the more dimensions we have, the more ways there are to rotate something. The trick is to find the best "view" of the six-dimensional cloud, which means finding the direction in that space for which the variance is the greatest.
In two dimensions, if we have a cloud elongated diagonally, we will suspect there is a correlation. Instead of choosing one variable to predict the other (as is done in linear regression), let's find the line that best fits the cloud of points, symmetrically. This line is called the first axis. The remaining secondary axis, perpendicular to the first, shows whatever variance is left. The PCA would simply tell us to turn the paper by some precisely computed angle.
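For two variables, the angle can be written down directly. The sketch below uses the textbook closed form for the major axis of a two-dimensional cloud. The function names and the ten points are my own, made up for the illustration.

```python
import math
from statistics import mean

def scatter(xs, ys):
    """Sums of squares and cross-products around the centre of gravity."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxx, syy, sxy

def first_axis_angle(xs, ys):
    """Angle (radians) of the direction along which the cloud varies most."""
    sxx, syy, sxy = scatter(xs, ys)
    return 0.5 * math.atan2(2 * sxy, sxx - syy)

def rotate(xs, ys, angle):
    """Coordinates of the centred points along the first and second axes."""
    mx, my = mean(xs), mean(ys)
    c, s = math.cos(angle), math.sin(angle)
    first = [(x - mx) * c + (y - my) * s for x, y in zip(xs, ys)]
    second = [(y - my) * c - (x - mx) * s for x, y in zip(xs, ys)]
    return first, second

# Hypothetical pairs of scores, not the survey data.
xs = [9, 8, 10, 7, 8, 9, 5, 10, 8, 6]
ys = [10, 8, 10, 5, 9, 1, 5, 8, 7, 8]

angle = first_axis_angle(xs, ys)
first, second = rotate(xs, ys, angle)
spread_first = sum(v * v for v in first)     # variation along the first axis
spread_second = sum(v * v for v in second)   # whatever variation is left
print(f"turn the paper by {math.degrees(angle):.1f} degrees")
print(f"share of variation on the first axis: "
      f"{spread_first / (spread_first + spread_second):.0%}")
```

Unlike a regression line, this axis treats both variables symmetrically: swapping xs and ys describes the same line, seen from the other side of the paper.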
Technically, the PCA first computes the matrix of covariances, meaning all covariances between the columns including the variance of each column by itself, and then extracts some properties of that matrix called eigenvalues and eigenvectors. That is the only operation a normal spreadsheet program does not handle.
When this particular matrix is applied to the data points, each receives a new set of coordinates along six new axes, called the components. The data is rotated and stretched in such a way that the first axis shows the largest variance of the entire cloud. The data is then charted using the first two axes or combinations of the major axes, the principal components.
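The sketch below shows the two steps on a tiny, invented set of three-question surveys. To stay within the Python standard library, it finds only the first component, and it does so by power iteration (apply the matrix, renormalise, repeat). This is a simple stand-in for the full eigenvalue extraction a statistical package would perform, not the way such a package does it.

```python
def covariance_matrix(rows):
    """All covariances between columns; the diagonal holds the variances."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    centred = [[r[j] - means[j] for j in range(k)] for r in rows]
    return [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(k)]
            for a in range(k)]

def apply(matrix, vector):
    """Matrix times vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def first_component(cov, iterations=200):
    """Largest eigenvalue and its unit eigenvector, by power iteration."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = apply(cov, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, apply(cov, v)))
    return eigenvalue, v

# Hypothetical surveys with three questions each, not the survey data.
surveys = [[1, 1, 2], [10, 10, 9], [8, 7, 8], [5, 5, 6], [9, 8, 10], [3, 4, 2]]

cov = covariance_matrix(surveys)
eigenvalue, axis = first_component(cov)
total_variance = sum(cov[j][j] for j in range(len(cov)))
share = eigenvalue / total_variance
print("first axis:", [round(x, 2) for x in axis])
print(f"share of the total variance: {share:.0%}")
```

In this made-up data every question rises and falls together, so the first axis has all its coefficients of the same sign: it is an "overall score" axis.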
The hoped for result is that one or two charts will be sufficient to reveal anything interesting in the orientation or clustering of the data.
Along with the cloud of dots (each dot being one survey), we will also get directional vectors for each question. These vectors have a length of one unit, and will thus be enclosed in a one-unit radius circle. If the vector touches the circle, it's fully represented on the selected axes; if it isn't, we need to look at the other axes to interpret that particular question.
This is the graph for axes one and two of the PCA of the survey data. Surveys with missing answers have been eliminated (every survey must be exactly located in six dimensions in order to be displayed).
Wow! There is a pattern! -- but when we look closer, it isn't very interesting. The first axis separates "good" surveys and "bad" surveys. The dot on the far left represents the surveys <1,1,1,1,1,1>, and the dot on the far right the surveys <10,10,10,10,10,10>. In essence, surveys are arranged from left to right according to their "total score".
The second axis segregates the SALE score. Good SALE answers are at the bottom right. The diagonal concentration of dots at the lower right are the surveys with a SALE score of ten. The bands above are the scores nine and eight. The score five is just visible as another diagonal, and the score one is the sparse line of dots along the upper-right.
Nothing really fascinating here. We learned however that SALE shows the lowest covariance with other scores. In other words, the score for SALE is the most difficult to predict when all other scores are known. The answer to SALE is atypical.
The figure above only shows about two thirds of the total variance. For a good interpretation of the remaining variance, axes three to five would be meaningful, but that would be a 3D graph. Let's look at the graph of axes three and four.
The cloud of dots is disappointing: no pattern appears. Had there been one, it could have revealed different types of customers, as aggregates. As it stands, there is not much to learn from the following axes.
The statistician tells us that the vectors displayed in the one-unit radius circle (zoomed in the figure) explain this particular view: customers exceptionally satisfied with the price are placed at the left and slightly to the top; those who were happy with the problem resolution process are at the bottom and slightly to the right; and those who liked the product or benefited from the technical service are at the upper right. He further insists that the next axis, number five, strongly segregates TECH and PROD, so that we should really imagine a tetrahedron, with four distinct directions used to place customer surveys in a three-dimensional space...
In layman's terms, there isn't anything more to say: the score of SALE is less correlated to other questions, as seen in the previous figure, and questions PROD, TECH, PRIX and P.R. have similar weights in customer variability.
In conclusion, PCA isn't a magic wand. There isn't much here that we didn't already know. Agreed, it's pleasant to confirm the "oddness" of SALE graphically, but none of this constitutes proof. We performed no test, there is no measure of correlation or non-correlation, there are no real answers. There are perhaps new ideas.
In the perspective of descriptive statistics (or exploratory statistics) the aim isn't to answer questions, but to collect new ideas by looking at the data in different ways. As such, the PCA isn't totally useless, it's only disappointing.
What have we learned so far?
There is correlation between answers. It is plausible that customers choose to recommend the product more or less in accordance with their general estimation of the product, and less so with its price or their personal experience with a given sales clerk. But that is just an idea, we can't begin to prove it.
The opinion about the sales department seems less correlated to other answers. Again, this isn't something we can measure.
All these numbers, just for that? Yes, and we actually learned more by looking at well constructed graphs than by using the numbers themselves.
The reason for this poor loot is that all the basic techniques deal with normal distributions and similar predictable or inferred probability distributions. Furthermore, they have been created in order to measure, estimate and test rather than to describe and explore.
Formally, the entire exercise was illegitimate. Parametric statistics and PCA are used when the variables are measures, not satisfaction survey answers (with their odd distribution pattern). But that doesn't matter too much: in descriptive statistics, anything goes; the only criterion is usefulness. In this case, however, we didn't learn nearly enough.
2. Direct Interpretation
what anybody can do with the numbers
The Distribution of Satisfaction Scores
To extract meaningful information from the surveys, we need to go back to the basics. The data isn't linear. A satisfaction score is not a measure, it is an answer in a survey. It does have a distribution, but one based on semantics. There will be a mixture of typical behavioural patterns: "I'll put 'one' everywhere just to show them!"; "it was as I would have expected, let's put tens, it wasn't anybody's fault"; "I don't know, I'll put five"; "it was good but not perfect, so nine (I never put ten anyway)"; etc.
Humans are not machines. When asked a question, they answer using complex and subtle (or crude) semantics. Every number has a meaning.
Some people will want to answer 'yes' / 'no' / 'don't know'. This translates as ten, one, and five. Five isn't in the middle, but is perceived to be (nobody knows that the average of numbers 1-10 is 5.5, not 5). So, we expect a peak at these values, adding the meaning "average" to five as well.
Additional confusion occurs when a question can be answered with 'positive', 'negative', 'neutral'. For example, about recommending a product: "I will recommend it" is clearly ten, "... against it" must be one, so five is the 'neutral' answer.
Then there is "good" (not "perfect") and "bad" (not "horrible"). The second is easy, it's three. But the first is problematic, because now people would like to select 7.5. This creates a blurring between seven and eight. Often, eight means "good"; nine means "excellent".
The scores two, four, and six have little appeal and little meaning ("almost horrible", "slightly worse/better than average"). We should expect a deficit there.
Finally, nine often means "very good, but not perfect". Some people will never score ten on principle. So, to summarise:
We have six of these distributions. Let's look at them with that in mind.
It is suddenly much more readable. The low occurrences of two, four, and six are just as expected; the peak at five, and the seven/eight clustering is explained as well. The "no answer" counts have their own pattern, and don't seem to fit in the series.
Scores one and ten behave strangely. They are both a popular choice for SALE and RECO, but not necessarily for the same reasons. SALE=1 might mean "I wasn't offered a coffee" or "he had bad breath", while RECO=1 probably means "I never recommend products".
The PRIX distribution shows higher occurrences for scores three to five, and lower for nine and ten. One explanation would be that the price is too high, but there is another one (see below).
One indication I find very useful is the "ten effect". Notice that SALE and RECO have an "elbow" at score nine. This is partly explained by the fact that nine is globally a less popular answer than eight or ten, but PROD and TECH do not exhibit this effect. For lack of a better word, I will say that the latter are answered objectively, on a "scale of one to ten", while the former are answered subjectively (or emotionally), using the numbers as words to express something -- or to hide something.
Who is being rated?
This second psychological aspect of survey interpretation is perhaps less obvious, but it is essential in order to avoid false conclusions.
If you ask people to rate a product, they can be reasonably objective. Naturally, the survey is biased because we don't have any data on people who are totally uninterested in the product. The same is true for technical support: you get an objective assessment from those who needed their services.
However, most questions involve the subject (and thus become subjective). If asked about the price, some people will express something different: "I'm rich, I don't care about the price, hence ten!"; "I wanted to buy a cheaper present, so it's three"; "I'm not the one who's buying, so it's five". Most people however have compared some prices before buying, and will objectively rank the product among competitors.
Going back to the "ten effect" of questions SALE and RECO, possible meanings for the score ten could be "I came here to buy the product in the first place...", "of course I'd recommend it, I bought it, didn't I?", or "wait 'till Thelma sees this!". In all cases, the question being answered is "were you really all that smart when you bought this product?" The answer is "of course!"
For each question, it's good to ask oneself how objective (how subjective) the answers probably are. One more example: "would you recommend this product?" -- "I never recommend products, hence 'no'" (~6% of the customers answered 'one'...).
In fact, we have some "subjective" questions, and also some "subjective" scores, which could show some correlation. But I'm getting ahead, let's continue now with the straightforward interpretation.
Direct Interpretation of Occurrence Tables
The most useful visualisation up to now has been the graph of marginal distributions. They are called marginal because they are in the margins of double-entry tables called tables of occurrences or contingency tables. They can be interpreted without much knowledge or experience in statistics, and without formulating any hypothesis.
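Building such a table is only a matter of counting. Here is a minimal sketch, with ten invented (PROD, RECO) pairs standing in for the surveys and a hypothetical print_table helper for the layout.

```python
from collections import Counter

# Hypothetical (PROD, RECO) answer pairs, not the survey data.
pairs = [(8, 10), (8, 8), (7, 8), (9, 10), (8, 1),
         (5, 5), (8, 8), (10, 10), (3, 1), (7, 7)]

table = Counter(pairs)                      # one count per (row, column) cell
row_margin = Counter(p for p, _ in pairs)   # marginal distribution of PROD
col_margin = Counter(r for _, r in pairs)   # marginal distribution of RECO
total = len(pairs)

def print_table(table, row_margin, col_margin, total):
    """Print PROD in rows and RECO in columns, with totals in the margins."""
    scores = range(1, 11)
    print("PROD\\RECO" + "".join(f"{c:>4}" for c in scores) + "  total")
    for r in scores:
        cells = "".join(f"{table[(r, c)]:>4}" for c in scores)
        print(f"{r:>9}{cells}{row_margin[r]:>7}")
    print(f"{'total':>9}" + "".join(f"{col_margin[c]:>4}" for c in scores)
          + f"{total:>7}")

print_table(table, row_margin, col_margin, total)
```

The last row and the last column are the marginal distributions: each is what remains of the table when the other question is ignored.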
The main topic will be the relationship between various questions and what is perceived as the principal question "would you recommend our product". This in itself is a huge mistake in methodology, but let's leave that aside for the moment. The following tables all show RECO scores in columns, and the five other variables in rows.
Some cells have been highlighted to illustrate the commentary, not necessarily following any precise rule. (Most often, they isolate the mode(s) or the modal region -- the cells with the highest scores -- either two-dimensionally or in columns.)
Note: The interpretations given below are probably difficult to follow unless you have played a little with the raw data. If this was your own data, you would already know many things about it, including external data not represented in the numbers. Since it isn't, you need some time to get acquainted with them.
PROD -- how would you rate our product?
I have categorised the variables as being respectively "objective" and "subjective". The mode of PROD is thus "objectively" (row) eight, rather consistently across columns. The mode of RECO is (column) ten, but that might be meaningless, especially with a secondary peak at eight and the strong presence of 'one'. In reality, the highest association is in the seven/eight blurring (trying to average 7.5). The true cross-score is simply "good/yes". Also notice the pattern in the first column and the mode of column three. The scoring of PROD is objective even when RECO isn't.
TECH -- how would you rate the Technical Support?
The score is really excellent. 90% of customers were happy, and this is reflected in all columns (all RECO scores). If you remember that some "bad" scores will reflect some other types of problems, the overall result is as good as it gets.
SALE -- how would you rate the Sales Department?
There is a problem. This isn't easy to see, but I suspect that anybody with experience in multi-variate analysis will feel there is something "odd". Both SALE and RECO are subjective questions, with a strong "10 effect". Let's grey out these scores, and notice how unexpected values and peaks occur in many cells. Notice the 15 in 1/1, the 25 in 8/5 and 6 in 8/1, and the general density in the upper right quadrant. Let's add a bubble chart for this one.
There are clearly too many peaks and sub-groups. If we eliminate mentally the "10/10 effect" (doubly subjective), two categories are left: satisfied customers around 8/8, and various pockets of disgruntled customers. Sales dissatisfaction (~20%) is much less correlated to recommendation.
PRIX -- how would you rate the Price of our Product?
Good correlation there, which makes sense. By looking at individual column distributions, we see that the mode follows roughly the diagonal (with insufficient data in the "bad/bad" categories). Price has a big impact (subjectively). Not as much as the product itself perhaps, but that was an "objective" variable.
P.R. -- how would you rate our Problem Resolution Process?
No surprise: large value in "1/1" (problem not solved: "I'm unhappy", "this product is useless") and "good/good", in the broadest sense.
Among the several unexplored combinations of variables, let's take a brief look at two more cross-tables.
PRIX / SALE
Apparently, sales satisfaction does not rely on price primarily. This is visible because the same pattern (or lack of pattern) is observed as in RECO / SALE above. One group of customers is satisfied with the sales department (around row 8), while others populate the upper right quadrant without a clear pattern. Apparently, there is a minority (the ~20%) who are unhappy about something. They are still able (mostly) to objectively rate the price, disregarding their discontent.
PRIX / PROD
As a comparison, this table is "objective vs. objective", and shows no surprise. Both 1/1 and 10/10 are rare, and PROD can still be rated 'good' when the price seems steep...
I hope I have revealed the method. We are simply trying to find in the tables anything unexpected or unexplained, and explain it.
The Chi-Square Distance
There is a method to highlight the unexpected or the "different", but only if there is some hypothesis on the "expected" distribution. For example, if we suspect two variables to be totally uncorrelated, a chi-square test can be used to check that hypothesis. When the test rejects it (as it would in all the tables seen so far) we can look at the cells having the highest cell-chi-square values. A high value means the result is different from the norm, and thus "binds" the row to the column.
The expression chi-square designates in fact a probability distribution function. The idea is that an observed distribution can vary around an expected distribution, but only by so much. If the distance between the distributions (the chi-square) is too large, we must conclude that our expectation was incorrect.
In survey interpretation, we don't have any hypothesis, so we don't have any expected distribution. This means that the mere idea of a chi-square test seems misguided. However, the test uses a distance, and we might make use of that measure even if we do not test anything.
The table on the left is a simple occurrence table of all answers (without the "no answer" counts). If there was no correspondence at all between questions and scores, we would expect for example a count of 39 for the cell PROD/1:
Expected = 205 × 1271 ÷ 6744 = 38.64
The observed count is only 11, and the Chi-Square distance is calculated like this:
Distance = (11 - 38.64)² ÷ 38.64 = 19.77
The colour coding shows whether the observed result is lower or higher than the expected value; the distance is the same but the meaning is opposite. All cell distances are cumulated, and the total Chi-Square value for the table can be used for an independence test. The value is much too high to be random; we already know that there is some correspondence between questions and scores.
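The same calculation can be applied to every cell of a table. The sketch below first redoes the PROD/1 cell with the numbers given above (205, 1271, 6744, and the observed 11), then applies a hypothetical chi_square_cells helper to a small invented table of two rows and three columns.

```python
def chi_square_cells(table):
    """Expected counts, cell distances and signs for a table of occurrences."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected count if rows and columns had nothing to do with each other.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    distance = [[(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
                for obs_row, exp_row in zip(table, expected)]
    # The distance hides the direction, so keep it separately (the colour code).
    sign = [["+" if o > e else "-" if o < e else "=" for o, e in zip(obs_row, exp_row)]
            for obs_row, exp_row in zip(table, expected)]
    return expected, distance, sign

# The PROD/1 cell, with the totals quoted in the text.
expected_prod_1 = 205 * 1271 / 6744
distance_prod_1 = (11 - expected_prod_1) ** 2 / expected_prod_1
print(f"expected {expected_prod_1:.2f}, distance {distance_prod_1:.2f}")

# A small invented table, not the survey data.
observed = [[10, 20, 30],
            [20, 20, 20]]
expected, distance, sign = chi_square_cells(observed)
total_chi_square = sum(sum(row) for row in distance)
for dist_row, sign_row in zip(distance, sign):
    print("  ".join(f"{s}{d:5.2f}" for d, s in zip(dist_row, sign_row)))
print(f"total Chi-Square: {total_chi_square:.2f}")
```

Only the total would be used in an independence test. For reading the table, the individual cell values and their signs are the interesting part.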
The newly calculated numbers allow a new "reading" of the table. The highest Chi-Square values show a deficit in cell PROD/10 and an excess in RECO/10. These are the most significant correspondences and need to be explained first. Then come the high counts of PRIX/5 and PRIX/3, with deficits in PRIX/9 and PRIX/10. PROD/8 and SALE/10 should also be noted, before going into finer details.
These numbers create the strongest bonds between answers and scores, and we already have explanations for them.
The PROD profile is "objective", but it's been collected among customers, who are partially rating their own decision to buy the product. A rating of ten is something you give to your new shoes when you are twelve or a new album when you are sixteen. A rating of one would mean you shouldn't have bought the product.
The profile of PRIX shows large counts at five and below, and corresponding low counts for nine and ten. Customers are less happy about the price than about the product or the services. The total of the column shows that PRIX has the score distribution with the highest distance from the overall distribution.
The high counts of score ten for SALE and RECO are still a bit of a mystery. The explanation is probably semantic or psychological; the counts are in contradiction with those for six to nine. For comparison, TECH has an "honest" high count for the score ten and corresponding low counts for five and one. Row ten has the highest total distance, which means it has a highly unusual distribution across questions.
Let's stop here
By avoiding the pitfalls of "standard deviations" and "coefficients of correlations", and by looking directly at the distributions (with a little good sense and a little background in psychology), we were able to say some things about these surveys.
The product is "good". It has no major flaws, but can exhibit problems. When that happens, the "problem resolution" is satisfactory, and the technical support is excellent (in fact it's brilliant). However, the price, while mostly "correct", is too high for some customers. On the other hand, there is something odd about the sales department, but we can't say what exactly. Most customers would recommend the product (and the service?) to their entourage.
The logical step now would be to give a raise to the technical department and to read the comments on the surveys for the customers having problems with sales.
And that is about it. There isn't much more one can extract from the surveys. Just to be sure, let's ask a good statistician about any "magical wand" we might have overlooked.
3. Descriptive Statistics
what a statistician can be asked to do
Clearly, the surveys contain information. The answers are not measures, so parametric statistics do not work well to interpret them, but there is information. At the very least, we should get some help in interpreting occurrences tables.
The "principal components analysis" (PCA) used a covariance matrix to measure the differences between distributions. This is only valid if the variance itself has some meaning, which isn't the case.
Another less common multi-variate analysis, called the correspondence analysis (CA), uses the chi-square distances instead. It has some useful properties explained below.
To be fair, let's first use this method just like we used the principal components: by cheating. We had used the scores as if they were measures, we'll use them now as if they were counts. Formally, we can imagine that each customer had to grab one or two handfuls of marbles and was asked to distribute them to the "questions". In reality, there are boundaries -- at least one marble per question and at most ten -- but let's pretend it doesn't matter, and just count marbles.
In PCA, the first axis usually reflects overall "size". Remember that all negative surveys were on the left and all positive surveys on the right. In correspondence analysis, the "size" is not used directly -- it is used as "weight". So the first axis is like the second of the previous analysis, showing once more that SALE behaves differently from all other questions. This axis is not represented here. The figure below shows instead the next two axes, two and three.
The similarity with the second figure of the PCA, showing axes three and four, is striking. Although this chart is read and interpreted differently -- there is no central "crystal ball" -- the meaning is exactly the same. Surveys tend to segregate first according to the SALE answer -- not shown here, and then according to the other answers, with no clear priority.
It should be noted again that the analysis could have revealed groups of surveys with contrasting answer patterns. Conversely, if groups existed beforehand (different stores, different variants of the product, different types of customers) the differences between them could have been used in the interpretation.
Using this type of analysis blindly doesn't add anything to the interpretation.
The Correspondence analysis
This method is well known in Anthropology and Natural Sciences (I have seen applications in archaeological typology, literary analysis, sociology, veterinary science, demography, political anthropology, archaeozoology and botany). What makes it so useful is that it doesn't require measures, it works with simple counts. However, it is still relatively new, and often misunderstood.
Much like PCA, Correspondence Analysis (CA) attempts to visualise a multi-dimensional cloud of dots, each representing a count of occurrences. Unlike PCA, which relies on covariance -- implying that the variance itself is meaningful for all variables -- CA uses only a measure of distance between distributions -- whatever that distribution may be -- and a measure of weight.
Let's look again at the graph showing the marginal distributions. The data is that from the chi-square table a few screens above (the chi-square will be used as a measure of distance).
To a statistician, it is quite natural to take also a quick peek at the equivalent representation of the questions distribution for each score. In Excel, this simply means switching between "data in columns" and "data in rows".
Both graphs show exactly the same information (check it), but they "read" differently. The score ten behaves very strangely, even denting nine a little. Six, seven, and eight seem to agree, but five shows a slightly different pattern. One is often out of place, etc.
This time, the questions have no logical order. We could produce several hundred other graphs just by rearranging the questions, each "reading" a bit differently. Luckily, that is one other aspect of CA: it doesn't require ordered categories. Instead, it will produce the most logical ordering from scratch.
Again, this isn't the place to go into technical details, but an overview is needed. The data is first normalised and transformed to frequencies and weights. An average score profile (and an average question profile) is computed, and the distance between each score (or each question) and that average profile is calculated using a statistical measure called the Chi-square distance. In that new multi-dimensional space, eigenvectors are again used to find the direction of the highest "inertia" -- the direction in which individual dots have the highest combined distance and weight. The dots can be represented on a graph, again choosing the axes revealing the largest "inertia" (similar to the largest variance).
In less technical terms, each question has ten frequencies (the frequencies of answers for that question), meaning ten coordinates in space. The eigenvectors are specific directions in that same space, and the eigenvalues indicate how widespread the six questions are along each axis. We can then chart the questions using the new coordinates along these axes, those with the highest spread. Similarly, each score has six frequencies (the frequencies of questions for which this answer has been chosen), meaning again six coordinates in space. This time, they will have different weights (some answers are more frequent than others overall), and this is also taken into account when computing the new coordinates.
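The overview above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions, not the tool used for the figures: the function name and the table of counts are invented, and a singular value decomposition stands in for the eigenvector computation (the squared singular values are the eigenvalues, that is, the inertia of each axis).

```python
import numpy as np


def correspondence_analysis(counts):
    """Simple correspondence analysis of a table of counts.

    Returns the inertia of each axis and the coordinates of the
    rows and of the columns along those axes.
    """
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()                          # frequencies
    r = P.sum(axis=1)                        # row weights
    c = P.sum(axis=0)                        # column weights
    expected = np.outer(r, c)                # frequencies if rows and columns were unrelated
    S = (P - expected) / np.sqrt(expected)   # chi-square scaled differences
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    inertia = sigma ** 2
    row_coords = (U * sigma) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sigma) / np.sqrt(c)[:, None]
    return inertia, row_coords, col_coords


# Invented counts for illustration: 4 scores (rows) by 3 questions (columns).
table = [
    [4, 12, 9],
    [25, 30, 14],
    [40, 35, 30],
    [11, 3, 27],
]
inertia, row_coords, col_coords = correspondence_analysis(table)
share = inertia / inertia.sum()
print(np.round(share, 3))          # share of the total inertia carried by each axis
print(np.round(row_coords[:, :2], 3))  # scores on the first two axes
print(np.round(col_coords[:, :2], 3))  # questions on the same two axes
```

The total inertia equals the chi-square of the table divided by the total count, and the last axis always carries (numerically) no inertia, which is why a table with three columns yields at most two usable axes.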
The numbers used in the computations are the counts of occurrences in each cell. The fact that some of the headers are also numbers, the scores, is of no importance. They can be replaced by colours, for example, or simply by the words "one", "two", etc. If they align or align partially, it will be a reflection of the occurrences, not of the scores themselves.
One important property of the method is that it's symmetrical regarding columns and rows. When the analysis is performed with the occurrences table flipped on its side, the same eigenvalues are obtained, with correlated eigenvectors, and the new axis coordinates are entirely compatible. This means that the same graph will show both column headers and row headers, organised in space.
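The symmetry is easy to check numerically. The sketch below, in Python with NumPy, computes the eigenvalues of an invented table and of the same table flipped on its side; the helper name and the counts are assumptions made for the illustration.

```python
import numpy as np


def principal_inertias(counts):
    """Eigenvalues (inertia per axis) of a correspondence analysis."""
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    return np.linalg.svd(S, compute_uv=False) ** 2


# Invented counts: 4 scores (rows) by 3 questions (columns).
table = np.array([
    [4, 12, 9],
    [25, 30, 14],
    [40, 35, 30],
    [11, 3, 27],
])

by_rows = principal_inertias(table)
by_columns = principal_inertias(table.T)   # the same table flipped on its side
print(np.allclose(by_rows, by_columns))
```

Flipping the table only swaps the roles of rows and columns in the decomposition, so both runs describe the same axes.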
The graph shows the first two axes of the correspondence analysis, totalling 93% of the inertia (or of the "information"). Some finer points become visible in the third axis, but I won't try 3D graphs here. The beauty of this graph is that it interprets questions using scores and scores using questions. You will see that it synthesises all observations we can make on the marginal distribution charts above. It is, in fact, the same information once again, presented differently.
Note: The column "n/a" was not used in this analysis; in other words, it has no "weight" in the figure, it's been placed as an illustrative variable only. The same analysis with the column showed mainly how special its distribution is, with its strong link to SALE and PRIX (something we already know). It made the remaining information hard to read, and since we are mainly interested in the correspondence between actual scores and questions, it was best to discard that information temporarily.
The first surprise is the proximity of both ends: one and ten. This simply means that they are both popular choices for certain questions, namely RECO and SALE. Let's circle counter-clockwise. Two has little associative weight, but we noticed that three and four are found a little more frequently as answers to PRIX, creating the upper right group. The fact that five is "pulled back" towards the right expresses its semantic proximity with one and ten, and also shows that it isn't such a good answer, perhaps less so than four. Finally, the sequence from six to nine, or even from three to nine if five is skipped, is a simple progression from "bad" to "excellent". They found their relative positions from PRIX, P.R., PROD, and TECH, which contain enough objective answers to make this work.
PROD, the Product, is really just "good". The eccentric position is due to the distance from one and ten answers, and also from three and four answers. It was rated fairly and objectively.
TECH, the technical support is "excellent", we knew that already. It serves as a bridge from nine to ten (because ten can also mean "perfect", not just "OK").
SALE, the rating of the sales department, is vastly unknown. Being drawn by both ten and one, all we can say is that opinions seem to be strong. More about that just below.
PRIX, the rating of the price, is much worse than we had found out up to now. Of course, many customers used eight as answer, but not significantly more than for other questions. The proximity to three and four doesn't mean that these were the most frequent answers, but only that this question had much more of these answers than any other question. Although the global rating is "good", this graph shows that a significant minority feels the price is too high.
P.R., the problem resolution process, does not have much to add. In fact, its distribution is very close to the average distribution (at the centre of the graph), so its inertia is very low. In other words, there are no surprises.
RECO, the question about recommending the product, is like SALE, vastly unknown. This merits a little more explanation, and a hypothesis.
One important fact has maybe slipped by unnoticed. The analysis was made on total counts of answers per question. This aggregation has lost all form of association between the answers. If every ten in RECO was matched by a one in SALE, and vice versa, they would be negatively correlated, for example. Since the links between the answers have been broken, all we can say about the proximity between RECO and SALE is that the same answering profile was used, not that they are otherwise correlated. Likewise the proximity between ten and one doesn't mean they often occur together, but only that they are used most or least in similar questions.
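A tiny example makes the point concrete. In this Python sketch, the four customers and their answers are invented: in one scenario RECO and SALE always agree, in the other they are always opposed, yet the counts per question are identical.

```python
from collections import Counter

# Invented (RECO, SALE) answers from four customers, in two scenarios.
agreeing = [(10, 10), (10, 10), (1, 1), (1, 1)]
opposing = [(10, 1), (10, 1), (1, 10), (1, 10)]


def counts_per_question(surveys):
    """Aggregate as in the occurrences table: one tally per question."""
    reco = Counter(r for r, _ in surveys)
    sale = Counter(s for _, s in surveys)
    return reco, sale


def cross_tabulation(surveys):
    """Keep the link between the two answers given by each customer."""
    return Counter(surveys)


# Same totals per question in both scenarios...
print(counts_per_question(agreeing) == counts_per_question(opposing))  # True
# ...but the pairs of answers are completely different.
print(cross_tabulation(agreeing) == cross_tabulation(opposing))        # False
```

An analysis fed only with the per-question totals cannot tell these two scenarios apart; only the original surveys, or a cross-tabulation of the two questions, can.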
Refining the Analysis
We expect the graph above to have quite a success during the presentation to management. It shows many unexpected facts about the surveys, and it allows us to address problems both in the method and in the results.
However, the boss isn't happy.
He is the kind of manager who, despite being rather intelligent, has the annoying tendency to simplify everything. He says "this is all very nice, but it doesn't show me what I need to know". He goes on explaining that there are only three scores: green, yellow, and red. He wants all customers to tick ten (or nine -- he does understand some psychology), so these are green. Scores between six and eight are tolerable: yellow. Five and below is not acceptable, action needs to be taken, so that's red.
He also doesn't like the six questions. He can affect the product (research and development, quality control), the price, and the service. He wants to know how these three poles are positioned and scored "on a scale of red to green". Could we please do that?
In terms of an occurrences table, this means grouping. Instead of ten rows, we will have only three; the rule has been offered. After some debate, the groupings of questions are PROD and RECO for "PP -- the product", PRIX alone for "$$ -- the price", and the remaining three questions for "SE -- the service". A table of three by three.
Only nine numbers (besides the totals). That must be extreme descriptive statistics. The chi-square distances at the right show there is some correspondence: the "pricing" column will determine the major axis, having the highest values (way too many "red" surveys, not nearly enough "green").
It's perhaps extreme, but the method works mathematically. We obtain exactly two axes (one dimension is always lost in CA), and the coordinates of six points, the row and column headers, on a graph.
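The grouping itself is plain addition, as this Python sketch with NumPy shows. The ten-by-six table is invented for illustration (it is not the survey data); only the grouping rules, red for one to five, yellow for six to eight, green for nine and ten, and the three poles PP, $$ and SE, come from the discussion above.

```python
import numpy as np

questions = ["PROD", "PRIX", "SALE", "TECH", "P.R.", "RECO"]
# Invented counts for illustration only: one row per score, from 1 to 10.
table = np.array([
    [2, 3, 8, 1, 2, 7],
    [1, 2, 1, 0, 1, 1],
    [2, 6, 1, 1, 2, 1],
    [3, 7, 2, 1, 3, 2],
    [5, 9, 4, 2, 5, 4],
    [6, 7, 3, 3, 6, 3],
    [18, 16, 10, 10, 17, 9],
    [30, 28, 20, 22, 30, 18],
    [22, 15, 16, 28, 22, 17],
    [11, 7, 35, 32, 12, 38],
])

colours = ["red", "yellow", "green"]
poles = ["PP", "$$", "SE"]
colour_of_score = ["red"] * 5 + ["yellow"] * 3 + ["green"] * 2
pole_of_question = {"PROD": "PP", "RECO": "PP", "PRIX": "$$",
                    "SALE": "SE", "TECH": "SE", "P.R.": "SE"}

# Add every cell of the detailed table to the cell of its group.
grouped = np.zeros((len(colours), len(poles)))
for i, colour in enumerate(colour_of_score):
    for j, question in enumerate(questions):
        row = colours.index(colour)
        col = poles.index(pole_of_question[question])
        grouped[row, col] += table[i, j]

# Same transformation as in a correspondence analysis, then count the axes
# that carry some inertia.
P = grouped / grouped.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
sigma = np.linalg.svd(S, compute_uv=False)
n_axes = int((sigma > 1e-8).sum())

print(grouped)   # the three-by-three table
print(n_axes)    # number of usable axes
```

A three-by-three table has three singular values, but the last one is zero by construction, which leaves the two axes of the graph.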
That's exactly what the boss said he wanted, but we can already hear the reaction "cute, but what exactly am I looking at?". So let's take an initiative, and project on the same plane the information of the details in each group.
This is a slide we can talk about for a few minutes. It shows the same points as before, but from a slightly different angle. The "green dot" has clearly captured scores 'ten' and 'nine' (and thus erased their differences); the "yellow dot" is at the centre of gravity of scores 'six' to 'eight' (remember that 'six' is a rare score); the "red dot" attempts vainly to assemble all other scores, but fails to represent 'one' meaningfully.
The "pricing group" is of course equivalent to PRIX (a one-member group); the "service group" is exactly between variables TECH, SALE, and P.R.; the "product" group finds the middle point between PROD and RECO (a rather artificial construct).
This new point of view has masked some important differences in the distributions, in particular the "objective" versus "subjective" axis. By doing so, the associations that are meaningful to the boss have become more visible, and easier to talk about. Other groupings would produce other angles. I won't debate the formal merit of this attempt, but it provided a good example.
The Beauty of the Correspondence Analysis
There are a few aspects of the method which didn't find their place in the narrative above. The analysis produces additional sets of numbers, called the contributions and the squared cosines, which help in interpreting the graphs. I used them implicitly in a couple of places without elaboration. In a real or more complex analysis, these numbers would have to be exploited consistently.
The second aspect has been demonstrated but not explained: the ability to show explanatory variables together with the main variables used in the analysis. This can take two forms. If some rows or columns have been grouped, they can be "ungrouped" in the graph, without changing the current rotation or view. The "group dot" will be at the centre of gravity of the "ungrouped dots". Entirely new data can also be injected, provided a similar distribution is available. This is what I did with the "n/a" distribution. Although uninteresting as a major distribution in the analysis, it was still possible to plot it among the others.
Finally, it is possible to use the analysis with yes/no answers, something which hasn't been demonstrated here.
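Here is a minimal sketch, in Python with NumPy, of how an extra column can be placed on an existing graph. It assumes the standard "transition formula" of correspondence analysis (a column sits at the weighted average of the row coordinates, rescaled by the singular value of each axis); the function names and all counts are invented for the illustration.

```python
import numpy as np


def ca_rows(counts):
    """Singular values and row coordinates of a correspondence analysis."""
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, _ = np.linalg.svd(S, full_matrices=False)
    keep = sigma > 1e-10                      # drop the axis that carries no inertia
    return sigma[keep], (U[:, keep] * sigma[keep]) / np.sqrt(r)[:, None]


def project_column(column, sigma, row_coords):
    """Place a column of counts on existing axes without changing them."""
    profile = np.asarray(column, dtype=float)
    profile = profile / profile.sum()
    return profile @ row_coords / sigma


# Invented table: the third column is a group made of two detailed columns.
detail_1 = np.array([25, 10, 2])
detail_2 = np.array([5, 10, 8])
active = np.array([
    [10, 20, 30],
    [20, 20, 20],
    [30, 10, 10],
])

sigma, row_coords = ca_rows(active)
group_dot = project_column(active[:, 2], sigma, row_coords)
detail_dots = [project_column(d, sigma, row_coords) for d in (detail_1, detail_2)]

# Centre of gravity of the "ungrouped dots", weighted by their total counts.
centre = np.average(detail_dots, axis=0, weights=[detail_1.sum(), detail_2.sum()])
print(np.allclose(group_dot, centre))
```

The same `project_column` helper would serve for entirely new data, such as a column that was left out of the analysis, as long as it is counted over the same rows.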
I still remember the first graph I saw produced by this method. It displayed -- on the same graph! -- archaeological digs and types of pottery. It was immediately readable, it fit perfectly with the little I knew about the data, and even better with the authors' conclusions. I knew immediately that I needed to master this technique, not quite believing it yet. At the time, I had to decipher a FORTRAN library to learn about eigenvalues and eigenvectors, but it was worth it -- I wasn't disappointed.
I do believe it requires a fair amount of investment. It's not something you can just learn and apply. This is why I titled this section "what you can ask a statistician to do". It isn't extraordinarily complex, but it isn't a cookbook recipe. This is perhaps why the resources available are very scarce in English, and only decent (but more numerous) in French. The French name is « analyse factorielle des correspondances », or AFC, but even that page is still under construction at the time of writing.
Used properly it is a fantastic tool to explore, exploit, and present contingency tables, which occur in so many fields. For some reason, I find the elegant handling of tiny tables most beautiful.
what we should have known from the start
There are really only three things to remember: know your subject, trust your ability to read numbers, and use descriptive statistics instead of parametric statistics.
Questions about Questions
All we wanted to do was to exploit the answers, in effect rating various things. In order to do so, we ended up rating and interpreting the questions themselves, the scoring values, and the customer psychology. This looks convoluted and needlessly complicated. However, there is no way around it.
The information collected by a survey depends heavily on its quality. Creating good survey questions and techniques is an academic branch studied in sociology, economics, and politics. Naturally, the media and marketing consume a lot of these specialists as well. Maybe it has become clear what makes it so complicated.
In the data used for this article, we discovered that "on a scale of one to ten", the answers one and ten are polysemous, along with five, to a lesser degree. They are sometimes part of the scale, for objective questions and fair customers, but are also used to mean something other than "very bad" and "very good".
We also discovered that a meaningful minority of customers needed or wanted to answer "outside of the scale" to express their feelings or opinions about the sales department and to answer the question about recommending the product.
I believe that the mysterious grouping ('one', 'ten', 'RECO' and 'SALE') on the correspondence analysis means that the questions triggered polysemous answers. This hypothesis could be tested by creating variations of the survey. If a question is too general, it should be split into smaller questions; if it is unclear, it should be rephrased; if it elicits many 'one' and 'ten' answers, perhaps it should be a 'yes', 'no' question?
What happens if new options are added? For example, a box for "not applicable, don't know, don't care"? Or an additional check box labelled "and thank you!" next to the scale? What becomes of RECO if it's rewritten as "would you recommend our product? (yes/no); if yes, please rate on a scale of one to ten..."?
A totally different hypothesis could emerge if a sociologist studied the survey process. If the survey is always conducted by a member of the sales department, it totally changes the meaning of the question. The customer reads "how would you rate the sales department", looks up, sees the employee smiling (or picking his nose) and the question becomes "how would you rate me?"
Continuing the discussion would go beyond the scope of this article. But it's good to have some of these ideas in mind when interpreting any survey, if only to keep the ability to see the questions from the customer's point of view.
Descriptive and Exploratory Statistics
As I have tried to show, there is no magical statistical wand to extract answers from a bunch of numbers. Especially when these numbers are the result of human interactions, and not physical measures. What techniques are there in the end?
The main technique is counting. Of course, the computer does that for you, but you need to remain conscious of the fact. That is the difference between the mean and the average, if you like. The mean seems to be a property of any group of numbers, a parameter of a distribution, while the average (to me anyway) designates the operation of summing up and dividing among all. We have added scores; does that make sense? Are two fives worth a ten? Then we have divided. Do the answers two, nine, and ten really mean the same thing as three sevens?
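The last question can be checked in a couple of lines of Python, using only the two sets of answers mentioned above (the variable names are mine).

```python
from collections import Counter
from statistics import mean

mixed = [2, 9, 10]     # one unhappy answer, two enthusiastic ones
steady = [7, 7, 7]     # three moderately satisfied answers

# Summing and dividing gives the same number for both...
print(mean(mixed), mean(steady))
# ...while simply counting the answers keeps the difference visible.
print(Counter(mixed))
print(Counter(steady))
```

Both sets average to seven, yet nobody in the first set actually answered seven.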
The second is reading. The data is made of numbers, but it reads. "Eight customers rated the product at one. Among those, three actually rated everything at one, but five were very pleased with either the technical support or with the sales department." This is informative. It shows that in some cases, 'one' means "it didn't work at first, I had to come back for...".
And there is no third method. Correlation coefficients, confidence intervals, standard errors, skewness, standard deviation, analysis of variance, multi-variable linear regression, coefficient of determination, kurtosis, normal test, nothing helps.
Many of them are essential for more serious surveys, trying to anticipate election results, planning complex and expensive marketing strategies, answering fundamental questions in psychology or sociology. But not for customer satisfaction surveys.
The only "almost magical" technique is the correspondence analysis. But it has more to say about the survey itself than about the things it tries to evaluate, and it only helps to show graphically what you already know or guess about the distributions.
In the end, the most common criterion for the choice of a method or a graphical representation is "what will look best for my presentation next Monday? what will support my opinions in the most drastic manner?"
The origin of the article was a couple of questions here on EE. My answers were much shorter, of course, but some sentences have survived the cut-and-paste operations and are reproduced here exactly.
I would like to thank the Asker, Bill B., for his enthusiasm and encouragement when I thought about making it an article. I might not have taken the step without him. Given the number of surveys one is asked to fill in as a customer, it is possible that this article could benefit a larger audience. In the end, it was much more work than I had anticipated; it must be the longest article on EE so far.
At first, the article didn't contain any multivariate analysis. It focused on "what doesn't work" (parametric statistics) and "what works" (knowing the subject and reading the tables). Once I started really playing with the data -- I tend to do a CA on any table that falls into my hands -- it was too late.
If I ever write a tutorial on these methods, I will add a link here. In the meantime, please ask questions in an appropriate topic area if you need help implementing them (in a spreadsheet) or using them (in a statistical package).
I hope you found something useful, and enjoyed the pictures.