- Community Pick
- Experts Exchange Approved
- Editor's Choice
Introduction
Access provides functions for computing several common descriptive aggregate statistics for column data, offering such aggregate functions as Count, Sum, Min, Max, standard deviation and variance (StDev, StDevP, Var, and VarP), and average (Avg). In addition to these typical aggregate functions, Access also offers the so-called "domain aggregate" counterparts for each of the aggregates already mentioned.
In a recent article, I described how to extend that list to include the median, mode, skewness, and kurtosis statistics.
However, none of these statistics tell us anything about how a paired columns of data relate to each. This article addresses that gap by demonstrating how to use Microsoft Access to calculate the covariance and correlation between a pair of columns.
The next two sections will take covariance and correlation in turn, and demonstrate how to perform the computations first using native Jet SQL, and then using the new Visual Basic for Applications (VBA) functions that provide "domain aggregate"-like capabilities for covariance and correlation and are included with this article.
The concluding sections of this article will contain:
- The source code for these new user-defined functions
- Provide sample files and instructions for implementing the new functions in your Access projects
- Suggest additional resources for more rigorous statistical analysis
Covariance
The covariance statistic measures how much two random variables tend to change in tandem.
- Two completely independent variables will have a covariance of zero
- Two random variables that tend to increase or decrease in tandem will have a positive covariance
- Two random variables that tend to increase or decrease inversely (i.e., as one increases the other decreases, and vice versa) will have a negative covariance
Covariance is determined as follows:
In native Jet SQL, you can calculate the covariance of two columns in a single table using aggregate functions. The following expression is adapted from the query "qrySet1_SQL_CorrCov_All" in the sample database:
The SQL statement above computes the covariance using all of the data from columns X and Y in the table Set1. However, there are natural subsets in that data:
- There are columns for Code and Section
- Thus, we may want to compute the covariances in such subsets as for each Code value, each Section value, or the combination of Code and Section
To get these subset results, simply modify the query to include the subset columns in the SELECT clause, and to use a GROUP BY clause with those subset columns. The examples below all demonstrate this technique, and are also adapted from several queries available in the sample database:
For users accustomed to writing complex SQL statements in Access, none of those queries are particularly difficult.
Users who prefer to mitigate the risk of making a mistake in the formula may instead want to use the new function DCovariance provided here. Substituting the DCovariance approach, the queries above become a bit simpler, as follows:
The DCovariance function takes the following arguments:
- X_Column is the name of the column containing the data for the first random variable
- Y_Column is the name of the column containing the data for the second random variable
- Tbl is the name of the table or query containing your data
- Criteria is an optional argument containing any criteria used to select subsets of your data. Be sure to use single quotes to qualify text values and # to qualify dates in the Criteria string
For all four arguments, I strongly recommended that you place the column/table/query names in square brackets. This is mandatory if your database object names violate the usual naming rules, such as by including spaces, punctuation, keywords, etc.
The Source Code section to this article contains the VBA source code for the DCovariance function, and the sample database provided in the Sample Files section illustrates several use cases each for calculating covariance using SQL and using DCovariance.
Correlation
The correlation statistic is another measure of the dependence between two random variables. This article's scope covers the commonly-used, so-called "Pearson correlation", which indicates the linearity of the relationship between the two random variables:
- The Pearson correlation has a minimum value of -1 and a maximum value of 1
- The absolute value indicates the goodness-to-fit for a linear model relating the two variables: an absolute value of 1 indicates a perfect linear relationship (no observed residuals)
- A positive value indicates that the two variables tend to rise or fall in value in tandem
- A negative value indicates an inverse relationship: as one variable rises, the other tends to fall, and vice versa
- A correlation of zero indicates that there is no apparent relationship, and that the two variables are independent
Correlation is calculated as follows:
As examples, the scatter plots below depict the data for the two sample sets available in the sample database:
In Set1, X and Y have a correlation of approximately 0.75, indicating a strong, positive linear relationship.
In Set1, X and Y have a correlation of approximately -0.66, indicating a fairly strong, negative linear relationship.
Before proceeding, two notes of caution about correlation:
- One of the enduring truisms of statistical analysis is that "correlation does not imply causation". Even if X and Y are highly correlated, it does not follow that X necessarily causes Y, or vice versa
- Particularly over small data sets or over a compressed range, two variables may appear to have a strong Pearson correlation, and yet the data may not have a truly linear relationship. A dramatic example of this is Anscombe's_quartet, a series of four modest-sized data sets that have identical descriptive statistics such as mean, variance, and correlation, but when graphed are easily seen as having very different properties (and not necessarily linearly related, either)
In native Jet SQL, as with covariance, you can calculate the correlation of two columns in a single table by means of aggregate functions. The following expressions are adapted from various queries in the sample database. As with the examples for covariance, to calculate correlation for subsets of the data, use GROUP BY clauses to define your subsets.
Again, while using such SQL statements is not beyond the means of experienced Access users, the DCorrelation function offers a convenient way to avoid having to remember and write the necessary SQL expressions, thus simplifying the task for less experienced Access users.
Rewriting the queries above to use the DCorrelation function yields the following SQL statements:
The DCorrelation function takes the following arguments:
- X_Column is the name of the column containing the data for the first random variable
- Y_Column is the name of the column containing the data for the second random variable
- Tbl is the name of the table or query containing your data
- Criteria is an optional argument containing any criteria used to select subsets of your data. Be sure to use single quotes to qualify text values and # to qualify dates in the Criteria string
For all four arguments, I strongly recommended that you place the column/table/query names in square brackets. This is mandatory if your database object names violate the usual naming rules, such as by including spaces, punctuation, keywords, etc.
The Source Code section to this article contains the VBA source code for the DCorrelation function, and the sample database provided in the Sample Files section illustrates several use cases each for calculating correlation using SQL and using DCorrelation.
Source Code
Below please find the VBA source code for the DCovariance and DCorrelation functions.
Sample Files
This article includes two sample files:
Access database
- Two tables, each with a set of data for two random variables
- 16 queries, demonstrating how to compute the covariance and correlation statistics for each set of sample data, as well as for various subsets, using both native Jet SQL and the new user-defined functions
- VBA source code for the DCovariance and DCorrelation functions
Excel workbook
- Two worksheets corresponding to the two tables in the Access database
- Computed statistics corresponding to the various queries in the Access database
Implementing DCovariance and DCorrelation in Your Access Projects
To implement the DCovariance and/or DCorrelation functions in your Access projects, simply copy the source code listed for the desired functions(s) in this article, and paste the code into a regular VBA module in your Access project.
Once you have added the code to your Access project, you will be able to use the functions you added in queries, forms, reports, and in other VBA procedures in your project.
Notes:
- To reach the Visual Basic editor from Access, use Alt+F11
- Be sure to post the code in a "regular" module, and not a class module, form module, or report module
- Make sure that the module name does not match any of the sub or function names in that module
Additional Resources
While Access has no native functions for correlation or covariance, Excel has native functions for both. The workbook posted in the Sample Files section demonstrates how to use the CORREL and COVAR functions to compute correlation and covariance, respectively. Please note that, to compute the subset correlations and covariances, I used "array formulas", as the ranges for the two variables were not contiguous.
In addition to Excel, any decent statistical analysis package will calculate correlation and covariance. For my important statistics work I prefer Minitab, but SAS, SPSS, and Mathematica will all perform well.
In addition, FMS, Inc. markets the Total Access Statistics add-in, a comprehensive statistical package that like the statistics packages above goes far beyond simple covariance and correlation. Several Experts here at Experts Exchange recommend that product.
Acknowledgments
Originally, the SQL examples and the VBA code for this article used correct, but sub-optimal, calculation logic to compute the covariance and correlation statistics. Access Expert and Page Editor harfang pointed this out to me, and he was kind enough to also supply an optimized calculation method. (My original logic used a sub-query to generate averages; the new logic dispenses with the sub-query and makes all the calculations in a single pass.)
=-=-=-=-=-=-=-=-=-=-=-=-=-
If you liked this article and want to see more from this author, please click here.
If you found this article helpful, please click the Yes button near the:
Was this article helpful?
label that is just below and to the right of this text. Thanks!
=-=-=-=-=-=-=-=-=-=-=-=-=-
by: younghv on 2010-03-25 at 16:49:56ID: 11868
Great stuff!
It brings back memories of my college Statistics courses.
Yes vote above.