Simple (?) Correlation Matrix in SQL

Posted on 2011-09-28
Last Modified: 2012-06-27
Hello Experts,

I have responses to a quiz sitting in a table something like

Student_ID   Question_ID   Correct(Y/N)

I would like to correlate the likelihood of a question being answered correctly to all of the other questions and their likelihood of being answered correctly. My friend recently mumbled something about "correlation matrix" and web reading makes the pearson coefficient sound useful but I don't know much about statistics.

In the long run, I want to be able to ask a series of questions about any topic which are then graded either Correct or Incorrect (no middle ground) and then be able to make predictions about other questions in the set a person may be likely to get right or wrong (e.g. you correctly answered that bread is made of flour, so perhaps you also know that cakes are made of flour ... vice versa I do not know what water is, so its unlikely I know what ice is).

Rather than me making assumptions about topics and concepts that go together, I would like to be dynamically gathering data and understanding how students connect different concepts (and yes hoping to find lots of surprising connections!).

In a perfect world I would like to do this all in SQL Server itself without having to move data back and forth into statistical packages - ideally the model can keep building based on new inputs and adapt over time.

Question by:Split_Pin
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 7
  • 7
LVL 50

Expert Comment

ID: 36715719
not clear how you'd do this in pure t-sql...

however this (to me) sounds like neural net processing , or Expert Systems processing to some extent.

you may also wish to look into Baye's theorem (new scientist july 2011), i don't know if its related to the pearson coefficient...

good luck
LVL 50

Expert Comment

ID: 36715732
you may wish to assign the question to some maths/statistics topic areas or seek moderator assistance...

Author Comment

ID: 36715995
Thank you lowfatspread

Yesterday I found an interesting article about how to apply the pearson coefficient in SQL which has encouraged me though I am not 100% sure I am doing it right...

Hoping that by trolling here I might hit a stats person who dabbles in SQL or vice versa.
I will check out the article you suggested on Bayes theorem.
Cheers, Split_Pin

Database Solutions Engineer FAQs

In this series, we will discuss common questions received as a database Solutions Engineer at Percona. In this role, we speak with a wide array of MySQL and MongoDB users responsible for both extremely large and complex environments to smaller single-server environments.

LVL 50

Expert Comment

ID: 36716050

not sure how you translate your yes/no values for the calculations ....

and the sql shown , to me is slightly suspect as the left outer join used is then ignored by not allowing for nulls in the where clause... so why wasn't an inner join specified in the first place...

i'm still unclear as to how you will distinguish between the flour and water based groups ... or is that supposed to be external to the process?

        user1, user2,
        ((psum - (sum1 * sum2 / n)) / sqrt((sum1sq - pow(sum1, 2.0) / n) * (sum2sq - pow(sum2, 2.0) / n))) AS r,
                n1.user AS user1,
                n2.user AS user2,
                SUM(n1.rating) AS sum1,
                SUM(n2.rating) AS sum2,
                SUM(n1.rating * n1.rating) AS sum1sq,
                SUM(n2.rating * n2.rating) AS sum2sq,
                SUM(n1.rating * n2.rating) AS psum,
                COUNT(*) AS n
                testdata AS n1
		testdata AS n2
	ON =
                n1.user > n2.user
		n1.user, n2.user) AS step1
        r DESC,
        n DESC

Open in new window

LVL 50

Accepted Solution

Lowfatspread earned 500 total points
ID: 36716061
you correctly answered that bread is made of flour, so perhaps you also know that cakes are made of flour ... vice versa I do not know what water is, so its unlikely I know what ice is).

you maybe better off considering a language like prolog which would allow you to state you facts  and assist in inferring further relationships...

Author Comment

ID: 36716151
I am not saying that prolog doesn't sound totally awesome but ...

Say my data set looks like this (I am student number 1, and possibly from another planet as my general knowledge seems rather limited).

Student_ID        Question_ID       Question Text                                               My Response             Correct(Y/N)
1                        1                       Bread is made from flour true or false              TRUE                      1
1                        2                       Cakes are made from flour true or false      TRUE                      1
1                        3                       Water is flammable true or false                      TRUE                      0
1                        4                       Ice is a type of gas true or false                      TRUE                      0

What I am hoping for is a SQL function/procedure/query that will calculate the correlation between every question. As more students try the quiz, we might find there is a stronger correlation between certain pairs of questions (or maybe larger combinations too) being answered correctly (or incorrectly) than others.

Incidentally a friend tells me something like this is used to evaluate exam responses to detect cheating (on the basis that there are certain things you should get right if you got other similar things right - however I don't want to superimpose what I think is "similar" - I want the raw data to tell me knowledge combinations that the study population happens to know coincidentally

At this stage in the game, the program does not have to interpret the question, the data set shows if it was answered correctly or not.

I guess worst case scenario is download some stats packages and run the matrix offline periodically.

PS: Thank you for coming this far with me in the discussion -  Its further than I got with my work buddies!!
LVL 50

Expert Comment

ID: 36716240
help me

what value do you assign an individual question answer  to map into the "rating"  for the sample sql given so far?

you want to correlate a question to all the others...
and/or groups of questions?


Author Comment

ID: 36716282
Playing with this yesterday, I plugged the field "Correct" into that query in place of rating - this field had either a 1 or 0 value, and it seemed to work.

What I think it was returning however was the overall correlation between each respondents (correlation between students, not the questions - or in the example from that site, the reviewers not the movie).

To start with, I would be thrilled just to get the correlation between every possible combination of questions. This by itself would let me arrange questions in order from strongest to weakest connection to existing student knowledge.

LVL 50

Expert Comment

ID: 36716341
ok i wasnt sure a 1/0 would work ... divide by zero etc ...  ( wondered about 1/-1 or 0.5/-0.5)

let  me see if i can do something... do you have a larger sample dataset available for testing (say 100-500 rows , 10-20 students ?) that you could post

my response would be a couple of hours at least <other commitments...;-)>

Author Comment

ID: 36716459
I've attached a simple demo I made up yesterday, 10 Students answering 10 questions.
The actual responses are randomly right (1) or wrong (0), the questions have no actual text associated to them at all.

The data is in Table called Sample1

I also have 3 queries in here where I was trying to bend the pearson stuff to suit my needs (and I think I sort of make it show the best question for each student to revise, based on student to student correlation and identifying questions "I got wrong but similar students got right"... At least thats what I was taking a wild stab at.

If you can come up with anything based on the table that correlates question to question that would be awesome (no rush, this is a long term project I am only starting to finally devote time to - no doubt you are very busy too). I realise the sample set is not large enough to provide any reliable conclusions, but at this stage the mechanism is the important part, which I can then put into real testing with a meaningful topic.

Once again, thank you for coming this far with me!
Cheers, Split_Pin
LVL 50

Expert Comment

ID: 36716485
i dont have access any chance of a text/csv file?

Author Comment

ID: 36716547

Here is an export to text - hope that is ok.

Author Comment

ID: 36848780
Hi Lowfatspread,
I have started a new thread in the Maths and Science area.
Thanks for your help so far - if you do get a chance to play with the sample data please let me know how you get on.
Cheers, Adrian

Author Closing Comment

ID: 36849329
Thanks for valuable input to the discussion - I have a long way to go with this!
Will check out prolog too!

Featured Post

Comprehensive Backup Solutions for Microsoft

Acronis protects the complete Microsoft technology stack: Windows Server, Windows PC, laptop and Surface data; Microsoft business applications; Microsoft Hyper-V; Azure VMs; Microsoft Windows Server 2016; Microsoft Exchange 2016 and SQL Server 2016.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Recently we ran in to an issue while running some SQL jobs where we were trying to process the cubes.  We got an error saying failure stating 'NT SERVICE\SQLSERVERAGENT does not have access to Analysis Services. So this is a way to automate that wit…
It is possible to export the data of a SQL Table in SSMS and generate INSERT statements. It's neatly tucked away in the generate scripts option of a database.
Using examples as well as descriptions, and references to Books Online, show the different Recovery Models available in SQL Server and explain, as well as show how full, differential and transaction log backups are performed
Viewers will learn how to use the SELECT statement in SQL to return specific rows and columns, with various degrees of sorting and limits in place.

623 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question