Go Premium for a chance to win a PS4. Enter to Win


Simple (?) Correlation Matrix in SQL

Posted on 2011-09-28
Medium Priority
Last Modified: 2012-06-27
Hello Experts,

I have responses to a quiz sitting in a table something like

Student_ID   Question_ID   Correct(Y/N)

I would like to correlate the likelihood of a question being answered correctly to all of the other questions and their likelihood of being answered correctly. My friend recently mumbled something about "correlation matrix" and web reading makes the pearson coefficient sound useful but I don't know much about statistics.

In the long run, I want to be able to ask a series of questions about any topic which are then graded either Correct or Incorrect (no middle ground) and then be able to make predictions about other questions in the set a person may be likely to get right or wrong (e.g. you correctly answered that bread is made of flour, so perhaps you also know that cakes are made of flour ... vice versa I do not know what water is, so its unlikely I know what ice is).

Rather than me making assumptions about topics and concepts that go together, I would like to be dynamically gathering data and understanding how students connect different concepts (and yes hoping to find lots of surprising connections!).

In a perfect world I would like to do this all in SQL Server itself without having to move data back and forth into statistical packages - ideally the model can keep building based on new inputs and adapt over time.

Question by:Split_Pin
  • 7
  • 7
LVL 50

Expert Comment

ID: 36715719
not clear how you'd do this in pure t-sql...

however this (to me) sounds like neural net processing , or Expert Systems processing to some extent.

you may also wish to look into Baye's theorem (new scientist july 2011), i don't know if its related to the pearson coefficient...

good luck
LVL 50

Expert Comment

ID: 36715732
you may wish to assign the question to some maths/statistics topic areas or seek moderator assistance...

Author Comment

ID: 36715995
Thank you lowfatspread

Yesterday I found an interesting article about how to apply the pearson coefficient in SQL which has encouraged me though I am not 100% sure I am doing it right...

Hoping that by trolling here I might hit a stats person who dabbles in SQL or vice versa.
I will check out the article you suggested on Bayes theorem.
Cheers, Split_Pin

Get free NFR key for Veeam Availability Suite 9.5

Veeam is happy to provide a free NFR license (1 year, 2 sockets) to all certified IT Pros. The license allows for the non-production use of Veeam Availability Suite v9.5 in your home lab, without any feature limitations. It works for both VMware and Hyper-V environments

LVL 50

Expert Comment

ID: 36716050

not sure how you translate your yes/no values for the calculations ....

and the sql shown , to me is slightly suspect as the left outer join used is then ignored by not allowing for nulls in the where clause... so why wasn't an inner join specified in the first place...

i'm still unclear as to how you will distinguish between the flour and water based groups ... or is that supposed to be external to the process?

        user1, user2,
        ((psum - (sum1 * sum2 / n)) / sqrt((sum1sq - pow(sum1, 2.0) / n) * (sum2sq - pow(sum2, 2.0) / n))) AS r,
                n1.user AS user1,
                n2.user AS user2,
                SUM(n1.rating) AS sum1,
                SUM(n2.rating) AS sum2,
                SUM(n1.rating * n1.rating) AS sum1sq,
                SUM(n2.rating * n2.rating) AS sum2sq,
                SUM(n1.rating * n2.rating) AS psum,
                COUNT(*) AS n
                testdata AS n1
		testdata AS n2
		n1.movie = n2.movie
                n1.user > n2.user
		n1.user, n2.user) AS step1
        r DESC,
        n DESC

Open in new window

LVL 50

Accepted Solution

Lowfatspread earned 2000 total points
ID: 36716061
you correctly answered that bread is made of flour, so perhaps you also know that cakes are made of flour ... vice versa I do not know what water is, so its unlikely I know what ice is).

you maybe better off considering a language like prolog which would allow you to state you facts  and assist in inferring further relationships...

Author Comment

ID: 36716151
I am not saying that prolog doesn't sound totally awesome but ...

Say my data set looks like this (I am student number 1, and possibly from another planet as my general knowledge seems rather limited).

Student_ID        Question_ID       Question Text                                               My Response             Correct(Y/N)
1                        1                       Bread is made from flour true or false              TRUE                      1
1                        2                       Cakes are made from flour true or false      TRUE                      1
1                        3                       Water is flammable true or false                      TRUE                      0
1                        4                       Ice is a type of gas true or false                      TRUE                      0

What I am hoping for is a SQL function/procedure/query that will calculate the correlation between every question. As more students try the quiz, we might find there is a stronger correlation between certain pairs of questions (or maybe larger combinations too) being answered correctly (or incorrectly) than others.

Incidentally a friend tells me something like this is used to evaluate exam responses to detect cheating (on the basis that there are certain things you should get right if you got other similar things right - however I don't want to superimpose what I think is "similar" - I want the raw data to tell me knowledge combinations that the study population happens to know coincidentally

At this stage in the game, the program does not have to interpret the question, the data set shows if it was answered correctly or not.

I guess worst case scenario is download some stats packages and run the matrix offline periodically.

PS: Thank you for coming this far with me in the discussion -  Its further than I got with my work buddies!!
LVL 50

Expert Comment

ID: 36716240
help me

what value do you assign an individual question answer  to map into the "rating"  for the sample sql given so far?

you want to correlate a question to all the others...
and/or groups of questions?


Author Comment

ID: 36716282
Playing with this yesterday, I plugged the field "Correct" into that query in place of rating - this field had either a 1 or 0 value, and it seemed to work.

What I think it was returning however was the overall correlation between each respondents (correlation between students, not the questions - or in the example from that site, the reviewers not the movie).

To start with, I would be thrilled just to get the correlation between every possible combination of questions. This by itself would let me arrange questions in order from strongest to weakest connection to existing student knowledge.

LVL 50

Expert Comment

ID: 36716341
ok i wasnt sure a 1/0 would work ... divide by zero etc ...  ( wondered about 1/-1 or 0.5/-0.5)

let  me see if i can do something... do you have a larger sample dataset available for testing (say 100-500 rows , 10-20 students ?) that you could post

my response would be a couple of hours at least <other commitments...;-)>

Author Comment

ID: 36716459
I've attached a simple demo I made up yesterday, 10 Students answering 10 questions.
The actual responses are randomly right (1) or wrong (0), the questions have no actual text associated to them at all.

The data is in Table called Sample1

I also have 3 queries in here where I was trying to bend the pearson stuff to suit my needs (and I think I sort of make it show the best question for each student to revise, based on student to student correlation and identifying questions "I got wrong but similar students got right"... At least thats what I was taking a wild stab at.

If you can come up with anything based on the table that correlates question to question that would be awesome (no rush, this is a long term project I am only starting to finally devote time to - no doubt you are very busy too). I realise the sample set is not large enough to provide any reliable conclusions, but at this stage the mechanism is the important part, which I can then put into real testing with a meaningful topic.

Once again, thank you for coming this far with me!
Cheers, Split_Pin
LVL 50

Expert Comment

ID: 36716485
i dont have access any chance of a text/csv file?

Author Comment

ID: 36716547

Here is an export to text - hope that is ok.

Author Comment

ID: 36848780
Hi Lowfatspread,
I have started a new thread in the Maths and Science area.
Thanks for your help so far - if you do get a chance to play with the sample data please let me know how you get on.
Cheers, Adrian

Author Closing Comment

ID: 36849329
Thanks for valuable input to the discussion - I have a long way to go with this!
Will check out prolog too!

Featured Post

Concerto Cloud for Software Providers & ISVs

Can Concerto Cloud Services help you focus on evolving your application offerings, while delivering the best cloud experience to your customers? From DevOps to revenue models and customer support, the answer is yes!

Learn how Concerto can help you.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Why is this different from all of the other step by step guides?  Because I make a living as a DBA and not as a writer and I lived through this experience. Defining the name: When I talk to people they say different names on this subject stuff l…
An alternative to the "For XML" way of pivoting and concatenating result sets into strings, and an easy introduction to "common table expressions" (CTEs). Being someone who is always looking for alternatives to "work your data", I came across this …
Using examples as well as descriptions, and references to Books Online, show the different Recovery Models available in SQL Server and explain, as well as show how full, differential and transaction log backups are performed
Via a live example, show how to backup a database, simulate a failure backup the tail of the database transaction log and perform the restore.

916 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question