Solved

Simple (?) Correlation Matrix in SQL

Posted on 2011-09-28
14
715 Views
Last Modified: 2012-06-27
Hello Experts,

I have responses to a quiz sitting in a table something like

Student_ID   Question_ID   Correct(Y/N)

I would like to correlate the likelihood of a question being answered correctly to all of the other questions and their likelihood of being answered correctly. My friend recently mumbled something about "correlation matrix" and web reading makes the pearson coefficient sound useful but I don't know much about statistics.

In the long run, I want to be able to ask a series of questions about any topic which are then graded either Correct or Incorrect (no middle ground) and then be able to make predictions about other questions in the set a person may be likely to get right or wrong (e.g. you correctly answered that bread is made of flour, so perhaps you also know that cakes are made of flour ... vice versa I do not know what water is, so its unlikely I know what ice is).

Rather than me making assumptions about topics and concepts that go together, I would like to be dynamically gathering data and understanding how students connect different concepts (and yes hoping to find lots of surprising connections!).

In a perfect world I would like to do this all in SQL Server itself without having to move data back and forth into statistical packages - ideally the model can keep building based on new inputs and adapt over time.


0
Comment
Question by:Split_Pin
  • 7
  • 7
14 Comments
 
LVL 50

Expert Comment

by:Lowfatspread
ID: 36715719
not clear how you'd do this in pure t-sql...

however this (to me) sounds like neural net processing , or Expert Systems processing to some extent.

you may also wish to look into Baye's theorem (new scientist july 2011), i don't know if its related to the pearson coefficient...


good luck
0
 
LVL 50

Expert Comment

by:Lowfatspread
ID: 36715732
you may wish to assign the question to some maths/statistics topic areas or seek moderator assistance...
0
 

Author Comment

by:Split_Pin
ID: 36715995
Thank you lowfatspread

Yesterday I found an interesting article about how to apply the pearson coefficient in SQL which has encouraged me though I am not 100% sure I am doing it right...
http://www.vanheusden.com/misc/pearson.php

Hoping that by trolling here I might hit a stats person who dabbles in SQL or vice versa.
I will check out the article you suggested on Bayes theorem.
Cheers, Split_Pin

0
VMware Disaster Recovery and Data Protection

In this expert guide, you’ll learn about the components of a Modern Data Center. You will use cases for the value-added capabilities of Veeam®, including combining backup and replication for VMware disaster recovery and using replication for data center migration.

 
LVL 50

Expert Comment

by:Lowfatspread
ID: 36716050
hmmm

not sure how you translate your yes/no values for the calculations ....

and the sql shown , to me is slightly suspect as the left outer join used is then ignored by not allowing for nulls in the where clause... so why wasn't an inner join specified in the first place...

i'm still unclear as to how you will distinguish between the flour and water based groups ... or is that supposed to be external to the process?
 

SELECT  
        user1, user2,
        ((psum - (sum1 * sum2 / n)) / sqrt((sum1sq - pow(sum1, 2.0) / n) * (sum2sq - pow(sum2, 2.0) / n))) AS r,
        n
FROM
        (SELECT 
                n1.user AS user1,
                n2.user AS user2,
                SUM(n1.rating) AS sum1,
                SUM(n2.rating) AS sum2,
                SUM(n1.rating * n1.rating) AS sum1sq,
                SUM(n2.rating * n2.rating) AS sum2sq,
                SUM(n1.rating * n2.rating) AS psum,
                COUNT(*) AS n
        FROM
                testdata AS n1
	LEFT JOIN
		testdata AS n2
	ON
		n1.movie = n2.movie
        WHERE   
                n1.user > n2.user
	GROUP BY
		n1.user, n2.user) AS step1
ORDER BY
        r DESC,
        n DESC

Open in new window

0
 
LVL 50

Accepted Solution

by:
Lowfatspread earned 500 total points
ID: 36716061
you correctly answered that bread is made of flour, so perhaps you also know that cakes are made of flour ... vice versa I do not know what water is, so its unlikely I know what ice is).

you maybe better off considering a language like prolog which would allow you to state you facts  and assist in inferring further relationships...
0
 

Author Comment

by:Split_Pin
ID: 36716151
I am not saying that prolog doesn't sound totally awesome but ...

Say my data set looks like this (I am student number 1, and possibly from another planet as my general knowledge seems rather limited).

Student_ID        Question_ID       Question Text                                               My Response             Correct(Y/N)
1                        1                       Bread is made from flour true or false              TRUE                      1
1                        2                       Cakes are made from flour true or false      TRUE                      1
1                        3                       Water is flammable true or false                      TRUE                      0
1                        4                       Ice is a type of gas true or false                      TRUE                      0

What I am hoping for is a SQL function/procedure/query that will calculate the correlation between every question. As more students try the quiz, we might find there is a stronger correlation between certain pairs of questions (or maybe larger combinations too) being answered correctly (or incorrectly) than others.

Incidentally a friend tells me something like this is used to evaluate exam responses to detect cheating (on the basis that there are certain things you should get right if you got other similar things right - however I don't want to superimpose what I think is "similar" - I want the raw data to tell me knowledge combinations that the study population happens to know coincidentally

At this stage in the game, the program does not have to interpret the question, the data set shows if it was answered correctly or not.

I guess worst case scenario is download some stats packages and run the matrix offline periodically.

PS: Thank you for coming this far with me in the discussion -  Its further than I got with my work buddies!!
0
 
LVL 50

Expert Comment

by:Lowfatspread
ID: 36716240
help me

what value do you assign an individual question answer  to map into the "rating"  for the sample sql given so far?

you want to correlate a question to all the others...
and/or groups of questions?

0
 

Author Comment

by:Split_Pin
ID: 36716282
Playing with this yesterday, I plugged the field "Correct" into that query in place of rating - this field had either a 1 or 0 value, and it seemed to work.

What I think it was returning however was the overall correlation between each respondents (correlation between students, not the questions - or in the example from that site, the reviewers not the movie).

To start with, I would be thrilled just to get the correlation between every possible combination of questions. This by itself would let me arrange questions in order from strongest to weakest connection to existing student knowledge.

0
 
LVL 50

Expert Comment

by:Lowfatspread
ID: 36716341
ok i wasnt sure a 1/0 would work ... divide by zero etc ...  ( wondered about 1/-1 or 0.5/-0.5)

let  me see if i can do something... do you have a larger sample dataset available for testing (say 100-500 rows , 10-20 students ?) that you could post

my response would be a couple of hours at least <other commitments...;-)>
0
 

Author Comment

by:Split_Pin
ID: 36716459
I've attached a simple demo I made up yesterday, 10 Students answering 10 questions.
The actual responses are randomly right (1) or wrong (0), the questions have no actual text associated to them at all.

The data is in Table called Sample1

I also have 3 queries in here where I was trying to bend the pearson stuff to suit my needs (and I think I sort of make it show the best question for each student to revise, based on student to student correlation and identifying questions "I got wrong but similar students got right"... At least thats what I was taking a wild stab at.

If you can come up with anything based on the table that correlates question to question that would be awesome (no rush, this is a long term project I am only starting to finally devote time to - no doubt you are very busy too). I realise the sample set is not large enough to provide any reliable conclusions, but at this stage the mechanism is the important part, which I can then put into real testing with a meaningful topic.

Once again, thank you for coming this far with me!
Cheers, Split_Pin
 Sample-Testing-Result-Correlatio.mdb
0
 
LVL 50

Expert Comment

by:Lowfatspread
ID: 36716485
i dont have access any chance of a text/csv file?
0
 

Author Comment

by:Split_Pin
ID: 36716547
Sample1.txt

Here is an export to text - hope that is ok.
0
 

Author Comment

by:Split_Pin
ID: 36848780
Hi Lowfatspread,
I have started a new thread in the Maths and Science area.
Thanks for your help so far - if you do get a chance to play with the sample data please let me know how you get on.
Cheers, Adrian
0
 

Author Closing Comment

by:Split_Pin
ID: 36849329
Thanks for valuable input to the discussion - I have a long way to go with this!
Will check out prolog too!
0

Featured Post

Free Tool: Postgres Monitoring System

A PHP and Perl based system to collect and display usage statistics from PostgreSQL databases.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

JSON is being used more and more, besides XML, and you surely wanted to parse the data out into SQL instead of doing it in some Javascript. The below function in SQL Server can do the job for you, returning a quick table with the parsed data.
Ever needed a SQL 2008 Database replicated/mirrored/log shipped on another server but you can't take the downtime inflicted by initial snapshot or disconnect while T-logs are restored or mirror applied? You can use SQL Server Initialize from Backup…
Via a live example, show how to extract information from SQL Server on Database, Connection and Server properties
This videos aims to give the viewer a basic demonstration of how a user can query current session information by using the SYS_CONTEXT function

860 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question