Solved

Simple (?) Correlation Matrix in SQL

Posted on 2011-09-28
14
693 Views
Last Modified: 2012-06-27
Hello Experts,

I have responses to a quiz sitting in a table something like

Student_ID   Question_ID   Correct(Y/N)

I would like to correlate the likelihood of a question being answered correctly to all of the other questions and their likelihood of being answered correctly. My friend recently mumbled something about "correlation matrix" and web reading makes the pearson coefficient sound useful but I don't know much about statistics.

In the long run, I want to be able to ask a series of questions about any topic which are then graded either Correct or Incorrect (no middle ground) and then be able to make predictions about other questions in the set a person may be likely to get right or wrong (e.g. you correctly answered that bread is made of flour, so perhaps you also know that cakes are made of flour ... vice versa I do not know what water is, so its unlikely I know what ice is).

Rather than me making assumptions about topics and concepts that go together, I would like to be dynamically gathering data and understanding how students connect different concepts (and yes hoping to find lots of surprising connections!).

In a perfect world I would like to do this all in SQL Server itself without having to move data back and forth into statistical packages - ideally the model can keep building based on new inputs and adapt over time.


0
Comment
Question by:Split_Pin
  • 7
  • 7
14 Comments
 
LVL 50

Expert Comment

by:Lowfatspread
Comment Utility
not clear how you'd do this in pure t-sql...

however this (to me) sounds like neural net processing , or Expert Systems processing to some extent.

you may also wish to look into Baye's theorem (new scientist july 2011), i don't know if its related to the pearson coefficient...


good luck
0
 
LVL 50

Expert Comment

by:Lowfatspread
Comment Utility
you may wish to assign the question to some maths/statistics topic areas or seek moderator assistance...
0
 

Author Comment

by:Split_Pin
Comment Utility
Thank you lowfatspread

Yesterday I found an interesting article about how to apply the pearson coefficient in SQL which has encouraged me though I am not 100% sure I am doing it right...
http://www.vanheusden.com/misc/pearson.php

Hoping that by trolling here I might hit a stats person who dabbles in SQL or vice versa.
I will check out the article you suggested on Bayes theorem.
Cheers, Split_Pin

0
 
LVL 50

Expert Comment

by:Lowfatspread
Comment Utility
hmmm

not sure how you translate your yes/no values for the calculations ....

and the sql shown , to me is slightly suspect as the left outer join used is then ignored by not allowing for nulls in the where clause... so why wasn't an inner join specified in the first place...

i'm still unclear as to how you will distinguish between the flour and water based groups ... or is that supposed to be external to the process?
 

SELECT  
        user1, user2,
        ((psum - (sum1 * sum2 / n)) / sqrt((sum1sq - pow(sum1, 2.0) / n) * (sum2sq - pow(sum2, 2.0) / n))) AS r,
        n
FROM
        (SELECT 
                n1.user AS user1,
                n2.user AS user2,
                SUM(n1.rating) AS sum1,
                SUM(n2.rating) AS sum2,
                SUM(n1.rating * n1.rating) AS sum1sq,
                SUM(n2.rating * n2.rating) AS sum2sq,
                SUM(n1.rating * n2.rating) AS psum,
                COUNT(*) AS n
        FROM
                testdata AS n1
	LEFT JOIN
		testdata AS n2
	ON
		n1.movie = n2.movie
        WHERE   
                n1.user > n2.user
	GROUP BY
		n1.user, n2.user) AS step1
ORDER BY
        r DESC,
        n DESC

Open in new window

0
 
LVL 50

Accepted Solution

by:
Lowfatspread earned 500 total points
Comment Utility
you correctly answered that bread is made of flour, so perhaps you also know that cakes are made of flour ... vice versa I do not know what water is, so its unlikely I know what ice is).

you maybe better off considering a language like prolog which would allow you to state you facts  and assist in inferring further relationships...
0
 

Author Comment

by:Split_Pin
Comment Utility
I am not saying that prolog doesn't sound totally awesome but ...

Say my data set looks like this (I am student number 1, and possibly from another planet as my general knowledge seems rather limited).

Student_ID        Question_ID       Question Text                                               My Response             Correct(Y/N)
1                        1                       Bread is made from flour true or false              TRUE                      1
1                        2                       Cakes are made from flour true or false      TRUE                      1
1                        3                       Water is flammable true or false                      TRUE                      0
1                        4                       Ice is a type of gas true or false                      TRUE                      0

What I am hoping for is a SQL function/procedure/query that will calculate the correlation between every question. As more students try the quiz, we might find there is a stronger correlation between certain pairs of questions (or maybe larger combinations too) being answered correctly (or incorrectly) than others.

Incidentally a friend tells me something like this is used to evaluate exam responses to detect cheating (on the basis that there are certain things you should get right if you got other similar things right - however I don't want to superimpose what I think is "similar" - I want the raw data to tell me knowledge combinations that the study population happens to know coincidentally

At this stage in the game, the program does not have to interpret the question, the data set shows if it was answered correctly or not.

I guess worst case scenario is download some stats packages and run the matrix offline periodically.

PS: Thank you for coming this far with me in the discussion -  Its further than I got with my work buddies!!
0
 
LVL 50

Expert Comment

by:Lowfatspread
Comment Utility
help me

what value do you assign an individual question answer  to map into the "rating"  for the sample sql given so far?

you want to correlate a question to all the others...
and/or groups of questions?

0
Comprehensive Backup Solutions for Microsoft

Acronis protects the complete Microsoft technology stack: Windows Server, Windows PC, laptop and Surface data; Microsoft business applications; Microsoft Hyper-V; Azure VMs; Microsoft Windows Server 2016; Microsoft Exchange 2016 and SQL Server 2016.

 

Author Comment

by:Split_Pin
Comment Utility
Playing with this yesterday, I plugged the field "Correct" into that query in place of rating - this field had either a 1 or 0 value, and it seemed to work.

What I think it was returning however was the overall correlation between each respondents (correlation between students, not the questions - or in the example from that site, the reviewers not the movie).

To start with, I would be thrilled just to get the correlation between every possible combination of questions. This by itself would let me arrange questions in order from strongest to weakest connection to existing student knowledge.

0
 
LVL 50

Expert Comment

by:Lowfatspread
Comment Utility
ok i wasnt sure a 1/0 would work ... divide by zero etc ...  ( wondered about 1/-1 or 0.5/-0.5)

let  me see if i can do something... do you have a larger sample dataset available for testing (say 100-500 rows , 10-20 students ?) that you could post

my response would be a couple of hours at least <other commitments...;-)>
0
 

Author Comment

by:Split_Pin
Comment Utility
I've attached a simple demo I made up yesterday, 10 Students answering 10 questions.
The actual responses are randomly right (1) or wrong (0), the questions have no actual text associated to them at all.

The data is in Table called Sample1

I also have 3 queries in here where I was trying to bend the pearson stuff to suit my needs (and I think I sort of make it show the best question for each student to revise, based on student to student correlation and identifying questions "I got wrong but similar students got right"... At least thats what I was taking a wild stab at.

If you can come up with anything based on the table that correlates question to question that would be awesome (no rush, this is a long term project I am only starting to finally devote time to - no doubt you are very busy too). I realise the sample set is not large enough to provide any reliable conclusions, but at this stage the mechanism is the important part, which I can then put into real testing with a meaningful topic.

Once again, thank you for coming this far with me!
Cheers, Split_Pin
 Sample-Testing-Result-Correlatio.mdb
0
 
LVL 50

Expert Comment

by:Lowfatspread
Comment Utility
i dont have access any chance of a text/csv file?
0
 

Author Comment

by:Split_Pin
Comment Utility
Sample1.txt

Here is an export to text - hope that is ok.
0
 

Author Comment

by:Split_Pin
Comment Utility
Hi Lowfatspread,
I have started a new thread in the Maths and Science area.
Thanks for your help so far - if you do get a chance to play with the sample data please let me know how you get on.
Cheers, Adrian
0
 

Author Closing Comment

by:Split_Pin
Comment Utility
Thanks for valuable input to the discussion - I have a long way to go with this!
Will check out prolog too!
0

Featured Post

How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

Join & Write a Comment

When you hear the word proxy, you may become apprehensive. This article will help you to understand Proxy and when it is useful. Let's talk Proxy for SQL Server. (Not in terms of Internet access.) Typically, you'll run into this type of problem w…
Ever wondered why sometimes your SQL Server is slow or unresponsive with connections spiking up but by the time you go in, all is well? The following article will show you how to install and configure a SQL job that will send you email alerts includ…
This video shows, step by step, how to configure Oracle Heterogeneous Services via the Generic Gateway Agent in order to make a connection from an Oracle session and access a remote SQL Server database table.
Via a live example combined with referencing Books Online, show some of the information that can be extracted from the Catalog Views in SQL Server.

728 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

11 Experts available now in Live!

Get 1:1 Help Now