Solved

Simple (?) Correlation Matrix in SQL

Posted on 2011-09-28
14
705 Views
Last Modified: 2012-06-27
Hello Experts,

I have responses to a quiz sitting in a table something like

Student_ID   Question_ID   Correct(Y/N)

I would like to correlate the likelihood of a question being answered correctly to all of the other questions and their likelihood of being answered correctly. My friend recently mumbled something about "correlation matrix" and web reading makes the pearson coefficient sound useful but I don't know much about statistics.

In the long run, I want to be able to ask a series of questions about any topic which are then graded either Correct or Incorrect (no middle ground) and then be able to make predictions about other questions in the set a person may be likely to get right or wrong (e.g. you correctly answered that bread is made of flour, so perhaps you also know that cakes are made of flour ... vice versa I do not know what water is, so its unlikely I know what ice is).

Rather than me making assumptions about topics and concepts that go together, I would like to be dynamically gathering data and understanding how students connect different concepts (and yes hoping to find lots of surprising connections!).

In a perfect world I would like to do this all in SQL Server itself without having to move data back and forth into statistical packages - ideally the model can keep building based on new inputs and adapt over time.


0
Comment
Question by:Split_Pin
  • 7
  • 7
14 Comments
 
LVL 50

Expert Comment

by:Lowfatspread
ID: 36715719
not clear how you'd do this in pure t-sql...

however this (to me) sounds like neural net processing , or Expert Systems processing to some extent.

you may also wish to look into Baye's theorem (new scientist july 2011), i don't know if its related to the pearson coefficient...


good luck
0
 
LVL 50

Expert Comment

by:Lowfatspread
ID: 36715732
you may wish to assign the question to some maths/statistics topic areas or seek moderator assistance...
0
 

Author Comment

by:Split_Pin
ID: 36715995
Thank you lowfatspread

Yesterday I found an interesting article about how to apply the pearson coefficient in SQL which has encouraged me though I am not 100% sure I am doing it right...
http://www.vanheusden.com/misc/pearson.php

Hoping that by trolling here I might hit a stats person who dabbles in SQL or vice versa.
I will check out the article you suggested on Bayes theorem.
Cheers, Split_Pin

0
PRTG Network Monitor: Intuitive Network Monitoring

Network Monitoring is essential to ensure that computer systems and network devices are running. Use PRTG to monitor LANs, servers, websites, applications and devices, bandwidth, virtual environments, remote systems, IoT, and many more. PRTG is easy to set up & use.

 
LVL 50

Expert Comment

by:Lowfatspread
ID: 36716050
hmmm

not sure how you translate your yes/no values for the calculations ....

and the sql shown , to me is slightly suspect as the left outer join used is then ignored by not allowing for nulls in the where clause... so why wasn't an inner join specified in the first place...

i'm still unclear as to how you will distinguish between the flour and water based groups ... or is that supposed to be external to the process?
 

SELECT  
        user1, user2,
        ((psum - (sum1 * sum2 / n)) / sqrt((sum1sq - pow(sum1, 2.0) / n) * (sum2sq - pow(sum2, 2.0) / n))) AS r,
        n
FROM
        (SELECT 
                n1.user AS user1,
                n2.user AS user2,
                SUM(n1.rating) AS sum1,
                SUM(n2.rating) AS sum2,
                SUM(n1.rating * n1.rating) AS sum1sq,
                SUM(n2.rating * n2.rating) AS sum2sq,
                SUM(n1.rating * n2.rating) AS psum,
                COUNT(*) AS n
        FROM
                testdata AS n1
	LEFT JOIN
		testdata AS n2
	ON
		n1.movie = n2.movie
        WHERE   
                n1.user > n2.user
	GROUP BY
		n1.user, n2.user) AS step1
ORDER BY
        r DESC,
        n DESC

Open in new window

0
 
LVL 50

Accepted Solution

by:
Lowfatspread earned 500 total points
ID: 36716061
you correctly answered that bread is made of flour, so perhaps you also know that cakes are made of flour ... vice versa I do not know what water is, so its unlikely I know what ice is).

you maybe better off considering a language like prolog which would allow you to state you facts  and assist in inferring further relationships...
0
 

Author Comment

by:Split_Pin
ID: 36716151
I am not saying that prolog doesn't sound totally awesome but ...

Say my data set looks like this (I am student number 1, and possibly from another planet as my general knowledge seems rather limited).

Student_ID        Question_ID       Question Text                                               My Response             Correct(Y/N)
1                        1                       Bread is made from flour true or false              TRUE                      1
1                        2                       Cakes are made from flour true or false      TRUE                      1
1                        3                       Water is flammable true or false                      TRUE                      0
1                        4                       Ice is a type of gas true or false                      TRUE                      0

What I am hoping for is a SQL function/procedure/query that will calculate the correlation between every question. As more students try the quiz, we might find there is a stronger correlation between certain pairs of questions (or maybe larger combinations too) being answered correctly (or incorrectly) than others.

Incidentally a friend tells me something like this is used to evaluate exam responses to detect cheating (on the basis that there are certain things you should get right if you got other similar things right - however I don't want to superimpose what I think is "similar" - I want the raw data to tell me knowledge combinations that the study population happens to know coincidentally

At this stage in the game, the program does not have to interpret the question, the data set shows if it was answered correctly or not.

I guess worst case scenario is download some stats packages and run the matrix offline periodically.

PS: Thank you for coming this far with me in the discussion -  Its further than I got with my work buddies!!
0
 
LVL 50

Expert Comment

by:Lowfatspread
ID: 36716240
help me

what value do you assign an individual question answer  to map into the "rating"  for the sample sql given so far?

you want to correlate a question to all the others...
and/or groups of questions?

0
 

Author Comment

by:Split_Pin
ID: 36716282
Playing with this yesterday, I plugged the field "Correct" into that query in place of rating - this field had either a 1 or 0 value, and it seemed to work.

What I think it was returning however was the overall correlation between each respondents (correlation between students, not the questions - or in the example from that site, the reviewers not the movie).

To start with, I would be thrilled just to get the correlation between every possible combination of questions. This by itself would let me arrange questions in order from strongest to weakest connection to existing student knowledge.

0
 
LVL 50

Expert Comment

by:Lowfatspread
ID: 36716341
ok i wasnt sure a 1/0 would work ... divide by zero etc ...  ( wondered about 1/-1 or 0.5/-0.5)

let  me see if i can do something... do you have a larger sample dataset available for testing (say 100-500 rows , 10-20 students ?) that you could post

my response would be a couple of hours at least <other commitments...;-)>
0
 

Author Comment

by:Split_Pin
ID: 36716459
I've attached a simple demo I made up yesterday, 10 Students answering 10 questions.
The actual responses are randomly right (1) or wrong (0), the questions have no actual text associated to them at all.

The data is in Table called Sample1

I also have 3 queries in here where I was trying to bend the pearson stuff to suit my needs (and I think I sort of make it show the best question for each student to revise, based on student to student correlation and identifying questions "I got wrong but similar students got right"... At least thats what I was taking a wild stab at.

If you can come up with anything based on the table that correlates question to question that would be awesome (no rush, this is a long term project I am only starting to finally devote time to - no doubt you are very busy too). I realise the sample set is not large enough to provide any reliable conclusions, but at this stage the mechanism is the important part, which I can then put into real testing with a meaningful topic.

Once again, thank you for coming this far with me!
Cheers, Split_Pin
 Sample-Testing-Result-Correlatio.mdb
0
 
LVL 50

Expert Comment

by:Lowfatspread
ID: 36716485
i dont have access any chance of a text/csv file?
0
 

Author Comment

by:Split_Pin
ID: 36716547
Sample1.txt

Here is an export to text - hope that is ok.
0
 

Author Comment

by:Split_Pin
ID: 36848780
Hi Lowfatspread,
I have started a new thread in the Maths and Science area.
Thanks for your help so far - if you do get a chance to play with the sample data please let me know how you get on.
Cheers, Adrian
0
 

Author Closing Comment

by:Split_Pin
ID: 36849329
Thanks for valuable input to the discussion - I have a long way to go with this!
Will check out prolog too!
0

Featured Post

Best Practices: Disaster Recovery Testing

Besides backup, any IT division should have a disaster recovery plan. You will find a few tips below relating to the development of such a plan and to what issues one should pay special attention in the course of backup planning.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Everyone has problem when going to load data into Data warehouse (EDW). They all need to confirm that data quality is good but they don't no how to proceed. Microsoft has provided new task within SSIS 2008 called "Data Profiler Task". It solve th…
For both online and offline retail, the cross-channel business is the most recent pattern in the B2C trade space.
This video shows how to set up a shell script to accept a positional parameter when called, pass that to a SQL script, accept the output from the statement back and then manipulate it in the Shell.
Via a live example, show how to backup a database, simulate a failure backup the tail of the database transaction log and perform the restore.

813 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

18 Experts available now in Live!

Get 1:1 Help Now