Go Premium for a chance to win a PS4. Enter to Win


Correlating Exam Answers to each other

Posted on 2011-09-29
Medium Priority
Last Modified: 2012-05-12
Hello Mathematics Geniuses,

I would like to know a way to construct a dynamic correlation matrix (ideally in Sql Server) so that I can compare each exam question (the likelihood of it being answered correctly) in relation to every other question.

As responses will be stored in a SQL Server table, it would be great if somehow the correlation can be done in SQL Server too, so as more people complete the exam the dataset on which the correlations are based can grow and become more reliable.

The starting data set will be very simple, along the lines of

Student_ID    Question_ID   Correct(Y/N)   Timestamp

I am looking to find connections between pairs or larger combinations of questions, based only on the students ability to answer them correctly, thus avoiding superimposing assumptions about the relationships between different topics. The end result being that questions in exams conducted online can be ordered (dynamically) from easiest to hardest, based on responses made so far. It may also be used to skip questions that are likely to be correct and thus focus on other areas of greater challenge.

Apologies if that does not sound terribly clear and thanks for your time!

Question by:Split_Pin
  • 2
  • 2
LVL 39

Expert Comment

by:Aaron Tomosky
ID: 36853122
Does each student only take the test once? If not you need to add a testId to keep each test together.

Getting stats on correct vs incorrect to judge difficulty is easy, correlating across questions is the hard part. I'll think on it more...

Author Comment

ID: 36890403
Thats a valuable insight aarontomosky - I was thinking of using the minimum timestamp for each Student/Question combination to separate the tests - particularly if used for test-revise-retest where hopefully the student will move towards 100% correct. I am guessing the correlation I really want to know is based on the first time attempt only.

Maybe my use of the words "likelihood of it being answered correctly" is not even right - but in plain English, we could guess that if Mr Split_Pin knows what Cheese is, but does not know what Correlation is, then its "likely" he knows what Milk is, and accordingly unlikely he knows what the Pearson Coefficient is  (And perhaps there is no correlation at all between knowledge of dairy products and knowledge of stats).

The trick is, I don't want to invent any categories - rather I want to pour a whole bunch of questions into my database and have the students show me (through the stats) which ideas actually associate (postively or negatively).

Cheers, Split_Pin
LVL 39

Assisted Solution

by:Aaron Tomosky
Aaron Tomosky earned 1000 total points
ID: 36891910
I can roughly think of a way to do this, but it may be rough to keep it all in tsql code.
Start with q1, sum all the other correct answers here q1 is true. For the ones over x% correct we add "q1" to their column in another table.

Do this for each question and you will have a list of associated questions, maybe even giving extra weight somehow if pairs are related both ways.

Just to look at all the data in a square matrix, you could put .79q1 or something to show that 79% of people that got q1 right got this right.

Just an outline for now but The more I think on it the more it seems doable
LVL 37

Accepted Solution

TommySzalapski earned 1000 total points
ID: 36892402
The right way to do this (from a statistician point of view) would be to use logistic regression. Model a wrong answer as 0 and a correct answer as 1 and choose from any of the thousands of regression packages out there.
You could also use linear regression, but logistic is better when the dependent variable is binary (two options).

If you've never taken any stat courses and are worried at the amount of math involved (which isn't really that much) then you could just look at the percent that got the two right compared to the average amount that got both of any pair of quesions right.

One problem that you will have in either case is the age-old causation vs. correlation problems.
x and y are correlated so did x cause y or did y cause x or did some other factor cause both?

Author Comment

ID: 36895516
Greetings, hope you are both enjoying the weekend.
If we just focus for a moment on

just look at the percent that got the two right compared to the average amount that got both of any pair of questions right.

This is really sounding like something SQL could be good at [and yes the maths worries me :-) ]

If we join every response every other response by the same student (like a self join on a.Question_ID > b.Question_ID AND a.Student_ID = b.Student_ID) we would be mapping every pair (taking that 2 and 3 is same as 3 and 2, and we dont need to map it both ways). If we gave the pairs a result maybe based on 1 * 1 = 1 (both correct) we can then work out the % correct for any random pair and then seek combinations that exceed the average...

I guess by joining again to another version of the same table we could work out triplicates (all 3 out of 3 correct) too, and so on.

I think this method would hold up as well in cases where students have not all answered every question, say there are 100 questions and Frank answers 50, Jean answers 30 (some of which are the same as Frank and some of which are different) we can still cross map all the known pairs.

Please do let me know if I sound like I have completely lost the point here!!
Really appreciate your input. It sounds like aarontomovsky's idea could help to organise the end result into a table than can be used to select the next question to be asked.

Well I had better start prototyping - will split the points as you've both got me thinking along the right lines.
Cheers, Split_Pin


Featured Post

Important Lessons on Recovering from Petya

In their most recent webinar, Skyport Systems explores ways to isolate and protect critical databases to keep the core of your company safe from harm.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Complex Numbers are funny things.  Many people have a basic understanding of them, some a more advanced.  The confusion usually arises when that pesky i (or j for Electrical Engineers) appears and understanding the meaning of a square root of a nega…
We are taking giant steps in technological advances in the field of wireless telephony. At just 10 years since the advent of smartphones, it is crucial to examine the benefits and disadvantages that have been report to us.
This is a video describing the growing solar energy use in Utah. This is a topic that greatly interests me and so I decided to produce a video about it.
Although Jacob Bernoulli (1654-1705) has been credited as the creator of "Binomial Distribution Table", Gottfried Leibniz (1646-1716) did his dissertation on the subject in 1666; Leibniz you may recall is the co-inventor of "Calculus" and beat Isaac…

971 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question