I would like to know a way to construct a dynamic correlation matrix (ideally in Sql Server) so that I can compare each exam question (the likelihood of it being answered correctly) in relation to every other question.

As responses will be stored in a SQL Server table, it would be great if somehow the correlation can be done in SQL Server too, so as more people complete the exam the dataset on which the correlations are based can grow and become more reliable.

The starting data set will be very simple, along the lines of

Student_ID Question_ID Correct(Y/N) Timestamp

I am looking to find connections between pairs or larger combinations of questions, based only on the students ability to answer them correctly, thus avoiding superimposing assumptions about the relationships between different topics. The end result being that questions in exams conducted online can be ordered (dynamically) from easiest to hardest, based on responses made so far. It may also be used to skip questions that are likely to be correct and thus focus on other areas of greater challenge.

Apologies if that does not sound terribly clear and thanks for your time!

Does each student only take the test once? If not you need to add a testId to keep each test together.

Getting stats on correct vs incorrect to judge difficulty is easy, correlating across questions is the hard part. I'll think on it more...

0

Split_PinAuthor Commented:

Thats a valuable insight aarontomosky - I was thinking of using the minimum timestamp for each Student/Question combination to separate the tests - particularly if used for test-revise-retest where hopefully the student will move towards 100% correct. I am guessing the correlation I really want to know is based on the first time attempt only.

Maybe my use of the words "likelihood of it being answered correctly" is not even right - but in plain English, we could guess that if Mr Split_Pin knows what Cheese is, but does not know what Correlation is, then its "likely" he knows what Milk is, and accordingly unlikely he knows what the Pearson Coefficient is (And perhaps there is no correlation at all between knowledge of dairy products and knowledge of stats).

The trick is, I don't want to invent any categories - rather I want to pour a whole bunch of questions into my database and have the students show me (through the stats) which ideas actually associate (postively or negatively).

I can roughly think of a way to do this, but it may be rough to keep it all in tsql code.
Start with q1, sum all the other correct answers here q1 is true. For the ones over x% correct we add "q1" to their column in another table.

Do this for each question and you will have a list of associated questions, maybe even giving extra weight somehow if pairs are related both ways.

Just to look at all the data in a square matrix, you could put .79q1 or something to show that 79% of people that got q1 right got this right.

Just an outline for now but The more I think on it the more it seems doable

The right way to do this (from a statistician point of view) would be to use logistic regression. Model a wrong answer as 0 and a correct answer as 1 and choose from any of the thousands of regression packages out there.
You could also use linear regression, but logistic is better when the dependent variable is binary (two options).

If you've never taken any stat courses and are worried at the amount of math involved (which isn't really that much) then you could just look at the percent that got the two right compared to the average amount that got both of any pair of quesions right.

One problem that you will have in either case is the age-old causation vs. correlation problems.
x and y are correlated so did x cause y or did y cause x or did some other factor cause both?

0

Split_PinAuthor Commented:

Greetings, hope you are both enjoying the weekend.
If we just focus for a moment on

just look at the percent that got the two right compared to the average amount that got both of any pair of questions right.

This is really sounding like something SQL could be good at [and yes the maths worries me :-) ]

If we join every response every other response by the same student (like a self join on a.Question_ID > b.Question_ID AND a.Student_ID = b.Student_ID) we would be mapping every pair (taking that 2 and 3 is same as 3 and 2, and we dont need to map it both ways). If we gave the pairs a result maybe based on 1 * 1 = 1 (both correct) we can then work out the % correct for any random pair and then seek combinations that exceed the average...

I guess by joining again to another version of the same table we could work out triplicates (all 3 out of 3 correct) too, and so on.

I think this method would hold up as well in cases where students have not all answered every question, say there are 100 questions and Frank answers 50, Jean answers 30 (some of which are the same as Frank and some of which are different) we can still cross map all the known pairs.

Please do let me know if I sound like I have completely lost the point here!!
Really appreciate your input. It sounds like aarontomovsky's idea could help to organise the end result into a table than can be used to select the next question to be asked.

Well I had better start prototyping - will split the points as you've both got me thinking along the right lines.
Cheers, Split_Pin

0

Featured Post

Get more info about an IP address or domain name, such as organization, abuse contacts and geolocation.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Getting stats on correct vs incorrect to judge difficulty is easy, correlating across questions is the hard part. I'll think on it more...