Survey (Data Analysis)

Hello Math/Stats Experts:

I'm in the midst of developing a survey/questionnaire (for an academic project).  Before continuing the survey development, I'd like to get a better understanding of possible analytic methods (to apply once I have received the data).


Background on research:
- I have 8 independent variables (IV).  
- I have 1 dependent variable (DV).
- I'd like to prove there is (or is not) a positive/negative correlation between each independent variable and the dependent variable.
- I will NOT test any possible interactions between the independent variables.


Background on survey concept:
- For both independent variables and dependent variable, I will ask several (maybe 3) questions.
- Most (if not all) questions will use a Likert Scale (values of 1-5).   As far as I know, this makes it a "categorical" measurement scale (i.e., "interval" data).    [Let me know if you disagree]
- I will have n number of survey respondents (SR).


Current survey concept... let's say I have 27 questions... 3 for each of the 8 IVs... and 3 for the single DV.
[Btw, the Likert scale values below are completely made up... I just punched in numbers]

Q#     Response of SR_sub_1     Response of SR_sub_2     Response of SR_sub_n
1.     5                        5                        4
2.     4                        3                        5
3.     5                        4                        4
25.    3                        4                        4
26.    2                        3                        3
27.    4                        3                        4


Current concept of data analysis:
- Questions 1:3 pertain to independent variable #1.
- SR_sub_1 answered them as follows: 5, 4, 5
- SR_sub_2 answered them as follows: 5, 3, 4
- SR_sub_n answered them as follows: 4, 5, 4

- Questions 25:27 pertain to the dependent variable.
- SR_sub_1 answered them as follows: 3, 2, 4
- SR_sub_2 answered them as follows: 4, 3, 3
- SR_sub_n answered them as follows: 4, 3, 4

Now, I was thinking to use "index scores" (i.e., averages) for each.   If so, I'd have the following data:
- Index score for independent variable #1.
- SR_sub_1 = average of (5, 4, 5) = 4.67
- SR_sub_2 = average of (5, 3, 4) = 4.00
- SR_sub_n = average of (4, 5, 4) = 4.33

- Index score for dependent variable.
- SR_sub_1 = average of (3, 2, 4) = 3.00
- SR_sub_2 = average of (4, 3, 3) = 3.33
- SR_sub_n = average of (4, 3, 4) = 3.67

[Again, the survey respondents' "values" are merely made up right now... their individual values are not important at this moment.]
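
A minimal sketch of the index-score step described above, written in Python for illustration. The variable and function names (`iv1_answers`, `dv_answers`, `index_score`) are assumptions, and the numbers are the made-up values from this post.

```python
from statistics import mean

# Made-up Likert answers from the post: respondent -> answers to Q1..Q3 (IV #1)
iv1_answers = {"SR_sub_1": [5, 4, 5], "SR_sub_2": [5, 3, 4], "SR_sub_n": [4, 5, 4]}
# Respondent -> answers to Q25..Q27 (DV)
dv_answers = {"SR_sub_1": [3, 2, 4], "SR_sub_2": [4, 3, 3], "SR_sub_n": [4, 3, 4]}

def index_score(items):
    """Average the item answers into one index score, rounded to 2 decimals."""
    return round(mean(items), 2)

iv1_index = {sr: index_score(a) for sr, a in iv1_answers.items()}
dv_index = {sr: index_score(a) for sr, a in dv_answers.items()}

# One (IV index, DV index) pair per respondent: the points of a scatter plot
points = [(iv1_index[sr], dv_index[sr]) for sr in iv1_answers]
print(points)
```

Each respondent contributes exactly one point, so the scatter plot for one IV has n points, and there would be one such plot per IV.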

My questions:
- Is the approach of those "index scores" a valid one?
- If yes, do I simply plot the intersections of [4.67, 3.00] & [4.00, 3.33] & [4.33, 3.67] into a scatter plot?
- If so, what is the recommended statistical analysis (ANOVA, MDA, Chi-Square) to determine whether or not there is a positive/negative correlation (not causation) between the 8 independent variables and the dependent variable?

Thousand thanks in advance,
Fred Marshall (Principal) commented:
I'm sorry if the language I used was a bit foreign so let me say it another way.
First, let me say that I'm still fuzzy on your nomenclature so let me propose:
"subject" is a person answering questions.
"topic" is a context in which questions are asked.  I understand you have 8 of these and have decided that they are independent variables.  I further understand that you will have some number of questions per topic.
And, it appears, that the dependent variable is populated by the Answers to the questions??  I don't see any other.

You continue to suggest something like correlation and I believe you're on the right track there.  

So, I envision something like this:

For each subject/person you will ask some number of questions in a number of topic areas (8 topics).
For each subject/person and topic you will get a number of answers.
The number of answers will depend on the number of questions.

[At this point it may be important to understand what you're really doing]

Correlation and analysis of variance are two different things so I'm going to ignore the ANOVA comment for now.

Let's say that I ask 12 people 7 questions per topic and that there are 8 topics.  I get 672 answers which are arranged by person, question and topic.

In order to get rid of the personal variations, we combine the answers like this:
For each topic and for each question, we combine the answers for all persons.
Now we have a set of values by topic and question only.
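
A rough sketch of this combining step in Python. The data are random made-up Likert values, the sizes come from the 12 x 7 x 8 example above, and using the mean as the way to "combine" is an assumption (a sum would work the same way).

```python
import random
from statistics import mean

random.seed(0)  # reproducible made-up data
N_PERSONS, N_TOPICS, N_QUESTIONS = 12, 8, 7

# answers[p][t][q]: made-up Likert answer (1-5) of person p to question q of topic t
answers = [[[random.randint(1, 5) for _ in range(N_QUESTIONS)]
            for _ in range(N_TOPICS)]
           for _ in range(N_PERSONS)]

total_answers = N_PERSONS * N_TOPICS * N_QUESTIONS  # 12 * 8 * 7 = 672

# Combine over persons (here: a mean) so one value remains per topic and question
combined = [[mean(answers[p][t][q] for p in range(N_PERSONS))
             for q in range(N_QUESTIONS)]
            for t in range(N_TOPICS)]

print(total_answers, len(combined), len(combined[0]))
```

After this step the person dimension is gone and `combined[t][q]` holds one value per topic and question.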

If you want to know how well the answers on one topic correlate with the answers for another topic then you compute the correlation coefficient between each pair of topics.
(I'm assuming that you are keeping the values in the same order according to the question numbers).

Pick topics two at a time.
Multiply the values for each question together.  Add the results.  This is a form of the correlation coefficient: a measure of the correlation between answers for Topic 1 and Topic 2.
Repeat this process for:
T1 & T3
T1 & T4
T1 & T8

T2 & T3
T2 & T4
T2 & T8

T3 & T4
T3 & T5
T7 & T8
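
The pairing and multiply-and-add steps above could be sketched like this in Python. The three topics and their values are made up, and the names `combined` and `sum_of_products` are only for illustration. Note this is the raw (unnormalized) sum of products, not yet a coefficient between -1 and 1.

```python
from itertools import combinations

# Made-up combined values: topic -> one value per question (same question order)
combined = {
    "T1": [4.2, 3.1, 3.8],
    "T2": [3.9, 2.7, 4.0],
    "T3": [2.5, 3.6, 3.0],
}

def sum_of_products(x, y):
    """Multiply the values question by question, then add the results."""
    if len(x) != len(y):
        raise ValueError("both topics need the same number of values")
    return sum(a * b for a, b in zip(x, y))

# Every unordered pair of topics: (T1, T2), (T1, T3), (T2, T3)
pair_scores = {(a, b): sum_of_products(combined[a], combined[b])
               for a, b in combinations(combined, 2)}

# With all 8 topics the same loop would visit 28 pairs
n_pairs_for_8_topics = len(list(combinations(range(8), 2)))
```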

Here the inferences (i.e. the correlations) are between the "independent variables".  The "dependent variables" are simply the values you're using in the computations.  I don't see how there can be an inference between the questions and answers without comparing other answers.

That seems pretty simple and I don't know how to make it simpler.

Going back to the "design of experiment" here .... I think it's important to choose a number of questions that will generate enough data points to give you reasonable results. But not knowing the topics nor questions it's hard to be more specific.
Also, I think the implication here is that the questions are all the same across the topics and are in the same order per topic because the answers need to be arranged or "indexed" so the correlations will make any sense.

Also, you may want to remove the mean values after you combine the subjects' answer data and before doing the correlations.  That way you can get negative correlations, and a result of zero will mean the estimate is for no correlation.
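
To illustrate why the mean removal matters, here is a small Python sketch. It goes one step further than the text and also normalizes the result, which gives the usual Pearson correlation coefficient; the function name `pearson` and the two series are made up for the example.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation: remove the means, then normalize the sum of products."""
    mx, my = mean(x), mean(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den  # fails with division by zero if either series is constant

rising = [1, 2, 3, 4, 5]
falling = [5, 4, 3, 2, 1]

# Without mean removal the sum of products is positive (35),
# even though the two series move in opposite directions
raw = sum(a * b for a, b in zip(rising, falling))

# With mean removal and normalization the result is -1.0
r = pearson(rising, falling)
```

Because Likert values are all positive, the raw sum of products can never be negative, which is why the means have to come out first.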
Fred Marshall (Principal) commented:
Let's see if I can paraphrase the approach:

One approach in notation:
A(s,q,i,d) means Answer according to indices s=subject index, q=question index, i=independent variable index, d=dependent variable index.

So, you have a 4-dimensional space of answer values.

A(1,1,1,1)  to A(n,3,8,3).

I don't think this is what's needed for a couple of reasons:

1) Any question regarding the dependent variable must be in the context of at least the one independent variable or it's out of context.  Isn't that right?

2) A question about an independent variable has no meaning without the dependent variable.  This may be the same as #1 said another way.

You might consider the following:

n subjects

8 independent variables of q values each
(where you should ask if q values reasonably represents the independent variable span)

1 dependent variable of m values
(where you should ask if m values reasonably represents the dependent variable span)

Then each question or test would be posed GIVEN each of the m dependent variable values.
So, there will be m tests per independent variable.

That's also a total of nxqxm questions .... or "tests"

The notation might be:
Q(s,i,d) > A(s,i,d)
where the matrix is nxqxm.
Presumably you will want to combine the subject data into a single test result.  So, I would be thinking along the lines of:

A(1,i,d) + A(2,i,d) + ......... + A(n,i,d) = P(i,d)
which is the sum of all subjects for EACH independent variable value and EACH dependent variable value.

and, of course, you could use the square root of the sum of the squares or some such measure instead of a simple sum - depending on what you want / need.

Now you have a 2-dimensional matrix that's pretty easy to handle / envision.
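
A small Python sketch of collapsing the subject dimension, assuming made-up answers and hypothetical sizes (3 subjects, 2 independent-variable values, 2 dependent-variable values). Both the simple sum and the root-sum-of-squares variant mentioned above are shown.

```python
from math import sqrt

# A[s][i][d]: made-up answers, indexed by subject s, independent-variable
# value i and dependent-variable value d (sizes here are hypothetical)
A = [
    [[1, 2], [3, 4]],
    [[2, 2], [1, 5]],
    [[4, 1], [3, 3]],
]
n, rows, m = len(A), len(A[0]), len(A[0][0])

# P[i][d] = A[0][i][d] + A[1][i][d] + ... + A[n-1][i][d]
P = [[sum(A[s][i][d] for s in range(n)) for d in range(m)]
     for i in range(rows)]

# Alternative measure: square root of the sum of the squares over subjects
P_rss = [[sqrt(sum(A[s][i][d] ** 2 for s in range(n))) for d in range(m)]
         for i in range(rows)]

print(P)
```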
The tests were done with alignment of the columns by stating a specific value for the independent variable in each case.
You said:
I'd like to prove there is (or is not) a positive/negative correlation between each independent variable and the dependent variable.

This suggests correlating the dependent variable vector (which I've not made notation for yet) and each independent variable vector.

So, we have I(1, ....,m) which is the vector of independent variable values.

Compute the correlation of I with A in order to get the correlation coefficient for each independent variable.  Something like:

C(i) = sum over d = 1..m of [ I(i) * P(i,d) ],   for i = 1..8

I've not been too careful here and I'm a bit worried about mixing up the number of variables and the number of VALUES of each variable that's being used.
But, I hope it's a nudge in the right direction.
ExpExchHelp (Author) commented:

Thanks for the feedback... wow... lots of stuff that I need to digest.    The coding of subjects and "vectors" seems to be related too much to a math-subject.

Instead, I'm trying to get a clearer understanding about statistical analyses.

Let me recap:
- I have 8 independent variables (you may even want to call them "subjects/topics")
- There will be a number of questions per subject/topic that each participant will answer.
- I figured that I take the average of all subject-related questions... so I end up with 8 index scores (one per subject/topic... not to be confused with subject/survey participant)
- I'll also ask a few questions on the dependent variable.   The questions under that category/DV will also give me an index score for each survey participant.
- Then, I was hoping to apply One-Way ANOVA or any other inferential statistics method to draw conclusions about the 8 topics (IVs) and the single dependent variable (DV).

Point is... let's keep it simple.    Any suggestion on the appropriate analysis methodology that would allow me to determine inferences between the IVs and DV?

ExpExchHelp (Author) commented:
Great stuff – all makes sense.   Based on the [T1 & T2], [T1 & T3], ... [T1 & T8]; [T2 & T3], it appears that these measure all interactions among the independent variables.  

Here’s where I get lucky.   Based on my conceptual framework (which has been approved by the committee), I will NOT measure any interactions among the independent variables.    You may wonder why… essentially, the scope of my research topic is already large enough that those interactions do not lend themselves for inclusion.

Assuming the dependent variable (DV) is topic #9, I would only measure [T1 & T9], [T2 & T9], … [T8 & T9].
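
A sketch of what "only the eight IV-vs-DV pairs" could look like, in Python rather than SPSS. Everything here is an assumption for illustration: the sample size, the random made-up index scores, and the choice of the Pearson coefficient as the measure.

```python
import random
from math import sqrt
from statistics import mean

random.seed(1)      # reproducible made-up data
N_RESPONDENTS = 30  # hypothetical sample size

# index["Tk"][r]: made-up index score (mean of 3 Likert items) of respondent r
# on topic k, where T1..T8 are the IVs and T9 is the DV
index = {
    f"T{t}": [mean(random.randint(1, 5) for _ in range(3))
              for _ in range(N_RESPONDENTS)]
    for t in range(1, 10)
}

def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    mx, my = mean(x), mean(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    return num / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Only the eight IV-vs-DV pairs; no IV-vs-IV pairs are computed
iv_dv_r = {f"T{t} & T9": pearson(index[f"T{t}"], index["T9"])
           for t in range(1, 9)}

for pair, r in iv_dv_r.items():
    print(pair, round(r, 3))
```

Since the data are random, the coefficients printed here carry no meaning; the point is only the shape of the computation, with one coefficient per IV.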

So, for the sake of argument, if I were to measure a car’s “performance”  – my dependent variable – my independent variables would be, e.g., “engine” [T1]; “tires” [T2], etc.   Obviously, in real life, I’d probably have to look at a variety/interactions, but again, for my particular research this won’t apply.

Sending out the survey, I will get (hopefully) sufficient # of survey responses.    Using SPSS, I should end up with raw data in matrix format where each question becomes a variable (column) and each survey participant/subject’s answers are added into the cases (rows).  
Again, let’s say I’ll ask 3 questions per topic (8 independent variables).   That would mean I have 24 variables for them.   Also, I’ll have a number of questions (let’s say 6) for the single dependent variable.   So, I was figuring that variables/columns 25:30 are allocated accordingly.

If that’s my working assumption, I need to perform some form of analysis on questions 1:3 in respect to questions 25:30.   Similarly, I want to analyze questions 4:6 in respect to questions 25:30.

My familiarity with SPSS is “dangerous”… I’d like to make sure that I perform initial exploratory analysis correctly.    I hope to somehow also verify (internal) “validity” and “reliability” across each/all survey respondents.
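
On the reliability point: one commonly used measure of internal consistency for a group of questions that are averaged into one index is Cronbach's alpha. It is not mentioned elsewhere in this thread, so treat the following Python sketch as a suggestion only; the answers are made up and the function name `cronbach_alpha` is an assumption.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for k questions.

    items: one list of answers per question, all covering the same
    respondents in the same order.
    """
    k = len(items)
    # Per-respondent total over the k questions
    totals = [sum(row) for row in zip(*items)]
    item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_var / variance(totals))

# Made-up answers of five respondents to the three questions of one topic
q1 = [5, 5, 4, 3, 2]
q2 = [4, 5, 4, 2, 2]
q3 = [5, 4, 3, 3, 1]

alpha = cronbach_alpha([q1, q2, q3])
print(round(alpha, 3))
```

A value close to 1 suggests the questions of a topic move together, which supports averaging them into a single index score.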

Final thoughts/recommendations?

Fred Marshall (Principal) commented:
I'm in quite a rush right now..... but if you would just not use "independent variable" and "dependent variable" in the discussion - save that for later when you have what you *want* lined up.  I have to run... back this afternoon.
ExpExchHelp (Author) commented:
I'm not understanding your last comments.  

I clearly indicated (8) relationships between each "independent variable" and the single dependent variable.
Fred Marshall (Principal) commented:
OK.  Now that I'm back and have a bit more time.....
I only meant that "independent/dependent" variable labeling is sort of buried in one's approach and may not help illuminate the issues because that, by itself, can be a source of confusion.  It's good that you've provided an example with real variables.

I can't tell if your so-called "dependent variable" isn't just another variable (which you might call "independent") ... that's my problem here.  I sense this may be a confusion.

Since I imagine that you're doing all this numerically then I envision that the data sets have to match in number - because you're going to do a single-point correlation to get a correlation coefficient.  You don't get that if the number of points in each are different.  Said another way, having more points in one than the other just throws out the extra points.  Correlation just multiplies point-by-point and then sums the products.  While this may seem a detail, it may have a lot to do with how you think about the data sets you will generate.
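
A tiny Python illustration of the "extra points get thrown out" remark, using made-up numbers. Python's built-in `zip` happens to behave exactly this way, stopping at the shorter series.

```python
x = [1, 2, 3, 4, 5]  # five points
y = [2, 4, 6]        # only three points

# zip() stops at the shorter list, so x[3] and x[4] are silently dropped
used_pairs = list(zip(x, y))

# Point-by-point products, then the sum: 1*2 + 2*4 + 3*6 = 28
sum_products = sum(a * b for a, b in used_pairs)
print(len(used_pairs), sum_products)
```

In Python 3.10 and later, `zip(x, y, strict=True)` raises a `ValueError` instead of dropping points, which is a safer default when the two data sets are supposed to match.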
ExpExchHelp (Author) commented:
Ok... thanks for the info.   Closing this post now.