Link to home
Start Free TrialLog in
Avatar of Steven Welton
Steven Welton

asked on

Cleaning data in Excel

I have a large dataset (30K rows) generated from an access database, which is imported into an Excel spreadsheet on a quarterly basis.  They are a series of 24 questions from multiple surveys in the set.  All of the answers are "Yes", "No", or NA.

Within the question set there are internal logic constructions, like  If question 2 is No, Questions 3 and 4 MUST be NA, or if question 2 is Yes, then Question 3 MAY be Yes and then Question 4 Must be NA, OR Question 3 MAY be No, and then Question 4 MUST be No.

And so it goes.

I've been attempting to develop a VBA or Macro to automate this cleaning but am struggling to think through how to structure this so it goes through beginning to end of all of the surveys and evaluating the question sets within the survey numbers.

Any thoughts on this would be greatly appreciated.
Avatar of Dustin Saunders
Dustin Saunders
Flag of United States of America image

Got a sample file?
Avatar of Steven Welton
Steven Welton

ASKER

Sure.  Attached is a small sample file.
C--Users-swelton-Documents-Data-Cle.xlsx
Ok, so using Respondent One as an example, what should the output look like?  Can you mock up what that your expected results would be?
If Question 1 is No, then Question 2 Must be No and 3 and 4 are NA
If Question 1 is Yes, then Question 2 may be yes or no.
Question 1 cannot be NA
If Question 2 is No, then 3 and 4 Must be NA
If Question 2 is Yes, then 3 maybe be Yes, No, or NA
If Question 3 is Yes, then Question 4 MUST be NA
if Question 3 is No, then Question 4 MUST be No
If Question 3 is NA, then 4 MUST be yes

Question 5 Must be Yes or No

The rest of the set is similar in scope of multiple options based on other responses.
I still don't understand what you need to accomplish.  What does 'clean up' mean? :

If the answers don't match the criteria you want to update them as you mention here?

Or are you only looking to select respondents who meet this criteria?
Cleaning up means automating the analysis of the scoring to make sure the responses follow the above scoring logic.  Currently it's being done manually using a Pivot Table and takes hours to clean.  So, the analysis would identify the exceptions that need to be reviewed and fixed, or potentially fix.
You should include such logic in SQL query you run in access to export XLSX
Probably you need a structure:
RESPONDENT:answer1 answer2 ... etc
ASKER CERTIFIED SOLUTION
Avatar of Dustin Saunders
Dustin Saunders
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Thanks for your input.  I appreciate it.
This is not something you would ever do in Excel and I'm not sure why the application forms where the data is input doesn't validate at that point.

Do you currently have documentation that describes the rules?  If so, please post it.

If you can codify the rules and create tables that will help you to apply them, much of this can be done with queries.  If you cannot create a schema that lets you tabulate the rules, then you are reduced to VBA code.  Or, you might be able to do some of each and that will reduce the rules that need to be hard-coded.  Hard-coding the rules is the least desirable solution since it will be difficult to test and difficult to maintain.  For the parts that you need to drop into VBA to apply, use Case statements rather than nested If's.

Rather than trying to create a top-down schema to apply the rules, think about the possibility of looking up one level since that might simplify the task dramatically.  The bottom up approach will actually be fairly easy as long as any given question is controlled only by a single other question.  So - bottom up, question 45 is controlled by question 40 but top down question 40 controls questions 41-52, 99, and 106-109.  So the top down schema will be more complex.
Yes, hard coding the rules is not ideal--  what you could do is create a 'AnswerKey' table that contains the fixes for easy editing.

Something like this with 4 columns:

AnswerNumber,CheckNumber,CheckValue,ForcedAnswer

Where AnswerNumber is the current question you are on, CheckNumber is the question you want to reference, CheckValue is the conditional (if the value is this) then ForcedAnswer is what you change it to.

AnswerNumber,CheckNumber,CheckValue,ForcedAnswer
2,1,aYes,bNo
3,2,cNA,bNo

Open in new window


The above would say "for question 2, if question 1 was answered 'aYes' force this to be 'bNo'" then "for question 3, if question 2 was answered 'cNA' force the answer to be 'bNo'"

You can edit that 'AnswerKey' table then with another form in your Access database.
It's an interesting concept.  The challenge we have with this particular application is that the surveys (there are over 80 distinct versions) are dynamically built using junction tables.  The good news is that the question sets remain pretty consistent across the survey sets, but they are loaded in a temporary table and displayed from there.  The questions all load from a Question Table, so the Question Number and the Display number are different, though the relationships for the logic remain constant.  So, can I write a dynamic version of the verification that could still look at the Answer Key Table in all instances?  Then, if the Question Number is not in the table it would just do the Yes/No/NA, but if it is, it would look into the Temp table to see the score of the related questions, see the value in the Response Field and control what could be displayed?
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial