Link to home
Start Free TrialLog in
Avatar of musickmann
musickmann

asked on

Count occurrence of pattern in excel column

I'm looking for a way to look through a column and determine the occurrences of specific patterns. The file contains test answer keys with several columns, the important ones are:
question number, answer, test name
As the file contains a list of many tests, the results need to be based per test, sort of like the subtotal option for "at change in X column"

There are two main patterns we need to identify:
- How many times answer X occurs consequentially
- How many times each answer choice occurs in the test

I would think maybe highlighting any time a specific answer occurs 4 or more times in a row would be one good output. And at the end of each test, a row that lists the counts of each answer choice.

At the moment, the answer choices are stored as numbers (ordinal rather than letters) as text. I've got a simple nested if formula to translate the numbers to corresponding letters, I would plan on running that prior to running the pattern analysis.
Avatar of Doug
Doug
Flag of United States of America image

The first one is pretty simple.  To count the instances of "Skipper" in column A, use this function
=COUNTIF(A:A,"Skipper")

I'll try to get something about the second one together.
If you don't have a lot of answer choices, you can use multiple CountIf statements like I did above.  That would be the quickest way.  Aside from writing a VB module or using a PowerPivot table, dynamically getting the counts might be difficult.

Column A          Column B
[Answer1]          =COUNTIF(A:A, $A1)
[Answer2]          =COUNTIF(A:A, $A2)
[Answer3]          =COUNTIF(A:A, $A3)
[Answer4]          =COUNTIF(A:A, $A4)
[Answer5]          =COUNTIF(A:A, $A5)
[Answer6]          =COUNTIF(A:A, $A6)
Here's a VBA function I wrote that'll do what you need.  I included comments inline.

Sub CountUniqueInstances()
    Dim Col As New Collection
    Dim itm
    Dim i As Long
    Dim CellVal As Variant
    
    'Get the last row which has active data
    TheLastRow = ActiveSheet.UsedRange.SpecialCells(xlCellTypeLastCell).Row
    
    'Loop through the single column of each row adding unique values to the collection
    For i = 1 To TheLastRow
        CellVal = Sheets("Sheet1").Range("A" & i).Value
        
        If CellVal <> "" Then
            On Error Resume Next
            Col.Add CellVal, Chr(34) & CellVal & Chr(34)
            On Error GoTo 0
        End If
        
    Next i

    'Each value will be written on row 1 starting at column B
    ColumnIndex = 2 'if you want the column to be C, change this to 3, for D, change this to 4, etc.
    For Each itm In Col
        'count the occurrence of each unique value found
        iVal = Application.WorksheetFunction.CountIf(Range("A:A"), itm)
        Cells(1, ColumnIndex) = itm & ":" & iVal
        ColumnIndex = ColumnIndex + 1
    Next
End Sub

Open in new window

Avatar of musickmann
musickmann

ASKER

Thanks. The countif option is something I started with, and I thought would get me part of the way.

so all the answer choices are 1, 2, 3, or 4, which correspond to A, B, C, D

Since the file I have contains hundreds of tests with about 50 questions each, I'm looking for a way to do this en mass.

You would be a true savior if there was a way to create a new sheet that had a summary of some kind.

I'm attaching a file that contains a single test of dummy data, I added a summary sheet as an example of what would be ideal. The "Odd pattern" column would refer to something I could specify somewhere. I would essentially just make a list of what I would consider odd patterns since that is more complicated I know. But the counts would be the first step across the entire document.
I'm not sure I follow the script you have.

I changed line 12 to   Range("C" & i) since I have the data I need to count in column C.
I changed line 23 to ColumnIndex = 3 for the same reason
I cahnged line 26 to Range(C:C) again for the same reason

Maybe I shouldn't have adjusted those? The result I get in the first row doesn't seem to make sense.
This is Column C that I'm trying to count and what came up in the first row when running the script:

correct_answer_ordinal:1
correct_answer_ordinal
3
3
4
3
3
4
I don't think the attachment got included.
Sorry, I guess I didn't click the Upload button after selecting the file. doh.
AnswerKeyProject.xlsx
Maybe try to set the CountIf statements on the Summary Sheet.  It'll automatically update based on the values.

If you add another sheet (i.e. Test), you just have to copy\paste a new row to the summary sheet and edit the sheet name in the 4 formulas.
AnswerKeyProject.xlsx
That is a great simple solution for the counting of the occurrences. Right now I have about 300 tests or so in a single sheet, is there some way to use a script to run through the sheet, then automatically fill in the summary page, making an entry each time the test name column changes? I thought of the subtotal feature, but I don't know how to make that work counting the occurances of the 4 different answer options in the single column.

Do you have any thoughts on identifying patters? The most critical pattern would be to identify when there are 4 or more of the same answer consecutively.
I'm sure a script could be written. Give me a little bit to see what I can come up with.  So just to be clear, there is only 1 sheet with all the tests.  Do you have a more accurate example I can play with?
Thanks!

Here is a more complete sample set.
AnswerKeys.xlsx
I'm glad I asked for an updated version because the PP column wasn't on the first version.

Here's the scoop.  I got a working version.  
1. Open the excel doc I've attached.  If you get an "Enable Macros" warning, click the "Enable Content" button.  Because there's code running behind it, it'll give you this warning unless you change the security settings.  
2. Next, Press "Alt" + F8 to open the macro window
3. Run the "CalculateTestResults" macro
4. Check out the results on the "Summary" tab.
5. Change one of the tests to have 4 concurrent values.  Rerun the macro
     The OddPatterns column will now show a "Y" and be highlighted.

Here are the instructions going forward:
1. Unless you're familiar with creating Excel VB Macros, you'll have to copy your data into this worksheet I provided.
2. Copy the data into the "Tests" sheet of the workbook.
     PLEASE NOTE, the code I wrote is dependent on the column ordering.  If you add, change, or remove columns, someone will have to change the code.
3. Press ALT + F8 to open the macro window.
4. Run the "CalculateTestResults" Macro
5. Check out the results on the summary tab.

Enjoy!
AnswerKeys.xlsm
So, I'm getting an error:
Run-time error '1004'
Application-defined or object-defined error

The debug option points to this line:
Sheets("Summary").Cells(TestNumber + 2, 1).Value = "Test " & TestNumber

The only difference in data from the sample file is the test name perhaps? The names can contain spaces, dashes, periods, and underscores. I think they are limited to 100 characters in length. Maybe that has something to do with it?

The export file has so much extra info, I tried removing the extraneous stuff so it didn't cause confusion, so the PP column is just points possible on the item.

While I can't write macros, I do have some limited programming skills, most of which is being able to read through and make small changes.

I'm going to try my hand at tracking down the error. I am running on a Mac, and also tried using a windows VM, so not sure if that has anything to do with this.
Did the file work the first time you ran it before you made changes?  The problem is based on what you gave me, i assumed all the test names would be "test x".  Hence my request for a more real-life sample set.  :-)  Now that you said its not always the case, there is an issue.  I may be able to tweak the code to look at the previous test name and compare it to the current one to see if they are different.  Lemme take another look.
Yeah, I was afraid of that. I had to be careful given the source of the data is secure testing material, and since these posts can become publicly available, I was trying to keep things generic as possible.

The number of questions per test is not consistent either, so one test may have 50, one may have 40, etc. I think what I can do is manually replace the test names to match the text x expected, then reconnect the official names after running the analysis.
The question count won't matter, the code will handle that. I'll see if I can make the test names more flexible. If it's a ton of work I'll let you know.  You can send me a clear example to my personal email if you want. It'll definitely make testing the new code easier.  Feedback@DRECustom.com
Well, here is an exact data dump, I did a replacement of test names, so this is the full file I'm working with, same error.

This is the file, after running the script, so you can see the summary results in a full page of 0s and a large file.


Just for reference-
I copied the test names  to a new sheet, removed duplicates
In the next column, names each test consecutively
Did a lookup formula to attach the new test name to the original file, then replaced the original file test names with the new ones.
After the summary, I can easily do another lookup on the list to get the original names back
AnswerKeys-1.xlsm
ASKER CERTIFIED SOLUTION
Avatar of Doug
Doug
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Absolutely awesome. You have saved me so much time!