Excel - Find nearest words for top ranked (text mining)

This might require a solution that is "good enough" or a few steps in the right direction, rather than a silver bullet that will solve everything.  

I am looking for a simplistic way to do some text analysis in Excel (or it could be in MS Word - or another application if that would work also).
The task is like this -- in a document with open-text survey responses (strings of text for each row)
1.  After counting all the words
2.  Omitting trivial words (the, a, and, it), rank the top results
3.  Take the top results (e.g. top 20) and determine which non-trivial words appear in the string with it.  For example:  Word "service" is ranked #5 in word-count.  When service is mentioned, the top words mentioned with it are:  "great", "efficient", "friendly"

Here's one approach I have used.

1.  convert data to single column (one word per cell)
2.  Use Pivot table to count words and sort descending
3.  use formulas to create phrases of two-words and three-words (concatenate words back together from original column of single words).
4.  Then, find top words and see what phrases they appear in.
5. then manually pick through them and count which other words look like they're common within 3 or 4 words of the target (?????)

That is a messy solution.

To summarize -- what I'm looking for is a way to (perhaps) tag the target words (top 20 most used, for example).    Then find a way to measure the distance between those target words and other words in the same strings (in a paragraph).  I thoujght about converting the words to variables and then doing some math to find out which variables are closest????  

There is commercial software that does this, but I'm looking for a simplistic solution in Excel -- and a chance to try some more innovative string manipulation, possibly.
Thanks for giving this a try.
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Lorenda ChristensenAccounting and VBA ConsultantCommented:
I'm thinking two loops:

First loop reads in the text string to an array (ignoring trivial words)

Some code after the first loop runs through the array, counting words, and getting top 20

Second loop takes top twenty words, and scrolls back through the array to pick up the next x before and after the top twenty word and writes them to a second file.

I'm not very familiar with Word, but doesn't it have a summarize tool? Have you seen what type of results you get with that?
billb1057Author Commented:
lorenda -- thanks for a good start.  Could we take it a step farther?
Let's say, we have the word "service" as one of the top 20 counted.  Now could find all the "before and after" words (let's extend it to 5 before and 5 after).
Then we want to count them somehow.
Perhaps a loop that counts all the words that are before or after the target. Then, count all of those words.
Target: "service".
Code:  If string contains "service" return five words before and five words after.  Then,  count all of the words surrounding the target.
The end result would be:
"Service" -- is surrounded by these ten words (...) ranked from most-frequent to least-frequent.
billb1057Author Commented:
I just tried the Word AutoSummarize and that's sort of interesting, but it doesn't really do that much (that I can see) and it's difficult to control the results.
Another option is to get some open-source software for text mining and then figure out how to use it, but this task is an attempt to avoid that.  :-)
Thanks again.
OWASP Proactive Controls

Learn the most important control and control categories that every architect and developer should include in their projects.

billb1057Author Commented:
Here's another thought:
Somehow tag the position of each word in the string (maximum of 5 before or 5 after) surrounding the target word (top 20).
Then count the words occurring in each position.
1.  Find position of target word in string.
2.  Determine how long the string is and get (up to max 5) words before and after target.
3.  Count words in each position
4.  Display in bubble chart with target word as biggest bubble, and then next largest count and position of words as smaller decreasing bubbles.
So, perhaps 10 columns, one for each of 5 positions before and after the target.
That is messy though because with 20 target words that is 200 columns wide.
Lorenda ChristensenAccounting and VBA ConsultantCommented:
Hmm, I'll have to write this up to see if I can get "meaningful" information. Let me play around with it.
Neil FlemingConsultant and developerCommented:
Try the attached and see if it helps.

Workbook has three sheets: "raw" "results" and "ignore"
Paste your source sentences as indicated into the "raw" sheet. Then hit the "mine text" button.

The code in the workbook parses all the text in the raw sheet; zaps all words listed in the "ignore" sheet; splits what's left into an array and then spits it out in the "results" sheet as follows:

1. It counts occurrences of words in column A
2. In the columns to the right of the main word list it gives all words occurring within X words to left and right of the main word

Value of X is also set in the "ignore" sheet.

The code then sorts the results.

There are three subroutines -- the main miner, a routine to add the nearby words, and a custom sort routine to deal with the "double columns" involved in listing words and their frequency.
Option Explicit
'written by Neil Fleming

Sub MineText()

Dim wb As Workbook, wsRaw As Worksheet, wsResults As Worksheet, wsIgnore As Worksheet

Dim rSource As Range, rRow As Range, rResults As Range, rMatch As Range
Dim allText As String
Dim unWanted As String
Dim Words() As String

Dim aWord As Integer, aRow As Integer, Nearby As Integer, maxCol As Integer

On Error GoTo errortrap

'set up worksheets
Set wb = ActiveWorkbook
Set wsRaw = wb.Sheets("raw")
Set wsResults = wb.Sheets("results")
Set wsIgnore = wb.Sheets("ignore")
'set rSource to  non-blank rows
Set rSource = Range(wsRaw.Range("A5"), wsRaw.Range("a5").End(xlDown))
allText = ""
For Each rRow In rSource.Rows
allText = allText & " " & rRow.Cells(1, 1).Value
'add trailing space:
allText = allText & " "

allText = UCase(allText)
'take out punctuation:
allText = Replace(allText, ",", " ")
allText = Replace(allText, ";", " ")
allText = Replace(allText, ".", " ")
allText = Replace(allText, " - ", " ")

'remove ignored words, using list from "ignore" worksheet
For aWord = 2 To wsIgnore.Cells(2, 1).End(xlDown).Row
unWanted = wsIgnore.Cells(aWord, 1).Value
While InStr(allText, " " & unWanted & " ") > 0
allText = Replace(allText, " " & unWanted & " ", " ")

'take out double spaces:
While InStr(allText, "  ") > 0
allText = Replace(allText, "  ", " ")

'remove leading and trailing spaces:
allText = Trim(allText)

'convert AllText to array of words:
Words = Split(allText, " ")

'blank results:

With wsResults
Application.ScreenUpdating = False

.Range("a1", "dd1000").Interior.Color = RGB(255, 255, 255)
.Cells(1, 1) = "Count"
.Cells(1, 2) = "Word"

'insert unique word results in results sheet:
aRow = 2
maxCol = 1
For aWord = 0 To UBound(Words)
'redefine results range:
Set rResults = .Range(wsResults.Range("b2"), .Range("b2").End(xlDown))
Set rMatch = rResults.Find(Words(aWord))

'if word in list, increment count:
    If Not rMatch Is Nothing Then
    rMatch.Offset(0, -1) = rMatch.Offset(0, -1) + 1
    'otherwise add to list:
    Set rMatch = wsResults.Cells(aRow, 2)
    rMatch.Value = Words(aWord)
    rMatch.Offset(0, -1).Value = 1
    aRow = aRow + 1
    End If
    'examine nearby words, count occurrences and add to right of main word:
    'number of words examined is set on "ignore" sheet in cell named "ptrMaxWords":
    For Nearby = 1 To Range("ptrMaxWords").Value
    'go backwards:
     If aWord - Nearby >= 0 Then AddNearby Words(aWord - Nearby), .Cells(rMatch.Row, 1), maxCol
    'go forwards:
    If aWord + Nearby <= UBound(Words) Then AddNearby Words(aWord + Nearby), .Cells(rMatch.Row, 1), maxCol
    Next Nearby
Next aWord

'adjust column widths

'sort rows using CustomSort sub below:
For Each rRow In rResults
CustomSort rRow
Next rRow

'sort columns
Set rResults = .Range("a1", .Cells(rResults.Rows.Count + 2, maxCol))
rResults.Sort .Range("a1", rResults.Range("a1").End(xlDown)), xlDescending, , , , , , xlYes, , , xlSortColumns, xlPinYin

    With rResults.Borders
    .LineStyle = xlContinuous
            .Weight = xlThin
            .Color = RGB(255, 140, 0)
    End With
.Range("A:B").Font.Bold = True

End With

Application.ScreenUpdating = True

End Sub

Sub AddNearby(aString As String, ByRef aRange As Range, maxCol As Integer)
Dim rcol As Range

Set rcol = aRange.EntireRow.Find(aString)
     If Not rcol Is Nothing Then
     'increase count:
     rcol.Offset(0, -1) = rcol.Offset(0, -1) + 1
     'add word
     Set rcol = aRange.End(xlToRight).Offset(0, 1)
     rcol.Value = 1
     rcol.Offset(0, 1) = aString
     End If
     'adjust value of the rightmost altered column:
     If rcol.Column + 1 > maxCol Then maxCol = rcol.Column + 1 Else maxCol = maxCol

End Sub

Sub CustomSort(aRange As Range)
'sorts pairs of cells along row based on the value of the first cell in the pair:
Dim StartCol As Integer, swapCol As Integer, swapValue As Integer
Dim swapText As String

StartCol = 2
    swapCol = StartCol + 2
        If aRange.Cells(1, swapCol) > aRange.Cells(1, StartCol) Then
        'swap cells and next cell along (containing word) with each other:
        swapValue = aRange.Cells(1, swapCol)
        swapText = aRange.Cells(1, swapCol + 1)
        aRange.Cells(1, swapCol) = aRange.Cells(1, StartCol)
        aRange.Cells(1, swapCol + 1) = aRange.Cells(1, StartCol + 1)
        aRange.Cells(1, StartCol) = swapValue
        aRange.Cells(1, StartCol + 1) = swapText
        End If
    swapCol = swapCol + 2
    Loop Until aRange.Cells(1, swapCol) = ""
StartCol = StartCol + 2
Loop Until aRange.Cells(1, StartCol) = ""

End Sub

Open in new window


Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
billb1057Author Commented:
Neil -- that is amazing.  It's going to take me a while to figure out what you did, but it works perfectly.  You are a genius!!  :-)
Neil FlemingConsultant and developerCommented:
Thank you for the kind comment. Was fun to do.

I notice there is some punctuation I didn't zap, like ":", "?" etc

You might want to have a list of the punctuation in the "ignore" sheet too and modify the replace code to use that list, modelling it on the code that zap the ignore words.
billb1057Author Commented:
I also caught parentheses and added them to the replace
allText = Replace(allText, ")", " ")

allText = Replace(allText, "(", " ")
How would I modify the code to pick up 3 words before or after (or possibly all words in the string??).  
I was thinking about weighting the words based on how close they are to the target, but I can see that's not necessary since my text snippets are very much like the samples you provided -- short survey responses.  So, if a word shows up in the string it doesn't make a difference if it's in the second position or fourth position -- it's in the same comment and that's all that counts.
I will ask a related question in a couple of days (I think EE will notify you) regarding how to chart this also.
Many thanks for a very innovative approach.  I've looked around the web and I haven't seen anything in Excel that can do this (so maybe you can patent it).  :-)
billb1057Author Commented:
lorenda -- thanks also for your help.
Neil FlemingConsultant and developerCommented:
You can change the number of words it picks up in the "ignore" sheet.. the number in cell G1 determines how far it looks.
billb1057Author Commented:
I saw that and then forgot.  :-)  That's another great feature.  Thanks again.
billb1057Author Commented:
Neil - Does the Words variable contain the entire text or just the string on each row?  I'm seeing some results pick up words from a row above or below the first counted word.  That would work if the text was one long paragraph (divided by sentence for each row), but in my case, each row is a survey entry so they don't go together with the row above and the row below.
Neil FlemingConsultant and developerCommented:
Sorry for the slow response... You're right, it is currently picking up words from previous rows because it was useful to have all the words in a single array.

I'll put some markers in to split it up and have the adjacent word search stop when it hits a marker.
Neil FlemingConsultant and developerCommented:
Try this. I added a text marker ">>" between responses and rewrote to stop the AddNearby function if it encounters one of these. Not super-elegant, but it works...

Also added a second key to the sort to sort a) by hits and b) alphabetically by word.

billb1057Author Commented:
That works really well now.  Thanks!
This line:
allText = allText & " >> " & rRow.Cells(1, 1).Value
is where you put the ">>" at the end of each row, right?  That's a great solution.  
I'll keep your prior version also for use if there is one long document by a single author.
Neil FlemingConsultant and developerCommented:
You could also rewire it to handle one line at a time, processing the "nearby" words as you go. There is no real need to glue them all together -- I think I did that because I was going to approach it a different way initially.
billb1057Author Commented:
Would that work by inserting a stop character at various points?
billb1057Author Commented:
Are you available for a new related question (with new points, etc)  to modify the code a bit?
Bill B
Neil FlemingConsultant and developerCommented:

Actually, have been using the code myself and have made some modifications as attached.

This version addresses some crashes which occur if you feed the workbook a long document and wind up with more "adjacent" words than Excel's maximum number of columns (which is 256 -- so can only have 127 adjacent words).

If the limit is reached, the code now begins a new row for the primary word. It also assigns the total count of the primary words to the "count" column A for both (or more) copies of the primary word.

I also took out the clunky ">>" workaround, and had the code process each line of the raw sheet individually. For very long files, there is now also a counter on the raw sheet indicating progress.

I have tested by pasting an entire novel into the raw worksheet. Seems not to break.  :)

Neil FlemingConsultant and developerCommented:

Have actually modified it since then, and attach the results. It was crashing once it reached excel's column limit (256) when displaying associated words from large documents.

This version starts a new line for the same "primary word" when it reaches the limit, and then sums the totals for the primary word.

It also now handles one line of raw input at a time, instead of all the text at once, so doesn't need the ">>" delimiter any more.

Seems to be robust, even when I tried pasting a whole novel into the "raw" sheet.
Neil FlemingConsultant and developerCommented:
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Microsoft Excel

From novice to tech pro — start learning today.