billb1057 (United States of America)

asked on

Excel - Find nearest words for top ranked (text mining)

This might require a solution that is "good enough" or a few steps in the right direction, rather than a silver bullet that will solve everything.  

I am looking for a simplistic way to do some text analysis in Excel (or it could be in MS Word - or another application if that would work also).
The task is like this -- in a document of open-text survey responses (one string of text per row):
1.  Count all the words.
2.  Omitting trivial words (the, a, and, it), rank the top results.
3.  Take the top results (e.g. top 20) and determine which non-trivial words appear in the same string with each of them.  For example: the word "service" is ranked #5 by word count, and when "service" is mentioned, the top words mentioned with it are "great", "efficient", "friendly".
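The three steps above can be sketched in Python (the stop-word list and sample responses are illustrative; the thread's eventual solution is an Excel/VBA workbook, so this is only a sketch of the logic):

```python
from collections import Counter
import re

STOPWORDS = {"the", "a", "and", "it", "was", "is", "to"}  # illustrative trivial-word list

responses = [
    "The service was great and efficient",
    "Friendly service, great staff",
    "Great prices and friendly service",
]

def tokenize(text):
    # lower-case, split on non-letters, and drop the trivial words
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

# Steps 1-2: count all non-trivial words and rank them
counts = Counter(w for r in responses for w in tokenize(r))
top = [w for w, _ in counts.most_common(20)]

# Step 3: for a target word, count the other words in the same responses
def co_occurring(target):
    co = Counter()
    for r in responses:
        words = tokenize(r)
        if target in words:
            co.update(w for w in words if w != target)
    return co
```

With the sample data, `co_occurring("service")` ranks "great" first, then "friendly", matching the kind of output the question describes.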

Here's one approach I have used.

1.  Convert the data to a single column (one word per cell).
2.  Use a pivot table to count words and sort descending.
3.  Use formulas to create two-word and three-word phrases (concatenating words back together from the original column of single words).
4.  Find the top words and see what phrases they appear in.
5.  Manually pick through them and count which other words seem common within 3 or 4 words of the target.

That is a messy solution.

To summarize -- what I'm looking for is a way to (perhaps) tag the target words (the top 20 most used, for example), then find a way to measure the distance between those target words and the other words in the same strings (in a paragraph).  I thought about converting the words to variables and then doing some math to find out which variables are closest.

There is commercial software that does this, but I'm looking for a simplistic solution in Excel -- and a chance to try some more innovative string manipulation, possibly.
Thanks for giving this a try.
Lorenda Christensen (United States of America)

I'm thinking two loops:

First loop reads in the text string to an array (ignoring trivial words)

Some code after the first loop runs through the array, counting words, and getting top 20

Second loop takes the top twenty words and scrolls back through the array to pick up the next x words before and after each top-twenty word, writing them to a second file.
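The two-pass plan above can be sketched in Python (the stop-word list, the window size `x`, and the `two_pass` name are illustrative, not code from the thread):

```python
from collections import Counter

STOPWORDS = {"the", "a", "and", "it"}  # illustrative trivial-word list

def two_pass(words, x=2, top_n=20):
    # First pass: build the filtered array (trivial words ignored) and rank it
    arr = [w for w in words if w.lower() not in STOPWORDS]
    top = {w for w, _ in Counter(arr).most_common(top_n)}
    # Second pass: for each hit on a top word, grab the x words before and after
    hits = []
    for i, w in enumerate(arr):
        if w in top:
            context = arr[max(0, i - x):i] + arr[i + 1:i + 1 + x]
            hits.append((w, context))
    return hits
```

Each `(word, context)` pair corresponds to one line that the second loop would write to the second file.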

I'm not very familiar with Word, but doesn't it have a summarize tool? Have you seen what type of results you get with that?
billb1057 (Asker)
lorenda -- thanks for a good start.  Could we take it a step farther?
Let's say we have the word "service" as one of the top 20 counted.  Now we could find all the "before and after" words (let's extend it to 5 before and 5 after).
Then we want to count them somehow.
Perhaps a loop that collects all the words before or after the target, and then counts all of those words.
Target: "service".
Code: if a string contains "service", return five words before and five words after; then count all of the words surrounding the target.
The end result would be:
"Service" -- is surrounded by these ten words (...) ranked from most-frequent to least-frequent.
 
I just tried the Word AutoSummarize and that's sort of interesting, but it doesn't really do that much (that I can see) and it's difficult to control the results.
Another option is to get some open-source software for text mining and then figure out how to use it, but this task is an attempt to avoid that.  :-)
Thanks again.
Here's another thought:
Somehow tag the position of each word in the string (maximum of 5 before or 5 after) surrounding the target word (top 20).
Then count the words occurring in each position.
1.  Find position of target word in string.
2.  Determine how long the string is and get (up to max 5) words before and after target.
3.  Count words in each position
4.  Display in bubble chart with target word as biggest bubble, and then next largest count and position of words as smaller decreasing bubbles.
So, perhaps 10 columns, one for each of 5 positions before and after the target.
That is messy though because with 20 target words that is 200 columns wide.
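The position-tagging idea above can be sketched in Python (the `position_counts` helper is hypothetical; offsets run from -5 to +5 around each occurrence of the target):

```python
from collections import Counter

def position_counts(words, target, span=5):
    # counts[p] tallies the words found at offset p from the target,
    # for p in -span..+span (excluding 0, the target itself)
    counts = {p: Counter() for p in range(-span, span + 1) if p != 0}
    for i, w in enumerate(words):
        if w == target:
            for p in counts:
                j = i + p
                if 0 <= j < len(words):
                    counts[p][words[j]] += 1
    return counts
```

The per-position counters map directly onto the 10 columns described above (5 before, 5 after), and their sizes could feed the bubble chart.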
Hmm, I'll have to write this up to see if I can get "meaningful" information. Let me play around with it.
ASKER CERTIFIED SOLUTION
Neil Fleming (United Kingdom of Great Britain and Northern Ireland)

Neil -- that is amazing.  It's going to take me a while to figure out what you did, but it works perfectly.  You are a genius!!  :-)
Thank you for the kind comment. Was fun to do.

I notice there is some punctuation I didn't zap, like ":", "?", etc.

You might want to have a list of the punctuation in the "ignore" sheet too, and modify the replace code to use that list, modelling it on the code that zaps the ignore words.
I also caught parentheses and added them to the replace
allText = Replace(allText, ")", " ")

allText = Replace(allText, "(", " ")
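The pair of Replace calls above can be generalised by driving the zapping from a character list, as suggested. A Python sketch of the same idea (the punctuation list is illustrative, not the workbook's actual ignore sheet):

```python
# Same idea as the VBA Replace calls: one loop over a list of punctuation
# characters instead of one hard-coded Replace call per character.
PUNCTUATION = [".", ",", ":", ";", "?", "!", "(", ")", '"']

def zap_punctuation(text, chars=PUNCTUATION):
    for c in chars:
        text = text.replace(c, " ")  # replace with a space, as the VBA does
    return text
```

In the workbook itself the list could live on the "ignore" sheet, so adding a new character to zap would not require touching the code.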
How would I modify the code to pick up 3 words before or after (or possibly all words in the string)?
I was thinking about weighting the words based on how close they are to the target, but I can see that's not necessary since my text snippets are very much like the samples you provided -- short survey responses.  So, if a word shows up in the string it doesn't make a difference if it's in the second position or fourth position -- it's in the same comment and that's all that counts.
I will ask a related question in a couple of days (I think EE will notify you) regarding how to chart this also.
Many thanks for a very innovative approach.  I've looked around the web and I haven't seen anything in Excel that can do this (so maybe you can patent it).  :-)
lorenda -- thanks also for your help.
You can change the number of words it picks up in the "ignore" sheet; the number in cell G1 determines how far it looks.
I saw that and then forgot.  :-)  That's another great feature.  Thanks again.
Neil - Does the Words variable contain the entire text or just the string on each row?  I'm seeing some results pick up words from a row above or below the first counted word.  That would work if the text was one long paragraph (divided by sentence for each row), but in my case, each row is a survey entry so they don't go together with the row above and the row below.
Sorry for the slow response... You're right, it is currently picking up words from previous rows because it was useful to have all the words in a single array.

I'll put some markers in to split it up and have the adjacent word search stop when it hits a marker.
Try this. I added a text marker ">>" between responses and rewrote to stop the AddNearby function if it encounters one of these. Not super-elegant, but it works...



Also added a second key to the sort to sort a) by hits and b) alphabetically by word.
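The marker approach can be sketched in Python (the `nearby` helper is hypothetical, standing in for the workbook's AddNearby routine; ">>" is the response marker described above):

```python
MARKER = ">>"

def nearby(words, i, x=5):
    # collect up to x words on either side of index i,
    # stopping early if a response marker is hit
    out = []
    j = i - 1
    while j >= 0 and words[j] != MARKER and len(out) < x:
        out.append(words[j])
        j -= 1
    out.reverse()
    before = len(out)
    j = i + 1
    while j < len(words) and words[j] != MARKER and len(out) - before < x:
        out.append(words[j])
        j += 1
    return out
```

Because the scan stops at the marker, a word near the start of one response can never pick up context from the end of the previous response.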

TextMiner.xls
That works really well now.  Thanks!
This line:
allText = allText & " >> " & rRow.Cells(1, 1).Value
is where you put the ">>" at the end of each row, right?  That's a great solution.  
I'll keep your prior version also for use if there is one long document by a single author.
You could also rewire it to handle one line at a time, processing the "nearby" words as you go. There is no real need to glue them all together -- I think I did that because I was going to approach it a different way initially.
Would that work by inserting a stop character at various points?
Neil,
Are you available for a new related question (with new points, etc)  to modify the code a bit?
Bill B
Yes.

Actually, have been using the code myself and have made some modifications as attached.

This version addresses some crashes which occur if you feed the workbook a long document and wind up with more "adjacent" words than Excel's maximum number of columns (256 -- so it can only hold 127 adjacent words).

If the limit is reached, the code now begins a new row for the primary word. It also assigns the total count of the primary words to the "count" column A for both (or more) copies of the primary word.

I also took out the clunky ">>" workaround, and had the code process each line of the raw sheet individually. For very long files, there is now also a counter on the raw sheet indicating progress.
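The row-wrapping fix described above can be sketched in Python. The workbook's exact column layout isn't shown in the thread, so this sketch assumes two leading columns (the primary word and its total count) with the remaining columns holding adjacent words; the `wrap_rows` name and the layout are assumptions, not the actual VBA:

```python
MAX_COLS = 256  # Excel's (pre-2007) worksheet column limit

def wrap_rows(primary, total, nearby_words, max_cols=MAX_COLS):
    # assumed layout: columns 1-2 hold the primary word and its total count,
    # so each row has room for max_cols - 2 adjacent words; overflow starts
    # a new row that repeats the primary word and the same total count
    room = max_cols - 2
    rows = []
    for start in range(0, len(nearby_words), room):
        rows.append([primary, total] + nearby_words[start:start + room])
    return rows or [[primary, total]]
```

Repeating the total count on every continuation row matches the description above, where the count column is filled in for both (or more) copies of the primary word.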

I have tested by pasting an entire novel into the raw worksheet. Seems not to break.  :)



TextMiner2.xls