Link to home
Start Free TrialLog in
Avatar of billb1057
billb1057Flag for United States of America

asked on

Excel - Text Parsing, separate words then rejoin them by 2-word and 3-word combinations

This might be a challenge, so I gave it 500 points.  Or it might be easy, I really don't know.

I am trying to do some Text Mining, or Text Analytics on survey comments using MS Excel -- starting with the simplest concepts of separating the text into single words (in sequence) and then looking at two-word and three-word combinations in the text.

The attached Excel file gives some text (feel free to use your own if it might be a better example) with the columns started.

I can somewhat do this procedure in MS Word and then with Concatenate and Pivot Tables back in Excel.
In Word, I dump all the text, then remove all periods and commas.  But first, I replace all periods with a paragraph mark (to get the end of a sentence).  Then I replace all spaces (" ") with a paragraph mark.  This creates a single column of words, in sequence -- with two paragraphs at the end of each cell.
Then with pivot tables I can count the words.  With Concatenate, I can add one cell to the following cell and get two-word combinations and do the same for three-word combinations.  These can be counted (with Pivot Tables) to find key themes in the text.

Now -- to do all of that in Excel with Code???
I have found the attached code which does a nice job of Counting Instances of Words in the text.  No matter what I've tried though, I cannot get this to just display the single words in sequence though ( and I don't understand how the range works).
So, this might be a helpful starting point???

If not, and you're up for the challange -- could you try to write the code that would parse the strings into single words in a column, and then join them in two-word & three-word combinations?

Thanks very much.
Sub CreateUniqueWords()
    Dim rngData As Range
    Dim rngCell As Range
    Dim colWords As Collection
    Dim vntWord As Variant
    
    On Error Resume Next
    
    Set colWords = New Collection
    Set rngData = Range("B30:B86")
    For Each rngCell In rngData.Cells
        For Each vntWord In Split(Replace(Replace(Replace(rngCell.Value, """", ""), "]", ""), "[", ""), " ")
            colWords.Add colWords.Count + 1, vntWord
            With Cells(29 + colWords(vntWord), 3)
                .Value = vntWord
                .Offset(0, 1) = .Offset(0, 1) + 1
            End With
        Next
    Next
    
    With Range("C30", Cells(Rows.Count, 3).End(xlUp)).Resize(, 2)
        .Sort .Cells(1, 2), xlDescending
    End With
    
End Sub

Open in new window

TextParsing.xls
Avatar of tilsant
tilsant
Flag of India image

chk out the file attached.
I have used formula.
From E2:V28, same formula have been used, so can be dragged anywhere!


Tils.
TextParsing.xls
Avatar of billb1057

ASKER

Wow -- that is really nice with just formulas.  
Ok, now the only thing missing is the shift to the next word.
Here we have the first word, then the first and second, then the first second and third.  So that is great.
But, what we need also is the first, second and third -- then the second, third and fourth, then the third, fourth and fifth.  So, there are 3-word combinations for the string with different starting points.
Could we just use an OFFSET somewhere in that?  So, you get the first word, then OFFSET to get the second, third, etc.  Then do the same with the 2-word combinations. I am just guessing.
The goal is to find each two or three word combination.
So the sentence.  "The calculations are always inaccurately presented" would give:
The calculations are
calculations are always
are always inaccurately
always inaccurately presented

Notice the first words in the left column are the sentence.  So now, these phrases could be counted (if there were 2000 entries, for example), and the phrase "always inaccurately presented" -- would be a very important insight.
Could you give that a try?  This is great work so far.
im kinda busy now and im not finding an easy way out!
gimme some time....

i'll get back to u 2moro.......... hope its not that very urgent.


Tils.
That would be great.  Not urgent -- try what you can and let me know.
I think we might have to use the VBA command "Split" somehow though -- to split each String into separate Words and then join them back together.
If we can do the same thing in a Formula that would be even better.   I think Split to make a single column of all the words, then Concatenate A2 & A3 & A4;  A3 & A4 & A5;  A4 & A5 & A6 ... that would work.  But then we have to avoid making the end of one sentence the beginning of another.  So, if there is a space after each sentence then an IF statement which avoids the Null cell -- or something like that.
well yea........... actually thats a nice idea!
though im not a VB Programmer........... i can try implementing your suggestion by using formulas.


 
ASKER CERTIFIED SOLUTION
Avatar of tilsant
tilsant
Flag of India image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
That was a good idea.  Text-to-columns with the " " delimiter breaks apart the text in each cell and then you put them back together in groups of three and move to the next cell.
Very nicely done.  It works great -- thanks!!!
 
Recording a Macro is such a great tool for me these day ;)


Cheers,
Tils.
Can you figure out how to skip a blank cell?  I couldn't find a way to do that by recording.  
what exactly do u mean by skipping a blank??
Could u post some sample file?
This is the same file but with one row blank (the code is in the sheet).
I think it just needs something like
IF active.cell ISNULL - Then skip to next one.
Or something like that.  ???
What happens is that I get a file and some rows have text and some don't.

TextParsing2.xls
would it be fine if we remove the blanks from the Column A??
I think that's good enough.  :-)
Cheers,
Bill
here's your updated code.
Sub Macro1()
    Application.ScreenUpdating = False
    Columns("A:A").Select
    Selection.SpecialCells(xlCellTypeBlanks).Select
    Selection.EntireRow.Delete
    Range("A2").Select
    For i = 1 To 27
    Selection.Copy
    ActiveCell.Offset(0, 1).Range("A1").Select
    ActiveSheet.Paste
    Application.CutCopyMode = False
    Selection.TextToColumns Destination:=ActiveCell, DataType:=xlDelimited, _
        TextQualifier:=xlDoubleQuote, ConsecutiveDelimiter:=True, Tab:=True, _
        Semicolon:=False, Comma:=False, Space:=True, Other:=False, FieldInfo _
        :=Array(Array(1, 1), Array(2, 1), Array(3, 1), Array(4, 1)), TrailingMinusNumbers:= _
        True
    ActiveCell.Offset(1, 0).Range("A1").Select
    ActiveCell.FormulaR1C1 = "=TRIM(R[-1]C&"" ""&R[-1]C[1]&"" ""&R[-1]C[2])"
    ActiveCell.Select
    Selection.Copy
    ActiveCell.Range("A1:S1").Select
    ActiveSheet.Paste
    Application.CutCopyMode = False
    Selection.Cut
    ActiveCell.Offset(-1, 0).Range("A1").Select
    ActiveCell.Offset(1, 0).Range("A1").Select
    Range(Selection, Selection.End(xlToRight)).Select
    Application.CutCopyMode = False
    Selection.Copy
    ActiveCell.Offset(-1, 0).Range("A1").Select
    Selection.PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks _
        :=False, Transpose:=False
    ActiveCell.Offset(1, 0).Range("A1").Select
    Range(Selection, Selection.End(xlToRight)).Select
    Application.CutCopyMode = False
    Selection.ClearContents
    ActiveCell.Offset(0, -1).Range("A1").Select
    Next
    Range("A1").Select
    Application.ScreenUpdating = True
End Sub

Open in new window

It works great  - thanks again!