Solved

Words captured in different ways or formats

Posted on 2016-07-23
17
57 Views
Last Modified: 2016-08-10
Hi Experts

I have an excel file with large amount of data. I need a list words captured in different ways in a column of sentences.

This may be somewhat related to fuzzy matching, but I don't want similar matching words but exact words captured in different ways.

Example - Solar, SolAr, SOLAR
      pH, ph
      Inline, In-Line

I have attached one example to make my point clear.

Thanks in Advance
Words-in-Different-Formats1.xlsx
0
Comment
Question by:Vijayendra S. Murthy
  • 8
  • 7
  • 2
17 Comments
 
LVL 45

Expert Comment

by:Martin Liss
ID: 41725879
I assume that there are actually more than just two rows. Can you supply a larger sample with sample output?
0
 
LVL 48

Expert Comment

by:Rgonzo1971
ID: 41727072
Hi,

pls try
Sub macro()
Dim a, b
Dim aUnique
Set d = CreateObject("Scripting.Dictionary")
Sheets.Add
ActiveSheet.Name = "Tmp"
Sheets("Input").Activate
Sheets("Tmp").Cells.ClearContents
Sheets("Tmp").Range("A1:D1") = Array("1Col", "2Col", "3Col", "4Col")
Set Rng = Range(Range("B2"), Range("B" & Rows.Count).End(xlUp))
For Each c In Rng
    For Each Itm In Split(c, " ")
        Sheets("Tmp").Range("A" & Rows.Count).End(xlUp).Offset(1) = c.Offset(, -1)
        Sheets("Tmp").Range("A" & Rows.Count).End(xlUp).Offset(, 1) = Itm
    Next
Next

Set Rng1 = Range(Sheets("Tmp").Range("B2"), Sheets("Tmp").Range("B" & Rows.Count).End(xlUp))
Rng1.Offset(, 1).Formula = "=substitute(lower(b2),""-"","""")"
Rng1.Offset(, 2).Formula = "=IF(AND(COUNTIF(C:C,C2)>1,COUNTIF($C$2:C2,C2)=COUNTIF(C:C,C2)),1,NA())"
Set Rng2 = Rng1.Offset(, 2).SpecialCells(xlCellTypeFormulas, xlNumbers)
For Each Itm1 In Rng1.Offset(, 2).SpecialCells(xlCellTypeFormulas, xlNumbers).Offset(, -1)
    strUnique = strUnique & "|" & Itm1
Next
strUnique = Right(strUnique, Len(strUnique) - 1)
aUnique = Split(strUnique, "|")
Sheets("Tmp").Activate
With Range(Range("C1"), Range("C" & Rows.Count).End(xlUp))
    For Each itm2 In aUnique
        .AutoFilter Field:=1, Criteria1:=itm2
        Set Rng3 = .SpecialCells(xlCellTypeVisible)
        If Rng3.Cells.Count > 1 Then
            For Each c In Rng3
                If Not d.Exists(c.Offset(, -1)) And c <> "3Col" Then
                    d.Add c.Offset(, -1).Value, c.Offset(, -2).Value
                End If
            Next
        End If
        aItems = d.items
        aKeys = d.keys
        Sheets("Output").Range("A" & Rows.Count).End(xlUp).Offset(1) = Join(aItems, ", ")
        Sheets("Output").Range("A" & Rows.Count).End(xlUp).Offset(, 1).Resize(1, UBound(aKeys) + 1) = aKeys
        .AutoFilter Field:=1
        Set d = CreateObject("Scripting.Dictionary")
    Next
End With
Application.DisplayAlerts = False
Sheets("Tmp").Delete
Application.DisplayAlerts = True
End Sub

Open in new window

Regards
Words-in-Different-Formats2.xlsm
0
 

Author Comment

by:Vijayendra S. Murthy
ID: 41731272
Thanks experts for your timely response

@Martin Liss - I have attached a new sample. Please have a look at it.

@Rgonzo1971 - Thanks for your help. I will look into the code and update about it.
Words-in-Different-Formats2.xlsx
0
 
LVL 45

Expert Comment

by:Martin Liss
ID: 41732111
I notice that on your Output sheet you show (for example) "water" in column "B" and "Water" in column "D" in several rows. Why?

And you also show words like "Potable" in the output and the capitalized word doesn't exist. Is that a mistake or don't I understand the requirements?

Edit:
Here's what I came up with by running the Words macro contained in the attached workbook. Before I did so I copied and saved your Output sheet as Output (2) and created a new Output sheet.
28959127.xlsm
0
 
LVL 48

Expert Comment

by:Rgonzo1971
ID: 41732406
@Martin
just a remark maybe you want to adapt your regex for
Inline, In-Line
.Pattern = "[^A-Za-z0-9\ -]"

Open in new window

0
 

Author Comment

by:Vijayendra S. Murthy
ID: 41733033
@Martin - You have understood my requirement. Thank you, but can I get the part numbers also, it will be more useful. Actually I didn't give all the output. I have some code from our support guy, but it doesn't give the Part Numbers, also I don't have access to the macro.

Also, if we have the words 3-PORTMAN, 3-Portman, I get the output as 3Portman. I am not getting all the different type of words. Please go through and advise.

@Rgonzo1971 - your code worked, but for large data, excel gives out error that it cannot continue with the available resources. Also countif formula will make the file hang.

Thanks once again for all your help.
0
 
LVL 45

Accepted Solution

by:
Martin Liss earned 500 total points
ID: 41733202
Here's an update.
28959127a.xlsm
0
 

Author Comment

by:Vijayendra S. Murthy
ID: 41734687
Thanks Martin. It is working fine and I am closing this question by selecting this as best solution.

But one small final request before closing, if for example I make the word ASTM D2609 in 2nd cell as ASTM d2609 this is not shown in output. I think it is very difficult in regex to get all these variable number of cases. You can take a look into this.

Thanks all experts for your wonderful solutions and articles. I am going through some of the excel & VBA articles here on EE which are very much helpful.
0
How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

 
LVL 45

Expert Comment

by:Martin Liss
ID: 41734865
Are you saying that you want to consider "ASTM D2609" as one word when it's really two?
0
 

Author Comment

by:Vijayendra S. Murthy
ID: 41736423
@ Martin - No it is not showing for D2609 & d2609 only. I changed this in row 2 and ran the code.

Maybe we need to make a change in pattern which can accommodate all these types. But as I have said in my previous comment, I think it is difficult to get all these variable number of cases of words.

Thank you
0
 
LVL 45

Expert Comment

by:Martin Liss
ID: 41736525
Martin - No it is not showing for D2609 & d2609 only.
Yes, I assumed that that is what it was doing and I was asking if you considered "ASTM D2609" to be one "word" for the purposes of the code. I assume now that the answer is yes. Can you give me other examples of what these "words" look like? Is the first part always "ASTM" or is there some other commonality?
0
 

Author Comment

by:Vijayendra S. Murthy
ID: 41747223
Sorry Martin, I was busy and had not logged in to see the comment, and for the above, I mean only for D2609 & d2609 and not ASTM D2609. I changed cell B2 to d2609 and cell C2 to D2609, and I got the output correctly. But if I download the spreadsheet you have provided, this is not shown. I think, this is to do with the format of the cell or the carriage return in cell.

Thank you
0
 
LVL 45

Expert Comment

by:Martin Liss
ID: 41747428
Line 25 (and 24) is new.
Sub Words()
Dim strParts() As String
Dim strDicWords() As String
Dim colWords As New Collection
Dim intPart As Integer
Dim intD As Integer
Dim dicWords
Dim cel As Range
Dim strText As String
Dim bFound As Boolean
Dim k As Variant
Dim lngNR As Long
Dim intCol As Integer

    Set dicWords = CreateObject("scripting.dictionary")

    With Sheets("Input")
        ' Create a "Dictionary" object that contains an entry for each word in the Input sheet starting at row 2.
        ' The Key will be the upper case value of the word and the Item value will be something
        ' like "and|And" which represents all existing forms of the word separated by the pipe symbol.
        ' There will only be one entry per unique upper case word.
        For Each cel In .UsedRange.Offset(1, 0).Columns("B").Cells
            strText = cel.Value
            ' Replace line feed characters with spaces
            strText = Replace(strText, Chr(10), " ")
            ' Remove punctation from the cell value
            strText = RemovePunctuation(strText)
            ' Splt the cell value into individual words, adding them to the
            strParts = Split(strText, " ")
            ' Look through the words in the cell and if they aren't in the dictionary, add them,
            ' and if they are in the dictionary only add them if the case is different.
            For intPart = 0 To UBound(strParts)
                If dicWords.exists(UCase(strParts(intPart))) Then
                    strDicWords = Split(dicWords.Item(UCase(strParts(intPart))), "|")
                    bFound = False
                    For intD = 0 To UBound(strDicWords)
                        If strParts(intPart) = strDicWords(intD) Then
                            bFound = True
                            Exit For
                        End If
                    Next
                    If Not bFound Then
                        dicWords.Item(UCase(strParts(intPart))) = dicWords.Item(UCase(strParts(intPart))) & "|" & strParts(intPart)
                    End If
                Else
                    dicWords.Add UCase(strParts(intPart)), cel.Offset(0, -1) & "|" & strParts(intPart)
                End If
            Next
        Next
    End With
    
    lngNR = 2
    ' Print the dictionary entries
    With Sheets("Output")
        For Each k In dicWords.Keys
            strParts = Split(dicWords(k), "|")
            If UBound(strParts) > 1 Then
                .Cells(lngNR, "A") = strParts(0)
                For intPart = 1 To UBound(strParts)
                    .Cells(lngNR, intPart + 1) = strParts(intPart)
                Next
                lngNR = lngNR + 1
            End If
        Next
        ' Write headings
        intD = 2
        For intCol = 2 To .Cells.Find("*", SearchOrder:=xlByColumns, SearchDirection:=xlPrevious).Column
            .Cells(1, intD) = "Format " & intD - 1
            intD = intD + 1
        Next
    End With
End Sub

Open in new window

0
 

Author Comment

by:Vijayendra S. Murthy
ID: 41750619
Hi Martin - Thanks for your kind response, but the code is giving compile error - Sub or function not defined in the line - "Remove punctuation". I think there should be another function which is referred in this.

Thank you
0
 
LVL 45

Expert Comment

by:Martin Liss
ID: 41750634
I included the RemovePunctuation function in the 28959127a.xlsm workbook I attached in post ID: 41733202, but here it is.

Function RemovePunctuation(strText As String) As String

With CreateObject("vbscript.regexp")
    .Pattern = "[^A-Za-z0-9\ -]"
    .Global = True
    RemovePunctuation = .Replace(strText, vbNullString)
End With
End Function

Open in new window

0
 

Author Comment

by:Vijayendra S. Murthy
ID: 41750642
Sorry Martin, actually I thought this is the updated code and that file is old one only. My mistake. Now everything is fine.

Thank you once again
0
 
LVL 45

Expert Comment

by:Martin Liss
ID: 41750644
You're welcome and I'm glad I was able to help.

In my profile you'll find links to some articles I've written that may interest you.

Marty - Microsoft MVP 2009 to 2016
              Experts Exchange MVE 2015
              Experts Exchange Top Expert Visual Basic Classic 2012 to 2015
0

Featured Post

What Should I Do With This Threat Intelligence?

Are you wondering if you actually need threat intelligence? The answer is yes. We explain the basics for creating useful threat intelligence.

Join & Write a Comment

A little background as to how I came to I design this code: Around 5 years ago I designed an add-in that formatted Excel files to a corporate standard, applying different cell colours and font type depending on whether the cells contained inputs,…
Approximate matching with VLOOKUP and MATCH seems to me to be a greatly under-used technique, and one which is vital for getting good performance out of large lookups. Until recently I would always have advised using an exact match for simplicity an…
This Micro Tutorial will demonstrate on a Mac how to change the sort order for chart legend values and decrpyt the intimidating chart menu.
Excel styles will make formatting consistent and let you apply and change formatting faster. In this tutorial, you'll learn how to use Excel's built-in styles, how to modify styles, and how to create your own. You'll also learn how to use your custo…

758 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

21 Experts available now in Live!

Get 1:1 Help Now