Solved

Excel Macro to grab multiple data elements from XML

Posted on 2011-03-15
24
771 Views
Last Modified: 2012-05-11
Hi,

A few Experts have tried to help me with a data manipulation challenge.

In http:Q_26804341.html  Sid created a macro that would extract the text element (text=) from each node of a directory of XML files. It would then split out the unique values into separate tabs of an Excel workbook with all unique values listed. This worked great.

Then I needed to perform the same operation (find and list unique values in a spreadsheet) for a different data element in a directory of XML files. This time it was CID=, found in many nodes but not all.

The text= values in each node is the first data element in each node. The CID= values, if present are the 7th data element within the nodes they appear.

I realized that Sid's solution didn't seem to be looking for anything called "text=", so when I went to modify the macro for "CID=", I was stumped.

So on http:Q_26820317.html I asked for assistance pulling out a different value. The accepted solution was supposed to just pull out all elements into different columns of a spreadsheet, but I had to leave this project for nearly two weeks and I had to accept the final solution without testing (I hate leaving experts hanging...)

Long story short, it doesn't work. Nothing seems to happen when I run the macro. But more importantly, it was really trying to do more than I needed. I need two values - "text=" and "CID=" pulled from the XML files, duplicates removed and displayed in columns, preferably just like the first macro from the first question, but just have another column with the CID unique values.

If someone can assist me with macro modifications, I would appreciate some assistance. I'm happy to break things up over multiple questions for more points. I'm just in over my head here and still need a solution.

I've attached two XML sample files, to mimic the directory of numerous files.

And I'm pasting the original code, which is closest to the out put I need, but just don't know of a way to change from "text=" to "CID=" as the data element from which to pull unique values.

XML-to-Excel-samples.zip

Thanks

Bill
Private Sub CommandButton1_Click()
    Dim strPath As String, strCurrentFile As String
    Dim MyData As String, strData() As String, strFile As String
    Dim temp() As String
    Dim I As Long, LastRow As Long
    Dim ws As Worksheet
    
    '~~> Change the path to the folder where the XML's are stored
    strPath = "C:\Temp\"
    
    Application.ScreenUpdating = False
    Application.DisplayAlerts = False
    
    strCurrentFile = Dir(strPath & "c*.xml")
    
    '~~> Loop through the folder to get the xml files
    Do While strCurrentFile <> ""
        strFile = strPath & strCurrentFile
        
        Set ws = Sheets.Add
        ws.Name = Replace(strCurrentFile, ".xml", "", , , vbTextCompare)
        ws.Range("A1") = "File"
        ws.Range("B1") = "Text"
        
        '~~> Open XML as binary for faster reading in one go
        Open strFile For Binary As #1
        MyData = Space$(LOF(1))
        Get #1, , MyData
        Close #1
        
        '~~> Split the data and store it in an array
        strData() = Split(MyData, vbCrLf)
        
        '~~> Get the next available row
        LastRow = ws.Range("A" & Rows.Count).End(xlUp).Row + 1
        
        '~~> Generate output WITH DUPLICATES
        For I = 0 To UBound(strData())
            If InStr(strData(I), "<node template=") Then
                temp = Split(strData(I), """")
                ws.Range("A" & LastRow) = strFile
                ws.Range("B" & LastRow) = temp(3)
                LastRow = LastRow + 1
            End If
        Next
        
        LastRow = ws.Range("A" & Rows.Count).End(xlUp).Row + 1
        
        '~~> Remove DUPLICATES
        ws.Range("$A$1:$B$" & LastRow).RemoveDuplicates Columns:=Array(1, 2), Header _
        :=xlYes
        
        ws.Cells.EntireColumn.AutoFit
        
        Application.ScreenUpdating = True
        Application.DisplayAlerts = True
    
        strCurrentFile = Dir
    Loop
End Sub

Open in new window

0
Comment
Question by:billium99
  • 11
  • 6
  • 4
  • +1
24 Comments
 
LVL 17

Expert Comment

by:JezWalters
Comment Utility
Have you thought of using the Microsoft  XML library to do what you want?
0
 
LVL 15

Expert Comment

by:Simon Ball
Comment Utility
Dim J As Integer
        '~~> Generate output WITH DUPLICATES
        For I = 0 To UBound(strData())
            If InStr(strData(I), "<node template=") Then
                temp = Split(strData(I), """")
                ws.Range("A" & LastRow) = strFile
                ws.Range("B" & LastRow) = temp(3)
                'MsgBox UBound(temp())
                For J = 0 To UBound(temp())
                If InStr(temp(J), "CID") Then
                ws.Range("C" & LastRow) = temp(J + 1)
                J = UBound(temp())
                End If
                Next
                LastRow = LastRow + 1
            End If
        Next

Open in new window


replace you lines 37 to 45 with the above.

the loop now searches through the remaining text items split into the array temp() and if any of them = CID, it writes the following element of the array into column c.

here is my test output. what-Im-hoping-to-get.xlsm

i also had to change his loop to point to files named S in my download folder... so if you use the code embedded in the attached spreadsheet you will need to change that.

it could also do with an error trap to check if the name already exists in line 21.
0
 
LVL 17

Expert Comment

by:JezWalters
Comment Utility
Here's how to do what you're after using the MSXML2 library.

It's just my humble opinon, but this approach is likely to be the easiest to extend in future - and for your other related question too.
Public Sub ParseXML()

    ' Declare constants
    Const PATH As String = "C:\CQEE\LAR Logic Flows\test\"  ' XML file path

    ' Declare variables
    Dim attAttribute As MSXML2.IXMLDOMAttribute  ' Attribute
    Dim docDocument As MSXML2.DOMDocument60      ' Document
    Dim lngRow As Long                           ' Row
    Dim nodNode As MSXML2.IXMLDOMNode            ' Node node
    Dim nodTree As MSXML2.IXMLDOMNode            ' Tree node
    Dim strFile As String                        ' File
    Dim wksWorksheet As Excel.Worksheet          ' Worksheet

    ' Disable screen updates
    Application.ScreenUpdating = False
    Application.DisplayAlerts = False

    ' Process XML files
    strFile = Dir(PATH & "sample*.xml")
    Do Until strFile = ""

        ' Create worksheet
        Set wksWorksheet = Sheets.Add
        wksWorksheet.Name = Replace(strFile, ".xml", "", , , vbTextCompare)
        wksWorksheet.Range("A1") = "File"
        wksWorksheet.Range("B1") = "Text"
        wksWorksheet.Range("C1") = "CID"
        lngRow = 2

        ' Parse XML
        Set docDocument = New MSXML2.DOMDocument60
        docDocument.async = False
        docDocument.setProperty "ProhibitDTD", False
        docDocument.validateOnParse = False
        If docDocument.Load(PATH & strFile) Then

            ' Find 'tree' node
            For Each nodTree In docDocument.childNodes
                If nodTree.baseName = "tree" Then

                    ' Process 'node' nodes
                    For Each nodNode In nodTree.childNodes
                        wksWorksheet.Range("A" & lngRow) = PATH & strFile

                        ' Process attributes
                        For Each attAttribute In nodNode.attributes
                            Select Case attAttribute.baseName
                                Case "text"
                                    wksWorksheet.Range("B" & lngRow) = attAttribute.Value
                                Case "CID"
                                    wksWorksheet.Range("C" & lngRow) = attAttribute.Value
                            End Select
                        Next
                        lngRow = lngRow + 1
                    Next
                End If
            Next
        End If
        Set docDocument = Nothing

        ' Remove duplicates
        wksWorksheet.Range("$A$1:$C$" & lngRow).RemoveDuplicates , xlYes
        wksWorksheet.Cells.EntireColumn.AutoFit

        ' Get next XML file
        strFile = Dir
    Loop

    ' Enable screen updates
    Application.ScreenUpdating = True
    Application.DisplayAlerts = True

End Sub

Open in new window

0
 
LVL 17

Expert Comment

by:JezWalters
Comment Utility
And here's one I made earlier!  :-)
what-Im-hoping-to-get.xlsm
0
 
LVL 45

Expert Comment

by:aikimark
Comment Utility
If this is a large XML file, you should first populate an array variable and then populate your worksheet.  It will be much faster.

http://www.experts-exchange.com/A_2253-Fast-Data-Push-to-Excel.html
0
 
LVL 1

Author Comment

by:billium99
Comment Utility
Thanks guys - Sudonim - I think you only pasted a small section of your code.

Jez, your solution does not seem to strip out duplicates. If it makes any difference, it is not necessary to line up CID values with their corresponding node text - they will be two separate columns dealt with independently, but manually removing duplicates will be prohibitive once I run this on the full directory.

Is that an easy tweak?

Thanks again!

Bill
0
 
LVL 17

Expert Comment

by:JezWalters
Comment Utility
Are you saying that you want to handle duplicate Text and duplicate CID values independently?
0
 
LVL 17

Expert Comment

by:JezWalters
Comment Utility
Can you post the results you're expecting for the sample XML files you've already posted?
0
 
LVL 1

Author Comment

by:billium99
Comment Utility
Yes correct - duplicates should be avoided in both columns for each tab but they will be treated as separate columns of data - there is no need to keep a CID value in row 5 because it corresponds to the text value.

And the desired output was in the original zip file attached to my question. It removed duplicates, but otherwise look sexactly like the output your code creates.

Thanks

Bill
0
 
LVL 17

Expert Comment

by:JezWalters
Comment Utility
RemoveDuplicates removes entire rows, so you can't use it if Text and CID values are in separate columns of the same row - otherwise a duplicate CID causes non-duplicate text to be deleted!  :-(
0
 
LVL 17

Expert Comment

by:JezWalters
Comment Utility
What output would you expect after parsing the following node text/CID values:
Text   CID
----   ---
Text1  CID1
Text2  CID1
Text2  CID2

Open in new window

0
 
LVL 17

Expert Comment

by:JezWalters
Comment Utility
Your "sample2.xml" file has CID values c01456783 and c01895783, but your "what_Im_hoping_to_get.xslm" spreadsheet only lists c01456783 in the sample2 worksheet.

Is this just an oversight?
0
Top 6 Sources for Identifying Threat Actor TTPs

Understanding your enemy is essential. These six sources will help you identify the most popular threat actor tactics, techniques, and procedures (TTPs).

 
LVL 15

Assisted Solution

by:Simon Ball
Simon Ball earned 150 total points
Comment Utility
I only posted a part of the code and told you which lines you needed to change.

drop my code sample into your current system, replacing your lines of that for loop with mine.


Private Sub CommandButton1_Click()
    Dim strPath As String, strCurrentFile As String
    Dim MyData As String, strData() As String, strFile As String
    Dim temp() As String
    Dim I As Long, LastRow As Long
    Dim ws As Worksheet
    
    '~~> Change the path to the folder where the XML's are stored
    strPath = "C:\Temp\"
    
    Application.ScreenUpdating = False
    Application.DisplayAlerts = False
    
    strCurrentFile = Dir(strPath & "c*.xml")
    
    '~~> Loop through the folder to get the xml files
    Do While strCurrentFile <> ""
        strFile = strPath & strCurrentFile
        
        Set ws = Sheets.Add
        ws.Name = Replace(strCurrentFile, ".xml", "", , , vbTextCompare)
        ws.Range("A1") = "File"
        ws.Range("B1") = "Text"
        
        '~~> Open XML as binary for faster reading in one go
        Open strFile For Binary As #1
        MyData = Space$(LOF(1))
        Get #1, , MyData
        Close #1
        
        '~~> Split the data and store it in an array
        strData() = Split(MyData, vbCrLf)
        
        '~~> Get the next available row
        LastRow = ws.Range("A" & Rows.Count).End(xlUp).Row + 1
        
Dim J As Integer
        '~~> Generate output WITH DUPLICATES
        For I = 0 To UBound(strData())
            If InStr(strData(I), "<node template=") Then
                temp = Split(strData(I), """")
                ws.Range("A" & LastRow) = strFile
                ws.Range("B" & LastRow) = temp(3)
                'MsgBox UBound(temp())
                For J = 0 To UBound(temp())
                If InStr(temp(J), "CID") Then
                ws.Range("C" & LastRow) = temp(J + 1)
                J = UBound(temp())
                End If
                Next
                LastRow = LastRow + 1
            End If
        Next
        
        LastRow = ws.Range("A" & Rows.Count).End(xlUp).Row + 1
        
        '~~> Remove DUPLICATES
        ws.Range("$A$1:$B$" & LastRow).RemoveDuplicates Columns:=Array(1, 2), Header _
        :=xlYes
        
        ws.Cells.EntireColumn.AutoFit
        
        Application.ScreenUpdating = True
        Application.DisplayAlerts = True
    
        strCurrentFile = Dir
    Loop
End Sub

Open in new window

0
 
LVL 15

Expert Comment

by:Simon Ball
Comment Utility
JW's solution is more elegant, but mine answers the question you asked.

I also provided a version of your spreadsheet with my code in it.
0
 
LVL 17

Expert Comment

by:JezWalters
Comment Utility
No offense to Sudonim, but I don't believe his solution addresses the duplicate issue ...
0
 
LVL 45

Expert Comment

by:aikimark
Comment Utility
@Bill

Since the worksheet name is the name of the XML file (minus extension), the first column (full path) seems redundant.

How do you need to handle duplicate text and duplicate CID values?
  - are they only duplicates if both text and CID values are the same?
  - are null/empty CID values ignored?
  - which duplicate entry takes precedence?
0
 
LVL 17

Expert Comment

by:JezWalters
Comment Utility
Does this give you the behaviour you're expecting?
Option Explicit

Public Sub ParseXML()

    ' Declare constants
    Const PATH As String = "C:\CQEE\LAR Logic Flows\test\"  ' XML file path

    ' Declare variables
    Dim attAttribute As MSXML2.IXMLDOMAttribute  ' Attribute
    Dim colCIDs As Collection                    ' CID values
    Dim colTexts As Collection                   ' Text values
    Dim docDocument As MSXML2.DOMDocument60      ' Document
    Dim lngCIDRow As Long                        ' CID row
    Dim lngTextRow As Long                       ' Text row
    Dim nodNode As MSXML2.IXMLDOMNode            ' Node node
    Dim nodTree As MSXML2.IXMLDOMNode            ' Tree node
    Dim strFile As String                        ' File
    Dim varValue As Variant                      ' Value
    Dim wksWorksheet As Excel.Worksheet          ' Worksheet

    ' Disable screen updates
    Application.ScreenUpdating = False
    Application.DisplayAlerts = False

    ' Process XML files
    strFile = Dir(PATH & "sample*.xml")
    Do Until strFile = ""

        ' Create worksheet
        Set wksWorksheet = Sheets.Add
        wksWorksheet.Name = Replace(strFile, ".xml", "", , , vbTextCompare)
        wksWorksheet.Range("A1") = "File"
        wksWorksheet.Range("B1") = "Text"
        wksWorksheet.Range("C1") = "CID"

        lngTextRow = 2  ' Skip header row
        lngCIDRow = 2   ' Skip header row
        Set colTexts = New Collection
        Set colCIDs = New Collection

        ' Parse XML
        Set docDocument = New MSXML2.DOMDocument60
        docDocument.async = False
        docDocument.setProperty "ProhibitDTD", False
        docDocument.validateOnParse = False
        If docDocument.Load(PATH & strFile) Then

            ' Find 'tree' node
            For Each nodTree In docDocument.childNodes
                If nodTree.baseName = "tree" Then

                    ' Process 'node' nodes
                    For Each nodNode In nodTree.childNodes
                        If nodNode.baseName = "node" Then

                            ' Process attributes
                            For Each attAttribute In nodNode.attributes
                                Select Case attAttribute.baseName

                                    ' Add text value, if unique
                                    Case "text"
                                        varValue = Null
                                        On Error Resume Next  ' Ignore missing value
                                        varValue = colTexts(attAttribute.Value)
                                        On Error GoTo 0
                                        If Nz(attAttribute.Value) <> "" And _
                                           IsNull(varValue) Then  ' Skip empty or existing values
                                            wksWorksheet.Range("A" & lngTextRow) = PATH & strFile
                                            wksWorksheet.Range("B" & lngTextRow) = attAttribute.Value
                                            colTexts.Add attAttribute.Value, attAttribute.Value
                                            lngTextRow = lngTextRow + 1
                                        End If

                                    ' Add CID value, if unique
                                    Case "CID"
                                        varValue = Null
                                        On Error Resume Next  ' Ignore missing value
                                        varValue = colCIDs(attAttribute.Value)
                                        On Error GoTo 0
                                        If Nz(attAttribute.Value) <> "" And _
                                           IsNull(varValue) Then  ' Skip empty or existing values
                                            wksWorksheet.Range("A" & lngCIDRow) = PATH & strFile
                                            wksWorksheet.Range("C" & lngCIDRow) = attAttribute.Value
                                            colCIDs.Add attAttribute.Value, attAttribute.Value
                                            lngCIDRow = lngCIDRow + 1
                                        End If
                                End Select
                                Set attAttribute = Nothing
                            Next
                        End If
                        Set nodNode = Nothing
                    Next
                End If
                Set nodTree = Nothing
            Next
        End If
        Set docDocument = Nothing
        Set colTexts = Nothing
        Set colCIDs = Nothing

        ' Resize columns
        wksWorksheet.Cells.EntireColumn.AutoFit
        Set wksWorksheet = Nothing

        ' Get next XML file
        strFile = Dir
    Loop

    ' Enable screen updates
    Application.ScreenUpdating = True
    Application.DisplayAlerts = True

End Sub

Open in new window

0
 
LVL 1

Author Comment

by:billium99
Comment Utility
WOw - lots of responses - thanks for all the consideration gang!

JW - your last macro is generating an error - sub not defined on this line:

 If Nz(attAttribute.Value) <> "" And _

Also you asked:

Your "sample2.xml" file has CID values c01456783 and c01895783, but your "what_Im_hoping_to_get.xslm" spreadsheet only lists c01456783 in the sample2 worksheet.

Is this just an oversight?

Yes it is - I was trying to show that we don't include the same value twice.

Aikimark - yes the first column is most likely redundant, but could be used in a pinch to identify precisely where the XML file is when I reverse this process and try to repopulate the XML with translated spreadsheet values.

Bill

0
 
LVL 45

Expert Comment

by:aikimark
Comment Utility
@Bill

>>..reverse this process and try to repopulate the XML with translated spreadsheet values.

I think you'd better describe this ASAP.  Since you aren't bringing in all the data, some data will be lost.  Also, there might be a MUCH simpler way to accomplish your end goal.
0
 
LVL 1

Author Comment

by:billium99
Comment Utility
Aikimark - I have a separate question started wuth regard to my last point - and as long as that function replaces all matching instances, then it doesn't matter if I have removed duplicates - as long as I have one instance. At least that's my thought.

Feel free to check out http:Q_26888950.html although I think there are already two solutions. I just need to get this first part right to have a legitimate test file to verify both stages are working.

Awesome help on these topics though. Thank for your time!

Bill
0
 
LVL 45

Expert Comment

by:aikimark
Comment Utility
@Bill

It is customary to get a solution to each part of your question/problem before continuing with the next part.  Things change.

Also, you will see an Ask a Related Question link once you arrive at a solution in a thread.
0
 
LVL 1

Author Comment

by:billium99
Comment Utility
JW - any insight on the error with Nz?
0
 
LVL 17

Accepted Solution

by:
JezWalters earned 350 total points
Comment Utility
Humble apologies, I'm too used to programming for Access!  :-)  This should do the trick:
Option Explicit

Public Sub ParseXML()

    ' Declare constants
    Const PATH As String = "C:\CQEE\LAR Logic Flows\test\"  ' XML file path

    ' Declare variables
    Dim attAttribute As MSXML2.IXMLDOMAttribute  ' Attribute
    Dim colCIDs As Collection                    ' CID values
    Dim colTexts As Collection                   ' Text values
    Dim docDocument As MSXML2.DOMDocument60      ' Document
    Dim lngCIDRow As Long                        ' CID row
    Dim lngTextRow As Long                       ' Text row
    Dim nodNode As MSXML2.IXMLDOMNode            ' Node node
    Dim nodTree As MSXML2.IXMLDOMNode            ' Tree node
    Dim strFile As String                        ' File
    Dim varValue As Variant                      ' Value
    Dim wksWorksheet As Excel.Worksheet          ' Worksheet

    ' Disable screen updates
    Application.ScreenUpdating = False
    Application.DisplayAlerts = False

    ' Process XML files
    strFile = Dir(PATH & "sample*.xml")
    Do Until strFile = ""

        ' Create worksheet
        Set wksWorksheet = Sheets.Add
        wksWorksheet.Name = Replace(strFile, ".xml", "", , , vbTextCompare)
        wksWorksheet.Range("A1") = "File"
        wksWorksheet.Range("B1") = "Text"
        wksWorksheet.Range("C1") = "CID"

        lngTextRow = 2  ' Skip header row
        lngCIDRow = 2   ' Skip header row
        Set colTexts = New Collection
        Set colCIDs = New Collection

        ' Parse XML
        Set docDocument = New MSXML2.DOMDocument60
        docDocument.async = False
        docDocument.setProperty "ProhibitDTD", False
        docDocument.validateOnParse = False
        If docDocument.Load(PATH & strFile) Then

            ' Find 'tree' node
            For Each nodTree In docDocument.childNodes
                If nodTree.baseName = "tree" Then

                    ' Process 'node' nodes
                    For Each nodNode In nodTree.childNodes
                        If nodNode.baseName = "node" Then

                            ' Process attributes
                            For Each attAttribute In nodNode.attributes
                                Select Case attAttribute.baseName

                                    ' Add text value, if unique
                                    Case "text"
                                        varValue = Null
                                        On Error Resume Next  ' Ignore missing value
                                        varValue = colTexts(attAttribute.Value)
                                        On Error GoTo 0
                                        If attAttribute.Value <> "" And _
                                           IsNull(varValue) Then  ' Skip empty or existing values
                                            wksWorksheet.Range("A" & lngTextRow) = PATH & strFile
                                            wksWorksheet.Range("B" & lngTextRow) = attAttribute.Value
                                            colTexts.Add attAttribute.Value, attAttribute.Value
                                            lngTextRow = lngTextRow + 1
                                        End If

                                    ' Add CID value, if unique
                                    Case "CID"
                                        varValue = Null
                                        On Error Resume Next  ' Ignore missing value
                                        varValue = colCIDs(attAttribute.Value)
                                        On Error GoTo 0
                                        If attAttribute.Value <> "" And _
                                           IsNull(varValue) Then  ' Skip empty or existing values
                                            wksWorksheet.Range("A" & lngCIDRow) = PATH & strFile
                                            wksWorksheet.Range("C" & lngCIDRow) = attAttribute.Value
                                            colCIDs.Add attAttribute.Value, attAttribute.Value
                                            lngCIDRow = lngCIDRow + 1
                                        End If
                                End Select
                                Set attAttribute = Nothing
                            Next
                        End If
                        Set nodNode = Nothing
                    Next
                End If
                Set nodTree = Nothing
            Next
        End If
        Set docDocument = Nothing
        Set colTexts = Nothing
        Set colCIDs = Nothing

        ' Resize columns
        wksWorksheet.Cells.EntireColumn.AutoFit
        Set wksWorksheet = Nothing

        ' Get next XML file
        strFile = Dir
    Loop

    ' Enable screen updates
    Application.ScreenUpdating = True
    Application.DisplayAlerts = True

End Sub

Open in new window

0
 
LVL 1

Author Closing Comment

by:billium99
Comment Utility
Thanks for the help!
0

Featured Post

How to improve team productivity

Quip adds documents, spreadsheets, and tasklists to your Slack experience
- Elevate ideas to Quip docs
- Share Quip docs in Slack
- Get notified of changes to your docs
- Available on iOS/Android/Desktop/Web
- Online/Offline

Join & Write a Comment

Approximate matching with VLOOKUP and MATCH seems to me to be a greatly under-used technique, and one which is vital for getting good performance out of large lookups. Until recently I would always have advised using an exact match for simplicity an…
This article descibes how to create a connection between Excel and SAP and how to move data from Excel to SAP or the other way around.
This Micro Tutorial will demonstrate the scrolling table in Microsoft Excel using the INDEX function.
This Micro Tutorial will demonstrate how to use a scrolling table in Microsoft Excel using the INDEX function.

772 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

11 Experts available now in Live!

Get 1:1 Help Now