asked on

Excel Macro to grab multiple data elements from XML

Hi,

A few Experts have tried to help me with a data manipulation challenge.

In http:Q_26804341.html Sid created a macro that would extract the text element (text=) from each node of a directory of XML files. It would then split out the unique values into separate tabs of an Excel workbook with all unique values listed. This worked great.

Then I needed to perform the same operation (find and list unique values in a spreadsheet) for a different data element in a directory of XML files. This time it was CID=, found in many nodes but not all.

The text= values in each node is the first data element in each node. The CID= values, if present are the 7th data element within the nodes they appear.

I realized that Sid's solution didn't seem to be looking for anything called "text=", so when I went to modify the macro for "CID=", I was stumped.

So on http:Q_26820317.html I asked for assistance pulling out a different value. The accepted solution was supposed to just pull out all elements into different columns of a spreadsheet, but I had to leave this project for nearly two weeks and I had to accept the final solution without testing (I hate leaving experts hanging...)

Long story short, it doesn't work. Nothing seems to happen when I run the macro. But more importantly, it was really trying to do more than I needed. I need two values - "text=" and "CID=" pulled from the XML files, duplicates removed and displayed in columns, preferably just like the first macro from the first question, but just have another column with the CID unique values.

If someone can assist me with macro modifications, I would appreciate some assistance. I'm happy to break things up over multiple questions for more points. I'm just in over my head here and still need a solution.

I've attached two XML sample files, to mimic the directory of numerous files.

And I'm pasting the original code, which is closest to the out put I need, but just don't know of a way to change from "text=" to "CID=" as the data element from which to pull unique values.

XML-to-Excel-samples.zip

Thanks

Bill

Private Sub CommandButton1_Click()
    Dim strPath As String, strCurrentFile As String
    Dim MyData As String, strData() As String, strFile As String
    Dim temp() As String
    Dim I As Long, LastRow As Long
    Dim ws As Worksheet
    
    '~~> Change the path to the folder where the XML's are stored
    strPath = "C:\Temp\"
    
    Application.ScreenUpdating = False
    Application.DisplayAlerts = False
    
    strCurrentFile = Dir(strPath & "c*.xml")
    
    '~~> Loop through the folder to get the xml files
    Do While strCurrentFile <> ""
        strFile = strPath & strCurrentFile
        
        Set ws = Sheets.Add
        ws.Name = Replace(strCurrentFile, ".xml", "", , , vbTextCompare)
        ws.Range("A1") = "File"
        ws.Range("B1") = "Text"
        
        '~~> Open XML as binary for faster reading in one go
        Open strFile For Binary As #1
        MyData = Space$(LOF(1))
        Get #1, , MyData
        Close #1
        
        '~~> Split the data and store it in an array
        strData() = Split(MyData, vbCrLf)
        
        '~~> Get the next available row
        LastRow = ws.Range("A" & Rows.Count).End(xlUp).Row + 1
        
        '~~> Generate output WITH DUPLICATES
        For I = 0 To UBound(strData())
            If InStr(strData(I), "<node template=") Then
                temp = Split(strData(I), """")
                ws.Range("A" & LastRow) = strFile
                ws.Range("B" & LastRow) = temp(3)
                LastRow = LastRow + 1
            End If
        Next
        
        LastRow = ws.Range("A" & Rows.Count).End(xlUp).Row + 1
        
        '~~> Remove DUPLICATES
        ws.Range("$A$1:$B$" & LastRow).RemoveDuplicates Columns:=Array(1, 2), Header _
        :=xlYes
        
        ws.Cells.EntireColumn.AutoFit
        
        Application.ScreenUpdating = True
        Application.DisplayAlerts = True
    
        strCurrentFile = Dir
    Loop
End Sub

Open in new window

Jez Walters

Have you thought of using the Microsoft XML library to do what you want?

Si Ball

Dim J As Integer
        '~~> Generate output WITH DUPLICATES
        For I = 0 To UBound(strData())
            If InStr(strData(I), "<node template=") Then
                temp = Split(strData(I), """")
                ws.Range("A" & LastRow) = strFile
                ws.Range("B" & LastRow) = temp(3)
                'MsgBox UBound(temp())
                For J = 0 To UBound(temp())
                If InStr(temp(J), "CID") Then
                ws.Range("C" & LastRow) = temp(J + 1)
                J = UBound(temp())
                End If
                Next
                LastRow = LastRow + 1
            End If
        Next

Open in new window

replace you lines 37 to 45 with the above.

the loop now searches through the remaining text items split into the array temp() and if any of them = CID, it writes the following element of the array into column c.

here is my test output. what-Im-hoping-to-get.xlsm

i also had to change his loop to point to files named S in my download folder... so if you use the code embedded in the attached spreadsheet you will need to change that.

it could also do with an error trap to check if the name already exists in line 21.

Jez Walters

Here's how to do what you're after using the MSXML2 library.

It's just my humble opinon, but this approach is likely to be the easiest to extend in future - and for your other related question too.

Public Sub ParseXML()

    ' Declare constants
    Const PATH As String = "C:\CQEE\LAR Logic Flows\test\"  ' XML file path

    ' Declare variables
    Dim attAttribute As MSXML2.IXMLDOMAttribute  ' Attribute
    Dim docDocument As MSXML2.DOMDocument60      ' Document
    Dim lngRow As Long                           ' Row
    Dim nodNode As MSXML2.IXMLDOMNode            ' Node node
    Dim nodTree As MSXML2.IXMLDOMNode            ' Tree node
    Dim strFile As String                        ' File
    Dim wksWorksheet As Excel.Worksheet          ' Worksheet

    ' Disable screen updates
    Application.ScreenUpdating = False
    Application.DisplayAlerts = False

    ' Process XML files
    strFile = Dir(PATH & "sample*.xml")
    Do Until strFile = ""

        ' Create worksheet
        Set wksWorksheet = Sheets.Add
        wksWorksheet.Name = Replace(strFile, ".xml", "", , , vbTextCompare)
        wksWorksheet.Range("A1") = "File"
        wksWorksheet.Range("B1") = "Text"
        wksWorksheet.Range("C1") = "CID"
        lngRow = 2

        ' Parse XML
        Set docDocument = New MSXML2.DOMDocument60
        docDocument.async = False
        docDocument.setProperty "ProhibitDTD", False
        docDocument.validateOnParse = False
        If docDocument.Load(PATH & strFile) Then

            ' Find 'tree' node
            For Each nodTree In docDocument.childNodes
                If nodTree.baseName = "tree" Then

                    ' Process 'node' nodes
                    For Each nodNode In nodTree.childNodes
                        wksWorksheet.Range("A" & lngRow) = PATH & strFile

                        ' Process attributes
                        For Each attAttribute In nodNode.attributes
                            Select Case attAttribute.baseName
                                Case "text"
                                    wksWorksheet.Range("B" & lngRow) = attAttribute.Value
                                Case "CID"
                                    wksWorksheet.Range("C" & lngRow) = attAttribute.Value
                            End Select
                        Next
                        lngRow = lngRow + 1
                    Next
                End If
            Next
        End If
        Set docDocument = Nothing

        ' Remove duplicates
        wksWorksheet.Range("$A$1:$C$" & lngRow).RemoveDuplicates , xlYes
        wksWorksheet.Cells.EntireColumn.AutoFit

        ' Get next XML file
        strFile = Dir
    Loop

    ' Enable screen updates
    Application.ScreenUpdating = True
    Application.DisplayAlerts = True

End Sub

Open in new window

Jez Walters

And here's one I made earlier! :-)
what-Im-hoping-to-get.xlsm

aikimark

If this is a large XML file, you should first populate an array variable and then populate your worksheet. It will be much faster.

https://www.experts-exchange.com/A_2253-Fast-Data-Push-to-Excel.html

Bill Henderson

ASKER

Thanks guys - Sudonim - I think you only pasted a small section of your code.

Jez, your solution does not seem to strip out duplicates. If it makes any difference, it is not necessary to line up CID values with their corresponding node text - they will be two separate columns dealt with independently, but manually removing duplicates will be prohibitive once I run this on the full directory.

Is that an easy tweak?

Thanks again!

Bill

Jez Walters

Are you saying that you want to handle duplicate Text and duplicate CID values independently?

Jez Walters

Can you post the results you're expecting for the sample XML files you've already posted?

Bill Henderson

ASKER

Yes correct - duplicates should be avoided in both columns for each tab but they will be treated as separate columns of data - there is no need to keep a CID value in row 5 because it corresponds to the text value.

And the desired output was in the original zip file attached to my question. It removed duplicates, but otherwise look sexactly like the output your code creates.

Thanks

Bill

Jez Walters

RemoveDuplicates removes entire rows, so you can't use it if Text and CID values are in separate columns of the same row - otherwise a duplicate CID causes non-duplicate text to be deleted! :-(

Jez Walters

What output would you expect after parsing the following node text/CID values:

Text   CID
----   ---
Text1  CID1
Text2  CID1
Text2  CID2

Open in new window

Jez Walters

Your "sample2.xml" file has CID values c01456783 and c01895783, but your "what_Im_hoping_to_get.xslm" spreadsheet only lists c01456783 in the sample2 worksheet.

Is this just an oversight?

SOLUTION

Si Ball

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

Si Ball

JW's solution is more elegant, but mine answers the question you asked.

I also provided a version of your spreadsheet with my code in it.

Jez Walters

No offense to Sudonim, but I don't believe his solution addresses the duplicate issue ...

aikimark

@Bill

Since the worksheet name is the name of the XML file (minus extension), the first column (full path) seems redundant.

How do you need to handle duplicate text and duplicate CID values?
- are they only duplicates if both text and CID values are the same?
- are null/empty CID values ignored?
- which duplicate entry takes precedence?

Jez Walters

Does this give you the behaviour you're expecting?

Option Explicit

Public Sub ParseXML()

    ' Declare constants
    Const PATH As String = "C:\CQEE\LAR Logic Flows\test\"  ' XML file path

    ' Declare variables
    Dim attAttribute As MSXML2.IXMLDOMAttribute  ' Attribute
    Dim colCIDs As Collection                    ' CID values
    Dim colTexts As Collection                   ' Text values
    Dim docDocument As MSXML2.DOMDocument60      ' Document
    Dim lngCIDRow As Long                        ' CID row
    Dim lngTextRow As Long                       ' Text row
    Dim nodNode As MSXML2.IXMLDOMNode            ' Node node
    Dim nodTree As MSXML2.IXMLDOMNode            ' Tree node
    Dim strFile As String                        ' File
    Dim varValue As Variant                      ' Value
    Dim wksWorksheet As Excel.Worksheet          ' Worksheet

    ' Disable screen updates
    Application.ScreenUpdating = False
    Application.DisplayAlerts = False

    ' Process XML files
    strFile = Dir(PATH & "sample*.xml")
    Do Until strFile = ""

        ' Create worksheet
        Set wksWorksheet = Sheets.Add
        wksWorksheet.Name = Replace(strFile, ".xml", "", , , vbTextCompare)
        wksWorksheet.Range("A1") = "File"
        wksWorksheet.Range("B1") = "Text"
        wksWorksheet.Range("C1") = "CID"

        lngTextRow = 2  ' Skip header row
        lngCIDRow = 2   ' Skip header row
        Set colTexts = New Collection
        Set colCIDs = New Collection

        ' Parse XML
        Set docDocument = New MSXML2.DOMDocument60
        docDocument.async = False
        docDocument.setProperty "ProhibitDTD", False
        docDocument.validateOnParse = False
        If docDocument.Load(PATH & strFile) Then

            ' Find 'tree' node
            For Each nodTree In docDocument.childNodes
                If nodTree.baseName = "tree" Then

                    ' Process 'node' nodes
                    For Each nodNode In nodTree.childNodes
                        If nodNode.baseName = "node" Then

                            ' Process attributes
                            For Each attAttribute In nodNode.attributes
                                Select Case attAttribute.baseName

                                    ' Add text value, if unique
                                    Case "text"
                                        varValue = Null
                                        On Error Resume Next  ' Ignore missing value
                                        varValue = colTexts(attAttribute.Value)
                                        On Error GoTo 0
                                        If Nz(attAttribute.Value) <> "" And _
                                           IsNull(varValue) Then  ' Skip empty or existing values
                                            wksWorksheet.Range("A" & lngTextRow) = PATH & strFile
                                            wksWorksheet.Range("B" & lngTextRow) = attAttribute.Value
                                            colTexts.Add attAttribute.Value, attAttribute.Value
                                            lngTextRow = lngTextRow + 1
                                        End If

                                    ' Add CID value, if unique
                                    Case "CID"
                                        varValue = Null
                                        On Error Resume Next  ' Ignore missing value
                                        varValue = colCIDs(attAttribute.Value)
                                        On Error GoTo 0
                                        If Nz(attAttribute.Value) <> "" And _
                                           IsNull(varValue) Then  ' Skip empty or existing values
                                            wksWorksheet.Range("A" & lngCIDRow) = PATH & strFile
                                            wksWorksheet.Range("C" & lngCIDRow) = attAttribute.Value
                                            colCIDs.Add attAttribute.Value, attAttribute.Value
                                            lngCIDRow = lngCIDRow + 1
                                        End If
                                End Select
                                Set attAttribute = Nothing
                            Next
                        End If
                        Set nodNode = Nothing
                    Next
                End If
                Set nodTree = Nothing
            Next
        End If
        Set docDocument = Nothing
        Set colTexts = Nothing
        Set colCIDs = Nothing

        ' Resize columns
        wksWorksheet.Cells.EntireColumn.AutoFit
        Set wksWorksheet = Nothing

        ' Get next XML file
        strFile = Dir
    Loop

    ' Enable screen updates
    Application.ScreenUpdating = True
    Application.DisplayAlerts = True

End Sub

Open in new window

Bill Henderson

ASKER

WOw - lots of responses - thanks for all the consideration gang!

JW - your last macro is generating an error - sub not defined on this line:

If Nz(attAttribute.Value) <> "" And _

Also you asked:

Your "sample2.xml" file has CID values c01456783 and c01895783, but your "what_Im_hoping_to_get.xslm" spreadsheet only lists c01456783 in the sample2 worksheet.

Is this just an oversight?

Yes it is - I was trying to show that we don't include the same value twice.

Aikimark - yes the first column is most likely redundant, but could be used in a pinch to identify precisely where the XML file is when I reverse this process and try to repopulate the XML with translated spreadsheet values.

Bill

aikimark

@Bill

>>..reverse this process and try to repopulate the XML with translated spreadsheet values.

I think you'd better describe this ASAP. Since you aren't bringing in all the data, some data will be lost. Also, there might be a MUCH simpler way to accomplish your end goal.

Bill Henderson

ASKER

Aikimark - I have a separate question started wuth regard to my last point - and as long as that function replaces all matching instances, then it doesn't matter if I have removed duplicates - as long as I have one instance. At least that's my thought.

Feel free to check out http:Q_26888950.html although I think there are already two solutions. I just need to get this first part right to have a legitimate test file to verify both stages are working.

Awesome help on these topics though. Thank for your time!

Bill

aikimark

@Bill

It is customary to get a solution to each part of your question/problem before continuing with the next part. Things change.

Also, you will see an Ask a Related Question link once you arrive at a solution in a thread.

Bill Henderson

ASKER

JW - any insight on the error with Nz?

ASKER CERTIFIED SOLUTION

Jez Walters

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

Bill Henderson

ASKER

Thanks for the help!