Bill Henderson
asked on
Excel Macro to grab multiple data elements from XML
Hi,
A few Experts have tried to help me with a data manipulation challenge.
In http:Q_26804341.html Sid created a macro that would extract the text element (text=) from each node of a directory of XML files. It would then split out the unique values into separate tabs of an Excel workbook with all unique values listed. This worked great.
Then I needed to perform the same operation (find and list unique values in a spreadsheet) for a different data element in a directory of XML files. This time it was CID=, found in many nodes but not all.
The text= values in each node is the first data element in each node. The CID= values, if present are the 7th data element within the nodes they appear.
I realized that Sid's solution didn't seem to be looking for anything called "text=", so when I went to modify the macro for "CID=", I was stumped.
So on http:Q_26820317.html I asked for assistance pulling out a different value. The accepted solution was supposed to just pull out all elements into different columns of a spreadsheet, but I had to leave this project for nearly two weeks and I had to accept the final solution without testing (I hate leaving experts hanging...)
Long story short, it doesn't work. Nothing seems to happen when I run the macro. But more importantly, it was really trying to do more than I needed. I need two values - "text=" and "CID=" pulled from the XML files, duplicates removed and displayed in columns, preferably just like the first macro from the first question, but just have another column with the CID unique values.
If someone can assist me with macro modifications, I would appreciate some assistance. I'm happy to break things up over multiple questions for more points. I'm just in over my head here and still need a solution.
I've attached two XML sample files, to mimic the directory of numerous files.
And I'm pasting the original code, which is closest to the out put I need, but just don't know of a way to change from "text=" to "CID=" as the data element from which to pull unique values.
XML-to-Excel-samples.zip
Thanks
Bill
A few Experts have tried to help me with a data manipulation challenge.
In http:Q_26804341.html Sid created a macro that would extract the text element (text=) from each node of a directory of XML files. It would then split out the unique values into separate tabs of an Excel workbook with all unique values listed. This worked great.
Then I needed to perform the same operation (find and list unique values in a spreadsheet) for a different data element in a directory of XML files. This time it was CID=, found in many nodes but not all.
The text= values in each node is the first data element in each node. The CID= values, if present are the 7th data element within the nodes they appear.
I realized that Sid's solution didn't seem to be looking for anything called "text=", so when I went to modify the macro for "CID=", I was stumped.
So on http:Q_26820317.html I asked for assistance pulling out a different value. The accepted solution was supposed to just pull out all elements into different columns of a spreadsheet, but I had to leave this project for nearly two weeks and I had to accept the final solution without testing (I hate leaving experts hanging...)
Long story short, it doesn't work. Nothing seems to happen when I run the macro. But more importantly, it was really trying to do more than I needed. I need two values - "text=" and "CID=" pulled from the XML files, duplicates removed and displayed in columns, preferably just like the first macro from the first question, but just have another column with the CID unique values.
If someone can assist me with macro modifications, I would appreciate some assistance. I'm happy to break things up over multiple questions for more points. I'm just in over my head here and still need a solution.
I've attached two XML sample files, to mimic the directory of numerous files.
And I'm pasting the original code, which is closest to the out put I need, but just don't know of a way to change from "text=" to "CID=" as the data element from which to pull unique values.
XML-to-Excel-samples.zip
Thanks
Bill
Private Sub CommandButton1_Click()
Dim strPath As String, strCurrentFile As String
Dim MyData As String, strData() As String, strFile As String
Dim temp() As String
Dim I As Long, LastRow As Long
Dim ws As Worksheet
'~~> Change the path to the folder where the XML's are stored
strPath = "C:\Temp\"
Application.ScreenUpdating = False
Application.DisplayAlerts = False
strCurrentFile = Dir(strPath & "c*.xml")
'~~> Loop through the folder to get the xml files
Do While strCurrentFile <> ""
strFile = strPath & strCurrentFile
Set ws = Sheets.Add
ws.Name = Replace(strCurrentFile, ".xml", "", , , vbTextCompare)
ws.Range("A1") = "File"
ws.Range("B1") = "Text"
'~~> Open XML as binary for faster reading in one go
Open strFile For Binary As #1
MyData = Space$(LOF(1))
Get #1, , MyData
Close #1
'~~> Split the data and store it in an array
strData() = Split(MyData, vbCrLf)
'~~> Get the next available row
LastRow = ws.Range("A" & Rows.Count).End(xlUp).Row + 1
'~~> Generate output WITH DUPLICATES
For I = 0 To UBound(strData())
If InStr(strData(I), "<node template=") Then
temp = Split(strData(I), """")
ws.Range("A" & LastRow) = strFile
ws.Range("B" & LastRow) = temp(3)
LastRow = LastRow + 1
End If
Next
LastRow = ws.Range("A" & Rows.Count).End(xlUp).Row + 1
'~~> Remove DUPLICATES
ws.Range("$A$1:$B$" & LastRow).RemoveDuplicates Columns:=Array(1, 2), Header _
:=xlYes
ws.Cells.EntireColumn.AutoFit
Application.ScreenUpdating = True
Application.DisplayAlerts = True
strCurrentFile = Dir
Loop
End Sub
Have you thought of using the Microsoft XML library to do what you want?
Dim J As Integer
'~~> Generate output WITH DUPLICATES
For I = 0 To UBound(strData())
If InStr(strData(I), "<node template=") Then
temp = Split(strData(I), """")
ws.Range("A" & LastRow) = strFile
ws.Range("B" & LastRow) = temp(3)
'MsgBox UBound(temp())
For J = 0 To UBound(temp())
If InStr(temp(J), "CID") Then
ws.Range("C" & LastRow) = temp(J + 1)
J = UBound(temp())
End If
Next
LastRow = LastRow + 1
End If
Next
replace you lines 37 to 45 with the above.
the loop now searches through the remaining text items split into the array temp() and if any of them = CID, it writes the following element of the array into column c.
here is my test output. what-Im-hoping-to-get.xlsm
i also had to change his loop to point to files named S in my download folder... so if you use the code embedded in the attached spreadsheet you will need to change that.
it could also do with an error trap to check if the name already exists in line 21.
Here's how to do what you're after using the MSXML2 library.
It's just my humble opinon, but this approach is likely to be the easiest to extend in future - and for your other related question too.
It's just my humble opinon, but this approach is likely to be the easiest to extend in future - and for your other related question too.
Public Sub ParseXML()
' Declare constants
Const PATH As String = "C:\CQEE\LAR Logic Flows\test\" ' XML file path
' Declare variables
Dim attAttribute As MSXML2.IXMLDOMAttribute ' Attribute
Dim docDocument As MSXML2.DOMDocument60 ' Document
Dim lngRow As Long ' Row
Dim nodNode As MSXML2.IXMLDOMNode ' Node node
Dim nodTree As MSXML2.IXMLDOMNode ' Tree node
Dim strFile As String ' File
Dim wksWorksheet As Excel.Worksheet ' Worksheet
' Disable screen updates
Application.ScreenUpdating = False
Application.DisplayAlerts = False
' Process XML files
strFile = Dir(PATH & "sample*.xml")
Do Until strFile = ""
' Create worksheet
Set wksWorksheet = Sheets.Add
wksWorksheet.Name = Replace(strFile, ".xml", "", , , vbTextCompare)
wksWorksheet.Range("A1") = "File"
wksWorksheet.Range("B1") = "Text"
wksWorksheet.Range("C1") = "CID"
lngRow = 2
' Parse XML
Set docDocument = New MSXML2.DOMDocument60
docDocument.async = False
docDocument.setProperty "ProhibitDTD", False
docDocument.validateOnParse = False
If docDocument.Load(PATH & strFile) Then
' Find 'tree' node
For Each nodTree In docDocument.childNodes
If nodTree.baseName = "tree" Then
' Process 'node' nodes
For Each nodNode In nodTree.childNodes
wksWorksheet.Range("A" & lngRow) = PATH & strFile
' Process attributes
For Each attAttribute In nodNode.attributes
Select Case attAttribute.baseName
Case "text"
wksWorksheet.Range("B" & lngRow) = attAttribute.Value
Case "CID"
wksWorksheet.Range("C" & lngRow) = attAttribute.Value
End Select
Next
lngRow = lngRow + 1
Next
End If
Next
End If
Set docDocument = Nothing
' Remove duplicates
wksWorksheet.Range("$A$1:$C$" & lngRow).RemoveDuplicates , xlYes
wksWorksheet.Cells.EntireColumn.AutoFit
' Get next XML file
strFile = Dir
Loop
' Enable screen updates
Application.ScreenUpdating = True
Application.DisplayAlerts = True
End Sub
And here's one I made earlier! :-)
what-Im-hoping-to-get.xlsm
what-Im-hoping-to-get.xlsm
If this is a large XML file, you should first populate an array variable and then populate your worksheet. It will be much faster.
https://www.experts-exchange.com/A_2253-Fast-Data-Push-to-Excel.html
https://www.experts-exchange.com/A_2253-Fast-Data-Push-to-Excel.html
ASKER
Thanks guys - Sudonim - I think you only pasted a small section of your code.
Jez, your solution does not seem to strip out duplicates. If it makes any difference, it is not necessary to line up CID values with their corresponding node text - they will be two separate columns dealt with independently, but manually removing duplicates will be prohibitive once I run this on the full directory.
Is that an easy tweak?
Thanks again!
Bill
Jez, your solution does not seem to strip out duplicates. If it makes any difference, it is not necessary to line up CID values with their corresponding node text - they will be two separate columns dealt with independently, but manually removing duplicates will be prohibitive once I run this on the full directory.
Is that an easy tweak?
Thanks again!
Bill
Are you saying that you want to handle duplicate Text and duplicate CID values independently?
Can you post the results you're expecting for the sample XML files you've already posted?
ASKER
Yes correct - duplicates should be avoided in both columns for each tab but they will be treated as separate columns of data - there is no need to keep a CID value in row 5 because it corresponds to the text value.
And the desired output was in the original zip file attached to my question. It removed duplicates, but otherwise look sexactly like the output your code creates.
Thanks
Bill
And the desired output was in the original zip file attached to my question. It removed duplicates, but otherwise look sexactly like the output your code creates.
Thanks
Bill
RemoveDuplicates removes entire rows, so you can't use it if Text and CID values are in separate columns of the same row - otherwise a duplicate CID causes non-duplicate text to be deleted! :-(
What output would you expect after parsing the following node text/CID values:
Text CID
---- ---
Text1 CID1
Text2 CID1
Text2 CID2
Your "sample2.xml" file has CID values c01456783 and c01895783, but your "what_Im_hoping_to_get.xsl m" spreadsheet only lists c01456783 in the sample2 worksheet.
Is this just an oversight?
Is this just an oversight?
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
JW's solution is more elegant, but mine answers the question you asked.
I also provided a version of your spreadsheet with my code in it.
I also provided a version of your spreadsheet with my code in it.
No offense to Sudonim, but I don't believe his solution addresses the duplicate issue ...
@Bill
Since the worksheet name is the name of the XML file (minus extension), the first column (full path) seems redundant.
How do you need to handle duplicate text and duplicate CID values?
- are they only duplicates if both text and CID values are the same?
- are null/empty CID values ignored?
- which duplicate entry takes precedence?
Since the worksheet name is the name of the XML file (minus extension), the first column (full path) seems redundant.
How do you need to handle duplicate text and duplicate CID values?
- are they only duplicates if both text and CID values are the same?
- are null/empty CID values ignored?
- which duplicate entry takes precedence?
Does this give you the behaviour you're expecting?
Option Explicit
Public Sub ParseXML()
' Declare constants
Const PATH As String = "C:\CQEE\LAR Logic Flows\test\" ' XML file path
' Declare variables
Dim attAttribute As MSXML2.IXMLDOMAttribute ' Attribute
Dim colCIDs As Collection ' CID values
Dim colTexts As Collection ' Text values
Dim docDocument As MSXML2.DOMDocument60 ' Document
Dim lngCIDRow As Long ' CID row
Dim lngTextRow As Long ' Text row
Dim nodNode As MSXML2.IXMLDOMNode ' Node node
Dim nodTree As MSXML2.IXMLDOMNode ' Tree node
Dim strFile As String ' File
Dim varValue As Variant ' Value
Dim wksWorksheet As Excel.Worksheet ' Worksheet
' Disable screen updates
Application.ScreenUpdating = False
Application.DisplayAlerts = False
' Process XML files
strFile = Dir(PATH & "sample*.xml")
Do Until strFile = ""
' Create worksheet
Set wksWorksheet = Sheets.Add
wksWorksheet.Name = Replace(strFile, ".xml", "", , , vbTextCompare)
wksWorksheet.Range("A1") = "File"
wksWorksheet.Range("B1") = "Text"
wksWorksheet.Range("C1") = "CID"
lngTextRow = 2 ' Skip header row
lngCIDRow = 2 ' Skip header row
Set colTexts = New Collection
Set colCIDs = New Collection
' Parse XML
Set docDocument = New MSXML2.DOMDocument60
docDocument.async = False
docDocument.setProperty "ProhibitDTD", False
docDocument.validateOnParse = False
If docDocument.Load(PATH & strFile) Then
' Find 'tree' node
For Each nodTree In docDocument.childNodes
If nodTree.baseName = "tree" Then
' Process 'node' nodes
For Each nodNode In nodTree.childNodes
If nodNode.baseName = "node" Then
' Process attributes
For Each attAttribute In nodNode.attributes
Select Case attAttribute.baseName
' Add text value, if unique
Case "text"
varValue = Null
On Error Resume Next ' Ignore missing value
varValue = colTexts(attAttribute.Value)
On Error GoTo 0
If Nz(attAttribute.Value) <> "" And _
IsNull(varValue) Then ' Skip empty or existing values
wksWorksheet.Range("A" & lngTextRow) = PATH & strFile
wksWorksheet.Range("B" & lngTextRow) = attAttribute.Value
colTexts.Add attAttribute.Value, attAttribute.Value
lngTextRow = lngTextRow + 1
End If
' Add CID value, if unique
Case "CID"
varValue = Null
On Error Resume Next ' Ignore missing value
varValue = colCIDs(attAttribute.Value)
On Error GoTo 0
If Nz(attAttribute.Value) <> "" And _
IsNull(varValue) Then ' Skip empty or existing values
wksWorksheet.Range("A" & lngCIDRow) = PATH & strFile
wksWorksheet.Range("C" & lngCIDRow) = attAttribute.Value
colCIDs.Add attAttribute.Value, attAttribute.Value
lngCIDRow = lngCIDRow + 1
End If
End Select
Set attAttribute = Nothing
Next
End If
Set nodNode = Nothing
Next
End If
Set nodTree = Nothing
Next
End If
Set docDocument = Nothing
Set colTexts = Nothing
Set colCIDs = Nothing
' Resize columns
wksWorksheet.Cells.EntireColumn.AutoFit
Set wksWorksheet = Nothing
' Get next XML file
strFile = Dir
Loop
' Enable screen updates
Application.ScreenUpdating = True
Application.DisplayAlerts = True
End Sub
ASKER
WOw - lots of responses - thanks for all the consideration gang!
JW - your last macro is generating an error - sub not defined on this line:
If Nz(attAttribute.Value) <> "" And _
Also you asked:
Your "sample2.xml" file has CID values c01456783 and c01895783, but your "what_Im_hoping_to_get.xsl m" spreadsheet only lists c01456783 in the sample2 worksheet.
Is this just an oversight?
Yes it is - I was trying to show that we don't include the same value twice.
Aikimark - yes the first column is most likely redundant, but could be used in a pinch to identify precisely where the XML file is when I reverse this process and try to repopulate the XML with translated spreadsheet values.
Bill
JW - your last macro is generating an error - sub not defined on this line:
If Nz(attAttribute.Value) <> "" And _
Also you asked:
Your "sample2.xml" file has CID values c01456783 and c01895783, but your "what_Im_hoping_to_get.xsl
Is this just an oversight?
Yes it is - I was trying to show that we don't include the same value twice.
Aikimark - yes the first column is most likely redundant, but could be used in a pinch to identify precisely where the XML file is when I reverse this process and try to repopulate the XML with translated spreadsheet values.
Bill
@Bill
>>..reverse this process and try to repopulate the XML with translated spreadsheet values.
I think you'd better describe this ASAP. Since you aren't bringing in all the data, some data will be lost. Also, there might be a MUCH simpler way to accomplish your end goal.
>>..reverse this process and try to repopulate the XML with translated spreadsheet values.
I think you'd better describe this ASAP. Since you aren't bringing in all the data, some data will be lost. Also, there might be a MUCH simpler way to accomplish your end goal.
ASKER
Aikimark - I have a separate question started wuth regard to my last point - and as long as that function replaces all matching instances, then it doesn't matter if I have removed duplicates - as long as I have one instance. At least that's my thought.
Feel free to check out http:Q_26888950.html although I think there are already two solutions. I just need to get this first part right to have a legitimate test file to verify both stages are working.
Awesome help on these topics though. Thank for your time!
Bill
Feel free to check out http:Q_26888950.html although I think there are already two solutions. I just need to get this first part right to have a legitimate test file to verify both stages are working.
Awesome help on these topics though. Thank for your time!
Bill
@Bill
It is customary to get a solution to each part of your question/problem before continuing with the next part. Things change.
Also, you will see an Ask a Related Question link once you arrive at a solution in a thread.
It is customary to get a solution to each part of your question/problem before continuing with the next part. Things change.
Also, you will see an Ask a Related Question link once you arrive at a solution in a thread.
ASKER
JW - any insight on the error with Nz?
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
Thanks for the help!