Removing information on Duplicate XML Nodes

Hi,
I have the following XML

<?xml version="1.0" encoding="UTF-8"?>
<FILES ORIGINALFILE="TEST.xml">
      <FILETYPE NUMBER="ABC">
            <INFORMATION>
                  <COMPLETE>0</COMPLETE>
                  <TESTRESULTS>0</TESTRESULTS>
                  <CODE>STDEDF</CODE>
            </INFORMATION>
            <CINFO>
                  <KEY NAME="AAAA">C:\TEST\A</KEY>
                  <KEY NAME="BBBB">C:\TEST\B</KEY>
            </CINFO>
            <CUSTOM>
                  <PART="12345-1233">AB12345</PART>
                  <PART="JA1234">BD1234</PART>
            </CUSTOM>
            <INFORMATION>
                  <AGE>220</AGE>
                  <TIME>12:00:07</TIME>
                  <COMPLETE>0</COMPLETE>
                  <ULTS>0</ULTS>
            </INFORMATION>
            <CINFO>
                  <KEY NAME="CCCC">C:\TEST\C</KEY>
                  <KEY NAME="DDDD">C:\TEST\D</KEY>
            </CINFO>
            <CUSTOM>
                  <PART="12345-1233">BB12345</PART>
                  <PART="JA1234">CC1234</PART>
            </CUSTOM>
            <INFORMATION>
                  <NAME>TEST</NAME>
                  <TYPE>MOUSE</TYPE>
                  <TEST>TRUE</TEST>
                  <STATION>NODATA</STATION>
            </INFORMATION>
            <CINFO>
                  <KEY NAME="CCCC">C:\TEST\E</KEY>
                  <KEY NAME="DDDD">C:\TEST\F</KEY>
            </CINFO>
            <CUSTOM>
                  <PART="12345-1233">FF12345</PART>
                  <PART="JA1234">GG1234</PART>
            </CUSTOM>
      </FILETYPE>
</FILES>


How do I remove the duplicate <INFORMATION>, <CINFO> and <CUSTOM> nodes? I would like to keep only the newest nodes, like this:
<?xml version="1.0" encoding="UTF-8"?>
<FILES ORIGINALFILE="TEST.xml">
      <FILETYPE NUMBER="ABC">
            <INFORMATION>
                  <NAME>TEST</NAME>
                  <TYPE>MOUSE</TYPE>
                  <TEST>TRUE</TEST>
                  <STATION>NODATA</STATION>
            </INFORMATION>
            <CINFO>
                  <KEY NAME="CCCC">C:\TEST\E</KEY>
                  <KEY NAME="DDDD">C:\TEST\F</KEY>
            </CINFO>
            <CUSTOM>
                  <PART="12345-1233">FF12345</PART>
                  <PART="JA1234">GG1234</PART>
            </CUSTOM>
      </FILETYPE>
</FILES>

Thank you!
Salvador T.

Darrell Porter (Enterprise Business Process Architect) commented:
What does your normalized XML schema look like? How do you differentiate Information and Custom elements by age to determine which is "newest"?
If you're asking how to delete nodes (from <Information> to </Information>) from the XML, which language are you writing in? VB.NET within Visual Studio version XXXX?

You cannot simply remove child nodes by name, because that would delete every child with that name, including the one you want to keep. I think you need to resolve the issue of node identification first.

Reference: https://msdn.microsoft.com/en-us/library/mt692982.aspx
Salvador T. (Author) commented:
Hello,
The nodes at the bottom are the newest. The language is VB.NET.

Thx,
Salvador T. (Author) commented:
Hmm, can we look at each node and, if it exists more than once, delete the earlier occurrences from the parent node?
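
The idea in this comment can be sketched with LINQ to XML. This is only a minimal sketch under two assumptions: the XML has first been made well-formed (for example <PART NAME="..."> instead of <PART="...">), and "newest" simply means "last in document order", as stated above. The module and helper names are hypothetical.

```vbnet
Imports System
Imports System.Linq
Imports System.Xml.Linq

Module KeepNewestSketch
    ' Hypothetical helper: for every child-element name under parent,
    ' remove all occurrences except the last one in document order.
    Public Sub KeepLastOfEachName(parent As XElement)
        Dim stale = parent.Elements().
            GroupBy(Function(e) e.Name).
            SelectMany(Function(g) g.Take(g.Count() - 1)).
            ToList()   ' materialize the list before modifying the tree

        For Each oldNode In stale
            oldNode.Remove()
        Next
    End Sub

    Sub Main()
        ' Shortened, well-formed stand-in for the XML in the question
        Dim doc = XDocument.Parse(
            "<FILES><FILETYPE NUMBER=""ABC"">" &
            "<INFORMATION><AGE>220</AGE></INFORMATION>" &
            "<CINFO><KEY NAME=""AAAA"">C:\TEST\A</KEY></CINFO>" &
            "<INFORMATION><NAME>TEST</NAME></INFORMATION>" &
            "<CINFO><KEY NAME=""CCCC"">C:\TEST\E</KEY></CINFO>" &
            "</FILETYPE></FILES>")

        For Each fileType In doc.Descendants("FILETYPE").ToList()
            KeepLastOfEachName(fileType)
        Next

        Console.WriteLine(doc)
    End Sub
End Module
```

Grouping by element name alone treats any two <INFORMATION> nodes as duplicates regardless of their content, which matches the desired output in the question but would be too aggressive if repeated names can legitimately coexist.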

Fernando Soto (Retired) commented:
Hi Salvador;

The XML document that you posted is not well-formed. For example, the following CUSTOM node has an issue: either the tag name or an attribute name is missing.
<CUSTOM>
    <PART="12345-1233">AB12345</PART>
    <PART="JA1234">BD1234</PART>
</CUSTOM>

A properly formed node has the following format:
<TAGNAME ATTRIBUTENAME="ATTRIBUTE VALUE">Inner text here</TAGNAME>

When you say "How do I remove the Duplicate <INFORMATION>, <CINFO> and <CUSTOM> nodes", do you mean that for two nodes to count as duplicates they must have the same number of children, and every node name, attribute and inner text must be the same? Is that correct?
Darrell Porter (Enterprise Business Process Architect) commented:
Fernando, this was my issue with this "XML" file.
Properly formatted XML has strict rules which apply to the format of nodes.
I would refer you to http://www.w3schools.com/xml/xml_syntax.asp
Fernando Soto (Retired) commented:
Hi WalkaboutTigger;

I know that, and it is the reason for my post to Salvador, who posted the question.
Darrell Porter (Enterprise Business Process Architect) commented:
@Fernando - my apologies - the comments about XML following strict rules and the link was directed at Salvador.
Fernando Soto (Retired) commented:
@WalkaboutTigger not a problem, I figured as much. Have a great day.
Salvador T. (Author) commented:
Hi WalkaboutTigger / Fernando,
What if I had an element

Time=19:02:08
on each node? Could that help to eliminate the old ones?

thx,
Salvador T.
Fernando Soto (Retired) commented:
Can you show what the new XML would look like?
Darrell Porter (Enterprise Business Process Architect) commented:
I would recommend that each node receive a time attribute:

' Seconds since the Unix epoch; Long avoids the 2038 overflow of a 32-bit Integer
Dim unixTime As Long = CLng(Math.Floor((DateTime.UtcNow - New DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc)).TotalSeconds))

so that your Information, CInfo, and Custom nodes which come from the same write process have the same timestamp:

<INFORMATION time="123456789">
  <CINFO time="123456789">
  </CINFO>
</INFORMATION>

You would need to recalculate unixTime each time you write a grouped set of data to the XML file.
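
Once every group carries such an attribute, the newest node of each kind could be selected by value rather than by position. The following is a rough sketch, assuming the attribute is named time and holds whole Unix seconds as suggested above; the module and helper names are hypothetical.

```vbnet
Imports System
Imports System.Linq
Imports System.Xml.Linq

Module NewestByTimeSketch
    ' Hypothetical helper: among children that share a name, keep only the one
    ' with the largest numeric "time" attribute. Children without the attribute
    ' are left untouched.
    Public Sub KeepNewestByTime(parent As XElement)
        Dim stale = parent.Elements().
            Where(Function(e) e.Attribute("time") IsNot Nothing).
            GroupBy(Function(e) e.Name).
            SelectMany(Function(g) g.OrderByDescending(Function(e) Long.Parse(e.Attribute("time").Value)).Skip(1)).
            ToList()   ' materialize the list before modifying the tree

        For Each oldNode In stale
            oldNode.Remove()
        Next
    End Sub

    Sub Main()
        Dim fileType = XElement.Parse(
            "<FILETYPE NUMBER=""ABC"">" &
            "<INFORMATION time=""1473693660""><NAME>TEST</NAME></INFORMATION>" &
            "<INFORMATION time=""1473690060""><AGE>220</AGE></INFORMATION>" &
            "</FILETYPE>")

        KeepNewestByTime(fileType)
        Console.WriteLine(fileType)
    End Sub
End Module
```

If two nodes carry the same timestamp, this sketch keeps whichever comes first in the document, so a timestamp with one-second resolution only helps when writes of the same kind are at least a second apart.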
Darrell Porter (Enterprise Business Process Architect) commented:
I would highly recommend you write an XML Schema for your XML file. This is especially true if you are writing code for which the generated XML

  • will need to be processed or parsed by industry-standard XML tools,
  • will be maintained or used by someone other than yourself,
  • will be sold or provided to outside entities such as customers or business partners,
  • will be used for purposes beyond what this one program needs.


An XML schema, commonly known as an XML Schema Definition (XSD), formally describes what a given XML document can contain, in the same way that a database schema describes the data that can be contained in a database (i.e. table structure, data types, constraints etc.).

The XML schema defines the shape, or structure, of an XML document, along with rules for data content and semantics such as what fields an element can contain, which sub elements it can contain and how many items can be present.

It can also describe the type and values that can be placed into each element or attribute. The XML data constraints are called facets and include rules such as min and max length.
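
As an illustration of what a schema buys you, a document can be checked against an XSD from VB.NET with XmlSchemaSet and the Validate extension method from System.Xml.Schema. This is a minimal sketch; the tiny schema and the helper name are made up for the example and are not the schema for the file in the question.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Xml
Imports System.Xml.Linq
Imports System.Xml.Schema

Module SchemaValidationSketch
    ' Hypothetical helper: returns the validation messages (an empty list means valid).
    Public Function ValidateAgainst(xsdText As String, xmlText As String) As List(Of String)
        Dim schemas As New XmlSchemaSet()
        Using reader = XmlReader.Create(New StringReader(xsdText))
            schemas.Add("", reader)   ' "" = the schema has no target namespace
        End Using

        Dim problems As New List(Of String)
        Dim doc = XDocument.Parse(xmlText)   ' throws XmlException if not well-formed
        doc.Validate(schemas, Sub(sender, e) problems.Add(e.Message))
        Return problems
    End Function

    Sub Main()
        ' Illustrative schema: a <root> containing one or more <city> strings
        Dim xsd = "<xs:schema xmlns:xs=""http://www.w3.org/2001/XMLSchema"">" &
                  "<xs:element name=""root""><xs:complexType><xs:sequence>" &
                  "<xs:element name=""city"" type=""xs:string"" maxOccurs=""unbounded""/>" &
                  "</xs:sequence></xs:complexType></xs:element></xs:schema>"

        Console.WriteLine(ValidateAgainst(xsd, "<root><city>newyork</city></root>").Count)
        Console.WriteLine(ValidateAgainst(xsd, "<root><town>x</town></root>").Count)
    End Sub
End Module
```

Note that validation only applies to documents that are already well-formed; a node such as <PART="..."> fails earlier, when the text is parsed.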
Darrell Porter (Enterprise Business Process Architect) commented:
If I take your first XML example and modify the <Part="..."> nodes to read <PART NAME="..."> then I have the following, overly-complex schema:

<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="FILES">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="FILETYPE">
          <xs:complexType>
            <xs:sequence>
              <xs:choice maxOccurs="unbounded">
                <xs:element name="INFORMATION">
                  <xs:complexType>
                    <xs:sequence>
                      <xs:element minOccurs="0" name="NAME" type="xs:string" />
                      <xs:element minOccurs="0" name="TYPE" type="xs:string" />
                      <xs:element minOccurs="0" name="TEST" type="xs:string" />
                      <xs:element minOccurs="0" name="STATION" type="xs:string" />
                      <xs:element minOccurs="0" name="AGE" type="xs:unsignedByte" />
                      <xs:element minOccurs="0" name="TIME" type="xs:time" />
                      <xs:element minOccurs="0" name="COMPLETE" type="xs:unsignedByte" />
                      <xs:element minOccurs="0" name="ULTS" type="xs:unsignedByte" />
                      <xs:element minOccurs="0" name="TESTRESULTS" type="xs:unsignedByte" />
                      <xs:element minOccurs="0" name="CODE" type="xs:string" />
                    </xs:sequence>
                  </xs:complexType>
                </xs:element>
                <xs:element name="CINFO">
                  <xs:complexType>
                    <xs:sequence>
                      <xs:element maxOccurs="unbounded" name="KEY">
                        <xs:complexType>
                          <xs:simpleContent>
                            <xs:extension base="xs:string">
                              <xs:attribute name="NAME" type="xs:string" use="required" />
                            </xs:extension>
                          </xs:simpleContent>
                        </xs:complexType>
                      </xs:element>
                    </xs:sequence>
                  </xs:complexType>
                </xs:element>
                <xs:element name="CUSTOM">
                  <xs:complexType>
                    <xs:sequence>
                      <xs:element maxOccurs="unbounded" name="PART">
                        <xs:complexType>
                          <xs:simpleContent>
                            <xs:extension base="xs:string">
                              <xs:attribute name="NAME" type="xs:string" use="required" />
                            </xs:extension>
                          </xs:simpleContent>
                        </xs:complexType>
                      </xs:element>
                    </xs:sequence>
                  </xs:complexType>
                </xs:element>
              </xs:choice>
            </xs:sequence>
            <xs:attribute name="NUMBER" type="xs:string" use="required" />
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="ORIGINALFILE" type="xs:string" use="required" />
    </xs:complexType>
  </xs:element>
</xs:schema>

If you could provide some insight into what you're ultimately trying to accomplish with this XML file, we may be able to provide you a more definitive answer to your initial question.
Darrell Porter (Enterprise Business Process Architect) commented:
Any updates, @Salvador?
Salvador T. (Author) commented:
Hello WalkaboutTigger,
I'm trying to implement the previous solution provided by Fernando on another question.
Dim fileName As String = "Victor2.xml"
Dim xdoc As XDocument = XDocument.Load(fileName)

' Find the duplicate nodes in the XML document
Dim results = (From n In xdoc.Descendants("table") _
               Group n By Item = n.Element("Item").Value.ToLower() Into itemGroup = Group _
               Where itemGroup.Count > 1 _
               From i In itemGroup.Skip(1) _
               Select i).ToList()

' Remove the duplicates from xdoc
results.ForEach(Sub(d) d.Remove())
' Save the modified xdoc to the file system
xdoc.Save(fileName)

thx,
Salvador T.
Fernando Soto (Retired) commented:
Hi Salvador;

The solution you posted, which I provided to another EE user, will not work in your case because LINQ to XML follows strict rules for naming nodes and querying them. For example, your XML is NOT well-formed because of nodes like this one:
<PART="12345-1233">AB12345</PART>

When LINQ to XML tries to load a document containing such a node, it throws a run-time exception like the following:
XmlException: The '=' character, hexadecimal value 0x3D, cannot be included in a name. Line 14, position 24.
Please provide an XML document that is well-formed so we can provide an acceptable solution.
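
A quick way to check a file before running any query against it is to catch that exception while parsing. This is a small sketch with a hypothetical helper name; it reports only the first problem the parser encounters.

```vbnet
Imports System
Imports System.Xml
Imports System.Xml.Linq

Module WellFormedCheckSketch
    ' Hypothetical helper: returns Nothing when the text parses,
    ' otherwise the parser's message with its line and position.
    Public Function GetParseError(xmlText As String) As String
        Try
            XDocument.Parse(xmlText)
            Return Nothing
        Catch ex As XmlException
            Return String.Format("Line {0}, position {1}: {2}",
                                 ex.LineNumber, ex.LinePosition, ex.Message)
        End Try
    End Function

    Sub Main()
        ' The node from the question: the attribute name is missing
        Dim bad = "<CUSTOM><PART=""12345-1233"">AB12345</PART></CUSTOM>"
        ' One possible repair, assuming the attribute should be called NAME
        Dim good = "<CUSTOM><PART NAME=""12345-1233"">AB12345</PART></CUSTOM>"

        Console.WriteLine(If(GetParseError(bad), "well-formed"))
        Console.WriteLine(If(GetParseError(good), "well-formed"))
    End Sub
End Module
```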
Salvador T. (Author) commented:
Hi Fernando,
I was able to reduce my problem to the following:
Here is a simpler XML example

<root>
<city>newyork</city>
<city>newyork</city>
<city>newyork</city>
<city>washington</city>
<city>washington</city>
</root>

The wanted result is to eliminate the duplicate city elements in root:
<root>
<city>newyork</city>
<city>washington</city>
</root>

Thank you,
Salvador T.
Salvador T. (Author) commented:
Also, this is another scenario:

<root>
<timestamp1>20160912 152100</timestamp1>
<timestamp2>20160912 152100</timestamp2>
<state>ny</state>
<timestamp1>20160912 152100</timestamp1>
</root>

The wanted result would be to eliminate the duplicate timestamp1 element:
<root>
<timestamp1>20160912 152100</timestamp1>
<timestamp2>20160912 162100</timestamp2>
<state>ny</state>
</root>

thx.
Fernando Soto (Retired) commented:
Hi Salvador;

The following code will work with the last two XML documents you posted.
' File name and path without file extension
Dim fileName = "C:\Working Directory\Salvador3"
Dim xdoc = XDocument.Load(fileName & ".xml")

Dim results = (From c In xdoc.Root.Elements()
			   Group c By Key = c.Name.LocalName & ":" & c.Value.ToLower() Into nodeGroup = Group
			   Where nodeGroup.Count > 1
			   Let toBeRemoved = nodeGroup.Take(nodeGroup.Count() - 1)
			   From s In toBeRemoved 
			   Select s).ToList()

' Remove the duplicates from xdoc                                                           
results.ForEach(Sub(d) d.Remove())

' Save the modified xdoc to the file system                   
xdoc.Save(fileName & "_New" & ".xml")

Salvador T. (Author) commented:
Thank you Fernando.
I tested your code, and the results list is empty (count 0), so no elements are removed. I guess this is because it is looking for duplicate nodes. Please find the actual XML below:

<?xml version="1.0" encoding="utf-8"?>
<UNITS ORIGINALFILE="PNA001_SN00001_20160821203148.UNIT.xml">
  <UNIT SERIALNUMBER="SN00001">
    <UNIT_DATA>
      <PRODUCTSN>SN00001</PRODUCTSN>
      <PRODUCT>PNA001</PRODUCT>
      <TIME_STARTED>20160821 202948</TIME_STARTED>
      <TIME_FINISHED>20160821 203148</TIME_FINISHED>
      <TIME_STARTED>20160821 202948</TIME_STARTED>
    </UNIT_DATA>
   </UNIT>
</UNITS>

What I would like to obtain is:

<?xml version="1.0" encoding="utf-8"?>
<UNITS ORIGINALFILE="PNA001_SN00001_20160821203148.UNIT.xml">
  <UNIT SERIALNUMBER="SN00001">
    <UNIT_DATA>
      <PRODUCTSN>SN00001</PRODUCTSN>
      <PRODUCT>PNA001</PRODUCT>
      <TIME_STARTED>20160821 202948</TIME_STARTED>
      <TIME_FINISHED>20160821 203148</TIME_FINISHED>
    </UNIT_DATA>
   </UNIT>
</UNITS>

That is, removing the duplicate element inside the <UNIT_DATA> node:

      <TIME_STARTED>20160821 202948</TIME_STARTED>

Thank you.
Fernando Soto (Retired) commented:
Hi Salvador;

I made this point in a previous post:
"The solution you posted, which I provided to another EE user, will not work in your case because LINQ to XML follows strict rules for naming nodes and querying them."
And that is the reason I stated the following in my last post:
"The following code will work with the last two XML documents you posted."
When you query XML documents with LINQ to XML or with XmlDocument, the queries follow strict rules for naming and locating nodes. Think of the XML document as a map. To get from point A to point B you follow a set of directions. If you follow those same directions starting from point C, you will not arrive at point B. In your two earlier samples the duplicate elements were direct children of the root; in this document they are children of UNIT_DATA, which sits two levels below the root. To get my previous solution to work with this new document you will need to change this line in the code
Dim results = (From c In xdoc.Root.Elements()

to this
Dim results = (From c In xdoc.Root.Descendants("UNIT_DATA").Elements()
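
One thing to watch with this change: Descendants("UNIT_DATA").Elements() pools the children of every UNIT_DATA node into one sequence, so if a file ever contains several UNIT nodes, equal values in different units would be treated as duplicates of each other. A variation that groups per UNIT_DATA node might look like the sketch below. It is an adaptation, not the code from the accepted answer, and the module and helper names are hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Xml.Linq

Module PerParentDedupSketch
    ' Hypothetical helper: inside each <containerName> element, remove children whose
    ' name and (case-insensitive) value repeat, keeping the last occurrence.
    Public Sub RemoveDuplicateChildren(doc As XDocument, containerName As String)
        Dim stale As New List(Of XElement)

        ' Collect first, remove afterwards, so the tree is not modified while it is enumerated
        For Each container In doc.Descendants(containerName)
            stale.AddRange(container.Elements().
                GroupBy(Function(child) child.Name.LocalName & ":" & child.Value.ToLowerInvariant()).
                SelectMany(Function(g) g.Take(g.Count() - 1)))
        Next

        For Each oldNode In stale
            oldNode.Remove()
        Next
    End Sub

    Sub Main()
        Dim doc = XDocument.Parse(
            "<UNITS><UNIT SERIALNUMBER=""SN00001""><UNIT_DATA>" &
            "<PRODUCT>PNA001</PRODUCT>" &
            "<TIME_STARTED>20160821 202948</TIME_STARTED>" &
            "<TIME_FINISHED>20160821 203148</TIME_FINISHED>" &
            "<TIME_STARTED>20160821 202948</TIME_STARTED>" &
            "</UNIT_DATA></UNIT></UNITS>")

        RemoveDuplicateChildren(doc, "UNIT_DATA")
        Console.WriteLine(doc)
    End Sub
End Module
```

Like the query above, this keeps the last occurrence of each duplicate, so the surviving TIME_STARTED ends up after TIME_FINISHED rather than before it as in the desired output; taking g.Skip(1) instead would keep the first occurrence and preserve that order.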

Salvador T. (Author) commented:
Thank you Fernando for all your help! That works perfectly!
Fernando Soto (Retired) commented:
Not a problem Salvador, glad to help. Please do not forget to mark the solution as the answer to the question.

Thank you.
Salvador T. (Author) commented:
Thank you!!!!