Salvador T.
 asked on

Removing information on Duplicate XML Nodes

Hi,
I have the following XML

<?xml version="1.0" encoding="UTF-8"?>
<FILES ORIGINALFILE="TEST.xml">
      <FILETYPE NUMBER="ABC">
            <INFORMATION>
                  <COMPLETE>0</COMPLETE>
                  <TESTRESULTS>0</TESTRESULTS>
                  <CODE>STDEDF</CODE>
            </INFORMATION>
            <CINFO>
                  <KEY NAME="AAAA">C:\TEST\A</KEY>
                  <KEY NAME="BBBB">C:\TEST\B</KEY>
            </CINFO>
            <CUSTOM>
                  <PART="12345-1233">AB12345</PART>
                  <PART="JA1234">BD1234</PART>
            </CUSTOM>
            <INFORMATION>
                  <AGE>220</AGE>
                  <TIME>12:00:07</TIME>
                  <COMPLETE>0</COMPLETE>
                  <ULTS>0</ULTS>
            </INFORMATION>
            <CINFO>
                  <KEY NAME="CCCC">C:\TEST\C</KEY>
                  <KEY NAME="DDDD">C:\TEST\D</KEY>
            </CINFO>
            <CUSTOM>
                  <PART="12345-1233">BB12345</PART>
                  <PART="JA1234">CC1234</PART>
            </CUSTOM>
            <INFORMATION>
                  <NAME>TEST</NAME>
                  <TYPE>MOUSE</TYPE>
                  <TEST>TRUE</TEST>
                  <STATION>NODATA</STATION>
            </INFORMATION>
            <CINFO>
                  <KEY NAME="CCCC">C:\TEST\E</KEY>
                  <KEY NAME="DDDD">C:\TEST\F</KEY>
            </CINFO>
            <CUSTOM>
                  <PART="12345-1233">FF12345</PART>
                  <PART="JA1234">GG1234</PART>
            </CUSTOM>
      </FILETYPE>
</FILES>


How do I remove the duplicate <INFORMATION>, <CINFO> and <CUSTOM> nodes? I would like to keep only the newest nodes, like this:
<?xml version="1.0" encoding="UTF-8"?>
<FILES ORIGINALFILE="TEST.xml">
      <FILETYPE NUMBER="ABC">
            <INFORMATION>
                  <NAME>TEST</NAME>
                  <TYPE>MOUSE</TYPE>
                  <TEST>TRUE</TEST>
                  <STATION>NODATA</STATION>
            </INFORMATION>
            <CINFO>
                  <KEY NAME="CCCC">C:\TEST\E</KEY>
                  <KEY NAME="DDDD">C:\TEST\F</KEY>
            </CINFO>
            <CUSTOM>
                  <PART="12345-1233">FF12345</PART>
                  <PART="JA1234">GG1234</PART>
            </CUSTOM>
      </FILETYPE>
</FILES>

Thank you!
Tags: .NET Programming, XML, ASP.NET, Visual Basic.NET

Last Comment
Salvador T.

8/22/2022 - Mon
Darrell Porter

What does your normalized XML schema look like?  How do you differentiate Information and Custom elements by age to determine which is "newest"?
If you're asking how to delete nodes (from <Information> to </Information>) from the XML, which language are you writing in? VB.NET within Visual Studio version XXXX?

You cannot simply select the children by name and call ParentNode.RemoveChild on every match, because this would delete all such named child nodes, including the one you want to keep. I think you need to resolve the issue of node identification first.

Reference: https://msdn.microsoft.com/en-us/library/mt692982.aspx
Salvador T.

ASKER
Hello,
The nodes at the bottom are the newest. I'm using VB.NET.

Thx,
Salvador T.

ASKER
Hmm, can we look at each node and, if it exists more than once, delete the earlier occurrences from the parent?
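One way to read that idea, offered as a rough sketch rather than a tested answer: group the children of FILETYPE by element name and remove all but the last in each group, since the bottom-most nodes are the newest. It assumes the file has first been made well-formed (XDocument will not load it otherwise), and `KeepLastOfEachName` is a name invented for this illustration.

```vbnet
Imports System
Imports System.Linq
Imports System.Xml.Linq

Module KeepNewestDemo
    ' Hypothetical helper: for each element name under parent, keep only the
    ' last occurrence in document order and remove the earlier ones.
    Sub KeepLastOfEachName(parent As XElement)
        ' ToList() materializes the matches before any removal happens,
        ' so the tree is not modified while the query is still running.
        Dim stale = parent.Elements().
            GroupBy(Function(el) el.Name).
            SelectMany(Function(g) g.Take(g.Count() - 1)).
            ToList()

        For Each node In stale
            node.Remove()
        Next
    End Sub

    Sub Main()
        ' Shortened, well-formed stand-in for the posted file.
        Dim doc = XDocument.Parse(
            "<FILES ORIGINALFILE=""TEST.xml""><FILETYPE NUMBER=""ABC"">" &
            "<INFORMATION><CODE>STDEDF</CODE></INFORMATION>" &
            "<CINFO><KEY NAME=""AAAA"">C:\TEST\A</KEY></CINFO>" &
            "<INFORMATION><NAME>TEST</NAME></INFORMATION>" &
            "<CINFO><KEY NAME=""CCCC"">C:\TEST\E</KEY></CINFO>" &
            "</FILETYPE></FILES>")

        KeepLastOfEachName(doc.Root.Element("FILETYPE"))
        Console.WriteLine(doc)
    End Sub
End Module
```

This relies purely on position, so it only works if the writer always appends new groups at the end of FILETYPE.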
Fernando Soto

Hi Salvador;

The XML document that you posted is not well formed. For example, the following CUSTOM node has an issue: in each PART element either the tag name or an attribute name is missing.
<CUSTOM>
    <PART="12345-1233">AB12345</PART>
    <PART="JA1234">BD1234</PART>
</CUSTOM>


A properly formed node has the following format
<TAGNAME ATTRIBUTENAME="ATTRIBUTE VALUE">Inner text here</TAGNAME>


When you say, "How do I remove the Duplicate <INFORMATION>, <CINFO> and <CUSTOM> nodes", do you mean that two parent nodes count as duplicates only when they have the same number of children and every node name, attribute and inner text value is the same? Is that correct?
Darrell Porter

Fernando, this was my issue with this "XML" file.
Properly formatted XML has strict rules which apply to the format of nodes.
I would refer you to http://www.w3schools.com/xml/xml_syntax.asp
Fernando Soto

Hi WalkaboutTigger;

I know that; it was the reason for my post to Salvador, who posted the question.
Darrell Porter

@Fernando - my apologies - the comments about XML following strict rules and the link were directed at Salvador.
Fernando Soto

@WalkaboutTigger not a problem, I figured as much. Have a great day.
Salvador T.

ASKER
Hi WalkaboutTigger / Fernando,
What if I had an element such as

Time=19:02:08
on each node? Could that help to eliminate the old ones?

thx,
Salvador T.
Fernando Soto

Can you show what the new XML would look like?
Darrell Porter

I would recommend each node receive a time attribute:

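' Long avoids the 32-bit overflow in 2038; CLng makes the Double-to-integer conversion explicit (required under Option Strict On)
Dim unixTime As Long
unixTime = CLng((DateTime.UtcNow - New DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc)).TotalSeconds)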


so your Information, CInfo, and Custom nodes which are from the same write process would have the same timestamp

<INFORMATION time="123456789">
  <CINFO time="123456789">
  </CINFO>
</INFORMATION>



You would need to recalculate unixTime each time you write a grouped set of data to the XML file.
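A minimal sketch of how such a timestamp could drive the cleanup, assuming the stamped nodes are siblings under FILETYPE as in the original file, and that the attribute is called `time` as suggested above. `AppendGroup` and `KeepNewestGroup` are hypothetical helper names, and the stamps 100 and 200 are placeholders for real Unix times.

```vbnet
Imports System
Imports System.Linq
Imports System.Xml.Linq

Module TimestampDemo
    ' Hypothetical helper: appends one group of nodes, all carrying the same stamp.
    Sub AppendGroup(parent As XElement, unixTime As Long, ParamArray nodes As XElement())
        For Each node In nodes
            node.SetAttributeValue("time", unixTime)
            parent.Add(node)
        Next
    End Sub

    ' Hypothetical helper: removes every stamped child older than the newest stamp.
    Sub KeepNewestGroup(parent As XElement)
        Dim stamped = parent.Elements().
            Where(Function(el) el.Attribute("time") IsNot Nothing).
            ToList()
        If stamped.Count = 0 Then Return

        Dim newest = stamped.Max(Function(el) Long.Parse(el.Attribute("time").Value))

        ' Filter the in-memory list, then detach the old nodes from the tree.
        Dim old = stamped.
            Where(Function(el) Long.Parse(el.Attribute("time").Value) < newest).
            ToList()
        For Each node In old
            node.Remove()
        Next
    End Sub

    Sub Main()
        Dim fileType = New XElement("FILETYPE", New XAttribute("NUMBER", "ABC"))

        ' Two write passes with placeholder stamps.
        AppendGroup(fileType, 100, New XElement("INFORMATION"), New XElement("CINFO"), New XElement("CUSTOM"))
        AppendGroup(fileType, 200, New XElement("INFORMATION"), New XElement("CINFO"), New XElement("CUSTOM"))

        KeepNewestGroup(fileType)
        Console.WriteLine(fileType)
    End Sub
End Module
```

Unlike a purely positional rule, this keeps working if groups are ever written out of order, at the cost of changing the file format.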
Darrell Porter

I would highly recommend you write an XML Schema for your XML file.  This is especially true if you are writing code for which the generated XML

  • will need to be processed or parsed by industry-standard XML  tools,
  • will be maintained or used by someone other than yourself,
  • will be sold or provided to outside entities such as customers or business partners,
  • will be in use for purposes beyond the need for this program to manipulate.


An XML schema, commonly known as an XML Schema Definition (XSD), formally describes what a given XML document can contain, in the same way that a database schema describes the data that can be contained in a database (i.e. table structure, data types, constraints etc.).

The XML schema defines the shape, or structure, of an XML document, along with rules for data content and semantics such as what fields an element can contain, which sub elements it can contain and how many items can be present.

It can also describe the type and values that can be placed into each element or attribute. The XML data constraints are called facets and include rules such as min and max length.
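To make the description concrete, here is a small sketch of validating a document against a schema with a `maxLength` facet, using `XmlSchemaSet` and the `Validate` extension method from System.Xml.Schema. The schema and the sample document are invented for illustration and are much simpler than anything the real file would need.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Xml
Imports System.Xml.Linq
Imports System.Xml.Schema

Module SchemaDemo
    Sub Main()
        ' Tiny hypothetical schema: <root> holds <city> elements of at most 10 characters.
        Dim xsd As String =
            "<xs:schema xmlns:xs=""http://www.w3.org/2001/XMLSchema"">" &
            "<xs:element name=""root""><xs:complexType><xs:sequence>" &
            "<xs:element name=""city"" maxOccurs=""unbounded"">" &
            "<xs:simpleType><xs:restriction base=""xs:string"">" &
            "<xs:maxLength value=""10""/>" &
            "</xs:restriction></xs:simpleType></xs:element>" &
            "</xs:sequence></xs:complexType></xs:element></xs:schema>"

        ' An empty namespace string means the schema has no target namespace.
        Dim schemas As New XmlSchemaSet()
        schemas.Add("", XmlReader.Create(New StringReader(xsd)))

        ' The second city breaks the maxLength facet.
        Dim doc = XDocument.Parse(
            "<root><city>newyork</city><city>a-name-that-is-too-long</city></root>")

        ' Collect validation messages instead of throwing on the first one.
        Dim problems As New List(Of String)
        doc.Validate(schemas, Sub(sender, e) problems.Add(e.Message))

        For Each problem In problems
            Console.WriteLine(problem)
        Next
    End Sub
End Module
```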
Darrell Porter

If I take your first XML example and modify the <Part="..."> nodes to read <PART NAME="..."> then I have the following, overly-complex schema:

<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="FILES">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="FILETYPE">
          <xs:complexType>
            <xs:sequence>
              <xs:choice maxOccurs="unbounded">
                <xs:element name="INFORMATION">
                  <xs:complexType>
                    <xs:sequence>
                      <xs:element minOccurs="0" name="NAME" type="xs:string" />
                      <xs:element minOccurs="0" name="TYPE" type="xs:string" />
                      <xs:element minOccurs="0" name="TEST" type="xs:string" />
                      <xs:element minOccurs="0" name="STATION" type="xs:string" />
                      <xs:element minOccurs="0" name="AGE" type="xs:unsignedByte" />
                      <xs:element minOccurs="0" name="TIME" type="xs:time" />
                      <xs:element minOccurs="0" name="COMPLETE" type="xs:unsignedByte" />
                      <xs:element minOccurs="0" name="ULTS" type="xs:unsignedByte" />
                      <xs:element minOccurs="0" name="TESTRESULTS" type="xs:unsignedByte" />
                      <xs:element minOccurs="0" name="CODE" type="xs:string" />
                    </xs:sequence>
                  </xs:complexType>
                </xs:element>
                <xs:element name="CINFO">
                  <xs:complexType>
                    <xs:sequence>
                      <xs:element maxOccurs="unbounded" name="KEY">
                        <xs:complexType>
                          <xs:simpleContent>
                            <xs:extension base="xs:string">
                              <xs:attribute name="NAME" type="xs:string" use="required" />
                            </xs:extension>
                          </xs:simpleContent>
                        </xs:complexType>
                      </xs:element>
                    </xs:sequence>
                  </xs:complexType>
                </xs:element>
                <xs:element name="CUSTOM">
                  <xs:complexType>
                    <xs:sequence>
                      <xs:element maxOccurs="unbounded" name="PART">
                        <xs:complexType>
                          <xs:simpleContent>
                            <xs:extension base="xs:string">
                              <xs:attribute name="NAME" type="xs:string" use="required" />
                            </xs:extension>
                          </xs:simpleContent>
                        </xs:complexType>
                      </xs:element>
                    </xs:sequence>
                  </xs:complexType>
                </xs:element>
              </xs:choice>
            </xs:sequence>
            <xs:attribute name="NUMBER" type="xs:string" use="required" />
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="ORIGINALFILE" type="xs:string" use="required" />
    </xs:complexType>
  </xs:element>
</xs:schema>



If you could provide some insight into what you're ultimately trying to accomplish with this XML file, we may be able to provide you a more definitive answer to your initial question.
Darrell Porter

Any updates, @Salvador?
Salvador T.

ASKER
Hello WalkaboutTigger,
I'm trying to implement the previous solution provided by Fernando on another question.
' Requires: Imports System.Linq and Imports System.Xml.Linq
Dim fileName As String = "Victor2.xml"
Dim xdoc As XDocument = XDocument.Load(fileName)

' Find the duplicate nodes in the XML document:
' group the <table> elements by their lower-cased <Item> value
' and select every element after the first in each group
Dim results = (From n In xdoc.Descendants("table") _
               Group n By Item = n.Element("Item").Value.ToLower() Into itemGroup = Group _
               Where itemGroup.Count > 1 _
               From i In itemGroup.Skip(1) _
               Select i).ToList()

' Remove the duplicates from xdoc
results.ForEach(Sub(d) d.Remove())
' Save the modified xdoc to the file system
xdoc.Save(fileName)

thx,
Salvador T.
Fernando Soto

Hi Salvador;

The solution you posted, which I provided to another EE user, will not work in your case because LINQ to XML follows strict rules for naming nodes and querying them. For example, your XML is NOT well-formed because of nodes like this one.
<PART="12345-1233">AB12345</PART>


When LINQ to XML tries to load a document containing such a node, it throws a run-time exception like the following:
XmlException: The '=' character, hexadecimal value 0x3D, cannot be included in a name. Line 14, position 24.
Please provide XML that is well-formed so we can provide an acceptable solution.
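A quick way to check a fragment before posting or processing it is to parse it and catch the exception. This is only a sketch: `GetParseError` is a hypothetical helper name, and the `NAME` attribute in the second call is the fix suggested earlier in the thread, not something taken from the asker's real data.

```vbnet
Imports System
Imports System.Xml
Imports System.Xml.Linq

Module WellFormedDemo
    ' Hypothetical helper: returns Nothing if the text parses as XML,
    ' otherwise the message reported by the parser.
    Function GetParseError(xmlText As String) As String
        Try
            XDocument.Parse(xmlText)
            Return Nothing
        Catch ex As XmlException
            Return ex.Message
        End Try
    End Function

    Sub Main()
        ' The element name is followed directly by '=', so this does not parse.
        Console.WriteLine(GetParseError("<CUSTOM><PART=""JA1234"">BD1234</PART></CUSTOM>"))

        ' Same data with an attribute name supplied; this parses.
        Console.WriteLine(GetParseError("<CUSTOM><PART NAME=""JA1234"">BD1234</PART></CUSTOM>") Is Nothing)
    End Sub
End Module
```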
Salvador T.

ASKER
Hi Fernando,
I was able to reduce my problem to the following:
Here is a simpler XML example

<root>
<city>newyork</city>
<city>newyork</city>
<city>newyork</city>
<city>washington</city>
<city>washington</city>
</root>

The desired result is to eliminate the duplicate city elements in root:
<root>
<city>newyork</city>
<city>washington</city>
</root>

Thank you,
Salvador T.
Salvador T.

ASKER
Also, this is another scenario:

<root>
<timestamp1>20160912 152100</timestamp1>
<timestamp2>20160912 152100</timestamp2>
<state>ny</state>
<timestamp1>20160912 152100</timestamp1>
</root>

The desired result is to eliminate the duplicate timestamp1 element:
<root>
<timestamp1>20160912 152100</timestamp1>
<timestamp2>20160912 152100</timestamp2>
<state>ny</state>
</root>

thx.
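For these two reduced scenarios, one possible approach (a sketch only, and not necessarily what the accepted solution does) is to treat two sibling elements as duplicates when both their name and their text match, keep the first, and remove the rest. `RemoveDuplicateChildren` is a name made up for this example.

```vbnet
Imports System
Imports System.Linq
Imports System.Xml.Linq

Module DedupeDemo
    ' Hypothetical helper: among the direct children of parent, keeps the first
    ' element for each (name, text) pair and removes the later repeats.
    Sub RemoveDuplicateChildren(parent As XElement)
        Dim repeats = parent.Elements().
            GroupBy(Function(el) Tuple.Create(el.Name, el.Value)).
            SelectMany(Function(g) g.Skip(1)).
            ToList()

        For Each node In repeats
            node.Remove()
        Next
    End Sub

    Sub Main()
        ' First scenario: repeated <city> values.
        Dim cities = XElement.Parse(
            "<root><city>newyork</city><city>newyork</city><city>newyork</city>" &
            "<city>washington</city><city>washington</city></root>")
        RemoveDuplicateChildren(cities)
        Console.WriteLine(cities)

        ' Second scenario: only the repeated <timestamp1> is a duplicate,
        ' because <timestamp2> has a different element name.
        Dim stamps = XElement.Parse(
            "<root><timestamp1>20160912 152100</timestamp1>" &
            "<timestamp2>20160912 152100</timestamp2>" &
            "<state>ny</state>" &
            "<timestamp1>20160912 152100</timestamp1></root>")
        RemoveDuplicateChildren(stamps)
        Console.WriteLine(stamps)
    End Sub
End Module
```

Note that `Value` concatenates all descendant text, so for elements with children of their own this comparison ignores attributes and nested structure.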
SOLUTION
Fernando Soto

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.
Salvador T.

ASKER
Thank you Fernando.
I tested your code, and the results variable contains 0 items, so no elements are removed. I guess that is because it is looking for duplicate nodes rather than duplicate elements inside a node. Please find the actual XML below:

<?xml version="1.0" encoding="utf-8"?>
<UNITS ORIGINALFILE="PNA001_SN00001_20160821203148.UNIT.xml">
  <UNIT SERIALNUMBER="SN00001">
    <UNIT_DATA>
      <PRODUCTSN>SN00001</PRODUCTSN>
      <PRODUCT>PNA001</PRODUCT>
      <TIME_STARTED>20160821 202948</TIME_STARTED>
      <TIME_FINISHED>20160821 203148</TIME_FINISHED>
      <TIME_STARTED>20160821 202948</TIME_STARTED>
    </UNIT_DATA>
   </UNIT>
</UNITS>

What I would like to obtain is:

<?xml version="1.0" encoding="utf-8"?>
<UNITS ORIGINALFILE="PNA001_SN00001_20160821203148.UNIT.xml">
  <UNIT SERIALNUMBER="SN00001">
    <UNIT_DATA>
      <PRODUCTSN>SN00001</PRODUCTSN>
      <PRODUCT>PNA001</PRODUCT>
      <TIME_STARTED>20160821 202948</TIME_STARTED>
      <TIME_FINISHED>20160821 203148</TIME_FINISHED>
    </UNIT_DATA>
   </UNIT>
</UNITS>

That is, removing the duplicate element inside the <UNIT_DATA> node:

      <TIME_STARTED>20160821 202948</TIME_STARTED>

Thank you.
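For the real file the duplicate sits a few levels down, so a root-level check will not see it. A hedged sketch of one way to handle any depth: walk every element in the document and treat it as a repeat when an earlier sibling under the same parent has the same name and text. The sample is a shortened copy of the XML above, and this is an illustration rather than the certified solution.

```vbnet
Imports System
Imports System.Linq
Imports System.Xml.Linq

Module NestedDedupeDemo
    Sub Main()
        Dim doc = XDocument.Parse(
            "<UNITS><UNIT SERIALNUMBER=""SN00001""><UNIT_DATA>" &
            "<PRODUCTSN>SN00001</PRODUCTSN>" &
            "<TIME_STARTED>20160821 202948</TIME_STARTED>" &
            "<TIME_FINISHED>20160821 203148</TIME_FINISHED>" &
            "<TIME_STARTED>20160821 202948</TIME_STARTED>" &
            "</UNIT_DATA></UNIT></UNITS>")

        ' Group by (parent, name, text): elements only collide with their own
        ' siblings, so PRODUCTSN and SERIALNUMBER-style overlaps are untouched.
        Dim repeats = doc.Descendants().
            GroupBy(Function(el) Tuple.Create(el.Parent, el.Name, el.Value)).
            SelectMany(Function(g) g.Skip(1)).
            ToList()

        ' Remove everything after the first occurrence in each group.
        For Each node In repeats
            node.Remove()
        Next

        Console.WriteLine(doc)
    End Sub
End Module
```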
ASKER CERTIFIED SOLUTION
Fernando Soto

THIS SOLUTION ONLY AVAILABLE TO MEMBERS.
Salvador T.

ASKER
Thank you, Fernando, for all your help! That works perfectly!
Fernando Soto

Not a problem Salvador, glad to help. Please do not forget to mark the solution as the answer to the question.

Thank you.
Salvador T.

ASKER
Thank you!!!!