Solved

Removing information on Duplicate XML Nodes

Posted on 2016-09-02
Last Modified: 2016-09-13
Hi,
I have the following XML

<?xml version="1.0" encoding="UTF-8"?>
<FILES ORIGINALFILE="TEST.xml">
      <FILETYPE NUMBER="ABC">
            <INFORMATION>
                  <COMPLETE>0</COMPLETE>
                  <TESTRESULTS>0</TESTRESULTS>
                  <CODE>STDEDF</CODE>
            </INFORMATION>
            <CINFO>
                  <KEY NAME="AAAA">C:\TEST\A</KEY>
                  <KEY NAME="BBBB">C:\TEST\B</KEY>
            </CINFO>
            <CUSTOM>
                  <PART="12345-1233">AB12345</PART>
                  <PART="JA1234">BD1234</PART>
            </CUSTOM>
            <INFORMATION>
                  <AGE>220</AGE>
                  <TIME>12:00:07</TIME>
                  <COMPLETE>0</COMPLETE>
                  <ULTS>0</ULTS>
            </INFORMATION>
            <CINFO>
                  <KEY NAME="CCCC">C:\TEST\C</KEY>
                  <KEY NAME="DDDD">C:\TEST\D</KEY>
            </CINFO>
            <CUSTOM>
                  <PART="12345-1233">BB12345</PART>
                  <PART="JA1234">CC1234</PART>
            </CUSTOM>
            <INFORMATION>
                  <NAME>TEST</NAME>
                  <TYPE>MOUSE</TYPE>
                  <TEST>TRUE</TEST>
                  <STATION>NODATA</STATION>
            </INFORMATION>
            <CINFO>
                  <KEY NAME="CCCC">C:\TEST\E</KEY>
                  <KEY NAME="DDDD">C:\TEST\F</KEY>
            </CINFO>
            <CUSTOM>
                  <PART="12345-1233">FF12345</PART>
                  <PART="JA1234">GG1234</PART>
            </CUSTOM>
      </FILETYPE>
</FILES>


How do I remove the duplicate <INFORMATION>, <CINFO> and <CUSTOM> nodes? I would like to keep only the newest nodes, like this:
<?xml version="1.0" encoding="UTF-8"?>
<FILES ORIGINALFILE="TEST.xml">
      <FILETYPE NUMBER="ABC">
            <INFORMATION>
                  <NAME>TEST</NAME>
                  <TYPE>MOUSE</TYPE>
                  <TEST>TRUE</TEST>
                  <STATION>NODATA</STATION>
            </INFORMATION>
            <CINFO>
                  <KEY NAME="CCCC">C:\TEST\E</KEY>
                  <KEY NAME="DDDD">C:\TEST\F</KEY>
            </CINFO>
            <CUSTOM>
                  <PART="12345-1233">FF12345</PART>
                  <PART="JA1234">GG1234</PART>
            </CUSTOM>
      </FILETYPE>
</FILES>

Thank you!
Question by:Salvador T.
 
LVL 15

Expert Comment

by:WalkaboutTigger
ID: 41782453
What does your normalized XML schema look like?  How do you differentiate Information and Custom elements by age to determine which is "newest"?
If you're asking how to delete nodes (from <INFORMATION> to </INFORMATION>) from the XML, which language are you writing in?  VB.NET within Visual Studio version XXXX?

You cannot simply call ParentNode.RemoveChild yet: it removes only the specific child node you pass in, and nothing in the XML currently tells you which of the identically named nodes to pass.  Somehow I think you need to resolve the issue of node identification first.

Reference: https://msdn.microsoft.com/en-us/library/mt692982.aspx
0
 

Author Comment

by:Salvador T.
ID: 41782454
Hello,
The nodes at the bottom are the newest. I'm using VB.NET.

Thx,
0
 

Author Comment

by:Salvador T.
ID: 41782459
Hmm, can we look at each node and, if it exists more than once, delete the earlier nodes from the parent?
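
The idea above could be sketched roughly as follows with LINQ to XML. This is only an illustration under assumptions, not something posted in the thread: the helper name KeepLastOfEachName is hypothetical, the sample is trimmed, and the PART elements are given a NAME attribute (a guess) so that the XML parses at all.

```vbnet
Imports System
Imports System.Linq
Imports System.Xml.Linq

Module KeepNewestSketch
    ' Hypothetical helper: for each element name under parent, remove every
    ' occurrence except the last one in document order.
    Sub KeepLastOfEachName(parent As XElement)
        Dim stale = (From child In parent.Elements()
                     Group child By nodeName = child.Name Into sameName = Group
                     From old In sameName.Take(sameName.Count() - 1)
                     Select old).ToList()

        ' ToList() above snapshots the matches, so removing them here is safe.
        For Each node In stale
            node.Remove()
        Next
    End Sub

    Sub Main()
        ' Trimmed sample; PART uses an assumed NAME attribute so it is well-formed.
        Dim fileType As XElement =
            <FILETYPE NUMBER="ABC">
                <INFORMATION><CODE>STDEDF</CODE></INFORMATION>
                <CUSTOM><PART NAME="JA1234">BD1234</PART></CUSTOM>
                <INFORMATION><NAME>TEST</NAME></INFORMATION>
                <CUSTOM><PART NAME="JA1234">GG1234</PART></CUSTOM>
            </FILETYPE>

        KeepLastOfEachName(fileType)
        Console.WriteLine(fileType)
    End Sub
End Module
```

This treats "newest" as "last in document order", which matches the statement that the bottom nodes are the newest; it would need rethinking if the write order ever changed.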
0
 
LVL 62

Expert Comment

by:Fernando Soto
ID: 41782487
Hi Salvador;

The XML document that you posted is not well-formed. For example, the following CUSTOM node has an issue: in its PART children, either the tag name or an attribute name is missing.
<CUSTOM>
    <PART="12345-1233">AB12345</PART>
    <PART="JA1234">BD1234</PART>
</CUSTOM>


A properly formed node has the following format
<TAGNAME ATTRIBUTENAME="ATTRIBUTE VALUE">Inner text here</TAGNAME>


When you say "How do I remove the Duplicate <INFORMATION>, <CINFO> and <CUSTOM> nodes", do you mean that two parent nodes count as duplicates only if they have the same number of children and every child's node name, attributes and inner text are the same? Is that correct?
1
 
LVL 15

Expert Comment

by:WalkaboutTigger
ID: 41783960
Fernando, this was my issue with this "XML" file.
Properly formatted XML has strict rules which apply to the format of nodes.
I would refer you to http://www.w3schools.com/xml/xml_syntax.asp
0
 
LVL 62

Expert Comment

by:Fernando Soto
ID: 41784015
Hi WalkaboutTigger;

I know that, and it was the reason for my post to Salvador, who posted the question.
0
 
LVL 15

Expert Comment

by:WalkaboutTigger
ID: 41786393
@Fernando - my apologies - the comments about XML following strict rules and the link were directed at Salvador.
0
 
LVL 62

Expert Comment

by:Fernando Soto
ID: 41786398
@WalkaboutTigger not a problem, I figured as much. Have a great day.
0
 

Author Comment

by:Salvador T.
ID: 41786531
Hi WalkaboutTigger / Fernando,
What if I add an element such as

Time=19:02:08
to each node? Could that help to eliminate the old ones?

thx,
Salvador T.
0
 
LVL 62

Expert Comment

by:Fernando Soto
ID: 41786534
Can you show what the new XML would look like?
0
 
LVL 15

Expert Comment

by:WalkaboutTigger
ID: 41786582
I would recommend each node receive a time attribute:

' Seconds since the Unix epoch; Long avoids the Integer overflow in 2038
Dim unixTime As Long
unixTime = CLng((DateTime.UtcNow - New DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc)).TotalSeconds)


so your Information, CInfo, and Custom nodes which are from the same write process would have the same timestamp

<INFORMATION time="123456789">
  <CINFO time="123456789">
  </CINFO>
</INFORMATION>



You would need to recalculate unixTime each time you write a grouped set of data to the XML file.
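
As a rough sketch of how such a timestamp could then be used to drop the older sets: the code below assumes every child of FILETYPE carries a numeric time attribute as suggested above, and the timestamp values and the module name are made up for illustration.

```vbnet
Imports System
Imports System.Linq
Imports System.Xml.Linq

Module TimestampSketch
    Sub Main()
        ' Illustrative data: two write passes, each stamping its nodes with the same time.
        Dim fileType As XElement =
            <FILETYPE NUMBER="ABC">
                <INFORMATION time="1472800000"><CODE>STDEDF</CODE></INFORMATION>
                <CINFO time="1472800000"><KEY NAME="AAAA">C:\TEST\A</KEY></CINFO>
                <INFORMATION time="1472803600"><NAME>TEST</NAME></INFORMATION>
                <CINFO time="1472803600"><KEY NAME="CCCC">C:\TEST\E</KEY></CINFO>
            </FILETYPE>

        ' Read an element's time attribute as a number of seconds.
        Dim stampOf As Func(Of XElement, Long) =
            Function(e) Long.Parse(e.Attribute("time").Value)

        ' The largest timestamp identifies the newest write pass.
        Dim newest As Long = fileType.Elements().Select(stampOf).Max()

        ' Snapshot everything older than that, then remove it.
        Dim stale = fileType.Elements().Where(Function(e) stampOf(e) < newest).ToList()
        For Each node In stale
            node.Remove()
        Next

        Console.WriteLine(fileType)
    End Sub
End Module
```

An element without a time attribute would throw here, so real code would need to decide how to treat unstamped nodes.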
0
 
LVL 15

Expert Comment

by:WalkaboutTigger
ID: 41786596
I would highly recommend you write an XML Schema for your XML file.  This is especially true if you are writing code for which the generated XML

  • will need to be processed or parsed by industry-standard XML  tools,
  • will be maintained or used by someone other than yourself,
  • will be sold or provided to outside entities such as customers or business partners,
  • will be in use for purposes beyond the need for this program to manipulate.


An XML schema, commonly known as an XML Schema Definition (XSD), formally describes what a given XML document can contain, in the same way that a database schema describes the data that can be contained in a database (i.e. table structure, data types, constraints etc.).

The XML schema defines the shape, or structure, of an XML document, along with rules for data content and semantics such as what fields an element can contain, which sub elements it can contain and how many items can be present.

It can also describe the type and values that can be placed into each element or attribute. The XML data constraints are called facets and include rules such as min and max length.
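
To make this concrete, here is a minimal sketch of validating a document against a schema from VB.NET. The tiny schema (a root element holding city strings with a maxLength facet of 20) and the sample document are invented for illustration; they are not the schema for the file in this question.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Xml
Imports System.Xml.Linq
Imports System.Xml.Schema

Module SchemaCheckSketch
    Sub Main()
        ' Illustrative schema: <root> holds <city> strings of at most 20 characters.
        Dim xsd As String =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>" &
            "<xs:element name='root'><xs:complexType><xs:sequence>" &
            "<xs:element name='city' maxOccurs='unbounded'>" &
            "<xs:simpleType><xs:restriction base='xs:string'>" &
            "<xs:maxLength value='20'/>" &
            "</xs:restriction></xs:simpleType></xs:element>" &
            "</xs:sequence></xs:complexType></xs:element></xs:schema>"

        Dim schemas As New XmlSchemaSet()
        schemas.Add(Nothing, XmlReader.Create(New StringReader(xsd)))

        ' <town> is not allowed by the schema, so validation should report it.
        Dim doc = XDocument.Parse("<root><city>newyork</city><town>x</town></root>")
        Dim problems As New List(Of String)
        doc.Validate(schemas, Sub(sender, e) problems.Add(e.Message))

        For Each p In problems
            Console.WriteLine(p)
        Next
    End Sub
End Module
```

Collecting messages in a list instead of throwing lets you see every violation in one pass.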
0
 
LVL 15

Expert Comment

by:WalkaboutTigger
ID: 41786631
If I take your first XML example and modify the <PART="..."> nodes to read <PART NAME="...">, then I get the following, overly complex schema:

<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="FILES">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="FILETYPE">
          <xs:complexType>
            <xs:sequence>
              <xs:choice maxOccurs="unbounded">
                <xs:element name="INFORMATION">
                  <xs:complexType>
                    <xs:sequence>
                      <xs:element minOccurs="0" name="NAME" type="xs:string" />
                      <xs:element minOccurs="0" name="TYPE" type="xs:string" />
                      <xs:element minOccurs="0" name="TEST" type="xs:string" />
                      <xs:element minOccurs="0" name="STATION" type="xs:string" />
                      <xs:element minOccurs="0" name="AGE" type="xs:unsignedByte" />
                      <xs:element minOccurs="0" name="TIME" type="xs:time" />
                      <xs:element minOccurs="0" name="COMPLETE" type="xs:unsignedByte" />
                      <xs:element minOccurs="0" name="ULTS" type="xs:unsignedByte" />
                      <xs:element minOccurs="0" name="TESTRESULTS" type="xs:unsignedByte" />
                      <xs:element minOccurs="0" name="CODE" type="xs:string" />
                    </xs:sequence>
                  </xs:complexType>
                </xs:element>
                <xs:element name="CINFO">
                  <xs:complexType>
                    <xs:sequence>
                      <xs:element maxOccurs="unbounded" name="KEY">
                        <xs:complexType>
                          <xs:simpleContent>
                            <xs:extension base="xs:string">
                              <xs:attribute name="NAME" type="xs:string" use="required" />
                            </xs:extension>
                          </xs:simpleContent>
                        </xs:complexType>
                      </xs:element>
                    </xs:sequence>
                  </xs:complexType>
                </xs:element>
                <xs:element name="CUSTOM">
                  <xs:complexType>
                    <xs:sequence>
                      <xs:element maxOccurs="unbounded" name="PART">
                        <xs:complexType>
                          <xs:simpleContent>
                            <xs:extension base="xs:string">
                              <xs:attribute name="NAME" type="xs:string" use="required" />
                            </xs:extension>
                          </xs:simpleContent>
                        </xs:complexType>
                      </xs:element>
                    </xs:sequence>
                  </xs:complexType>
                </xs:element>
              </xs:choice>
            </xs:sequence>
            <xs:attribute name="NUMBER" type="xs:string" use="required" />
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="ORIGINALFILE" type="xs:string" use="required" />
    </xs:complexType>
  </xs:element>
</xs:schema>



If you could provide some insight into what you're ultimately trying to accomplish with this XML file, we may be able to provide you a more definitive answer to your initial question.
0
 
LVL 15

Expert Comment

by:WalkaboutTigger
ID: 41789974
Any updates, @Salvador?
0
 

Author Comment

by:Salvador T.
ID: 41795020
Hello WalkaboutTigger,
I'm trying to implement the solution Fernando provided earlier on another question.
Dim fileName As String = "Victor2.xml"
Dim xdoc As XDocument = XDocument.Load(fileName)

' Find the duplicate nodes in the XML document
Dim results = (From n In xdoc.Descendants("table") _
               Group n By Item = n.Element("Item").Value.ToLower() Into itemGroup = Group _
               Where itemGroup.Count > 1 _
               From i In itemGroup.Skip(1) _
               Select i).ToList()

' Remove the duplicates from xdoc
results.ForEach(Sub(d) d.Remove())
' Save the modified xdoc to the file system
xdoc.Save(fileName)

thx,
Salvador T.
0
 
LVL 62

Expert Comment

by:Fernando Soto
ID: 41795091
Hi Salvador;

The solution you posted, which I provided to another EE user, will not work in your case because LINQ to XML follows strict rules for naming nodes and querying them. For example, your XML is NOT well-formed because of nodes like this one:
<PART="12345-1233">AB12345</PART>

When LINQ to XML tries to load a document containing such a node, it throws a run-time exception like the following:
XmlException: The '=' character, hexadecimal value 0x3D, cannot be included in a name. Line 14, position 24.
Please provide XML that is well-formed so we can provide an acceptable solution.
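
A quick way to check well-formedness before running any query is to attempt a parse and catch the exception. The sketch below is only illustrative: the helper name TryParseXml is hypothetical, and the NAME attribute in the second test string is an assumed fix, not something confirmed by the author.

```vbnet
Imports System
Imports System.Xml
Imports System.Xml.Linq

Module WellFormedCheckSketch
    ' Hypothetical helper: returns True when text parses as XML,
    ' otherwise False with the parser's message in the ByRef argument.
    Function TryParseXml(text As String, ByRef message As String) As Boolean
        Try
            XDocument.Parse(text)
            message = Nothing
            Return True
        Catch ex As XmlException
            message = ex.Message
            Return False
        End Try
    End Function

    Sub Main()
        Dim msg As String = Nothing

        ' The node as posted: '=' directly after the tag name is not allowed.
        Console.WriteLine(TryParseXml("<PART=""12345-1233"">AB12345</PART>", msg)) ' False
        Console.WriteLine(msg)

        ' With an attribute name added (NAME is an assumption) it parses.
        Console.WriteLine(TryParseXml("<PART NAME=""12345-1233"">AB12345</PART>", msg)) ' True
    End Sub
End Module
```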
0
 

Author Comment

by:Salvador T.
ID: 41795151
Hi Fernando,
I was able to reduce my problem to the following:
Here is a simpler XML example

<root>
<city>newyork</city>
<city>newyork</city>
<city>newyork</city>
<city>washington</city>
<city>washington</city>
</root>

The desired result is to eliminate the duplicate city elements in root:
<root>
<city>newyork</city>
<city>washington</city>
</root>

Thank you,
Salvador T.
0
 

Author Comment

by:Salvador T.
ID: 41795157
Also, this is another scenario:

<root>
<timestamp1>20160912 152100</timestamp1>
<timestamp2>20160912 152100</timestamp2>
<state>ny</state>
<timestamp1>20160912 152100</timestamp1>
</root>

The desired result would be to eliminate the duplicate timestamp1 element:
<root>
<timestamp1>20160912 152100</timestamp1>
<timestamp2>20160912 162100</timestamp2>
<state>ny</state>
</root>

thx.
0
 
LVL 62

Assisted Solution

by:Fernando Soto
Fernando Soto earned 500 total points
ID: 41795873
Hi Salvador;

The following code will work with the last two XML documents you posted.
' File name and path without file extension
Dim fileName = "C:\Working Directory\Salvador3"
Dim xdoc = XDocument.Load(fileName & ".xml")

Dim results = (From c In xdoc.Root.Elements()
			   Group c By Key = c.Name.LocalName & ":" & c.Value.ToLower() Into nodeGroup = Group
			   Where nodeGroup.Count > 1
			   Let toBeRemoved = nodeGroup.Take(nodeGroup.Count() - 1)
			   From s In toBeRemoved 
			   Select s).ToList()

' Remove the duplicates from xdoc                                                           
results.ForEach(Sub(d) d.Remove())

' Save the modified xdoc to the file system                   
xdoc.Save(fileName & "_New" & ".xml")


1
 

Author Comment

by:Salvador T.
ID: 41796777
Thank you Fernando.
I tested your code, and the results variable contains 0 elements, so nothing is removed. I guess that is because it is looking for duplicate nodes at a different level. Please find the actual XML below:

<?xml version="1.0" encoding="utf-8"?>
<UNITS ORIGINALFILE="PNA001_SN00001_20160821203148.UNIT.xml">
  <UNIT SERIALNUMBER="SN00001">
    <UNIT_DATA>
      <PRODUCTSN>SN00001</PRODUCTSN>
      <PRODUCT>PNA001</PRODUCT>
      <TIME_STARTED>20160821 202948</TIME_STARTED>
      <TIME_FINISHED>20160821 203148</TIME_FINISHED>
      <TIME_STARTED>20160821 202948</TIME_STARTED>
    </UNIT_DATA>
   </UNIT>
</UNITS>

What I would like to obtain is:

<?xml version="1.0" encoding="utf-8"?>
<UNITS ORIGINALFILE="PNA001_SN00001_20160821203148.UNIT.xml">
  <UNIT SERIALNUMBER="SN00001">
    <UNIT_DATA>
      <PRODUCTSN>SN00001</PRODUCTSN>
      <PRODUCT>PNA001</PRODUCT>
      <TIME_STARTED>20160821 202948</TIME_STARTED>
      <TIME_FINISHED>20160821 203148</TIME_FINISHED>
    </UNIT_DATA>
   </UNIT>
</UNITS>

That is, removing the duplicate element inside the <UNIT_DATA> node:

      <TIME_STARTED>20160821 202948</TIME_STARTED>

Thank you.
0
 
LVL 62

Accepted Solution

by:
Fernando Soto earned 500 total points
ID: 41796862
Hi Salvador;

I made this point in a previous post:
"The solution you posted, which I provided to another EE user, will not work in your case because LINQ to XML follows strict rules for naming nodes and querying them."
That is also why I said this in my last post:
"The following code will work with the last two XML documents you posted."
When you query XML documents with LINQ to XML or with XmlDocument, the query has to match the structure of the document. Think of the XML document as a map. To get from point A to point B you follow the instructions. If you follow those same instructions from point C to point B you will not arrive at the destination. In the earlier samples the duplicated elements were direct children of the root; in this document they sit inside UNIT_DATA, two levels below the root, so xdoc.Root.Elements() returns only the single UNIT element and no duplicates are found. To get my previous solution to work with this new document you will need to change this line in the code
Dim results = (From c In xdoc.Root.Elements()

to this
Dim results = (From c In xdoc.Root.Descendants("UNIT_DATA").Elements()

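
The effect of that one-line change can be seen by running the same grouping against both scopes. This is a sketch for illustration: the FindDuplicates wrapper is a hypothetical refactoring of the query above, and the document is built in memory instead of being loaded from a file.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Xml.Linq

Module ScopeSketch
    ' Same grouping as the solution above, with the search scope passed in.
    ' Returns every duplicate except the last occurrence in each group.
    Function FindDuplicates(scope As IEnumerable(Of XElement)) As List(Of XElement)
        Return (From c In scope
                Group c By Key = c.Name.LocalName & ":" & c.Value.ToLower() Into nodeGroup = Group
                Where nodeGroup.Count() > 1
                From s In nodeGroup.Take(nodeGroup.Count() - 1)
                Select s).ToList()
    End Function

    Sub Main()
        Dim root As XElement =
            <UNITS ORIGINALFILE="PNA001_SN00001_20160821203148.UNIT.xml">
                <UNIT SERIALNUMBER="SN00001">
                    <UNIT_DATA>
                        <PRODUCTSN>SN00001</PRODUCTSN>
                        <PRODUCT>PNA001</PRODUCT>
                        <TIME_STARTED>20160821 202948</TIME_STARTED>
                        <TIME_FINISHED>20160821 203148</TIME_FINISHED>
                        <TIME_STARTED>20160821 202948</TIME_STARTED>
                    </UNIT_DATA>
                </UNIT>
            </UNITS>

        ' Children of the root: only the single UNIT element, so nothing repeats.
        Console.WriteLine(FindDuplicates(root.Elements()).Count) ' 0

        ' Children of UNIT_DATA: TIME_STARTED appears twice with the same value.
        Console.WriteLine(FindDuplicates(root.Descendants("UNIT_DATA").Elements()).Count) ' 1
    End Sub
End Module
```

One caveat worth keeping in mind: Descendants("UNIT_DATA").Elements() pools the children of every UNIT_DATA node in the file, so with several UNIT entries a value repeated across different units would also be treated as a duplicate. Running the query once per UNIT_DATA element would avoid that.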

1
 

Author Comment

by:Salvador T.
ID: 41796871
Thank you, Fernando, for all your help! That works perfectly!
0
 
LVL 62

Expert Comment

by:Fernando Soto
ID: 41796905
Not a problem Salvador, glad to help. Please do not forget to mark the solution as the answer to the question.

Thank you.
0
 

Author Closing Comment

by:Salvador T.
ID: 41796924
Thank you!!!!
0
