Victor Charles
asked on
Help with removing duplicate records in an xml file using VB.NET
Hi,
How do you remove duplicate records in an xml file? For example if I have an xml file with the following data:
<Root?
<Table>
<ID>1</ID>
<NSC>AAA</NSC>
<AGDC>BBB</AGD>
<NSN>CCC</NSN>
</Table>
<Table>
<ID>2</ID>
<NSC>AAA</NSC>
<AGDC>BBB</AGD>
<NSN>CCC</NSN>
</Table>
<Table>
<ID>3</ID>
<NSC>AAA</NSC>
<AGDC>BBB</AGD>
<NSN>CCC</NSN>
</Table>
</Root>
How do I obtain the following xml fil?
<Root?
<Table>
<ID>1</ID>
<NSC>AAA</NSC>
<AGDC>BBB</AGD>
<NSN>CCC</NSN>
</Table>
</Root>
The actual xml file contains more data element and more records.
Thanks,
Victor
How do you remove duplicate records in an xml file? For example if I have an xml file with the following data:
<Root?
<Table>
<ID>1</ID>
<NSC>AAA</NSC>
<AGDC>BBB</AGD>
<NSN>CCC</NSN>
</Table>
<Table>
<ID>2</ID>
<NSC>AAA</NSC>
<AGDC>BBB</AGD>
<NSN>CCC</NSN>
</Table>
<Table>
<ID>3</ID>
<NSC>AAA</NSC>
<AGDC>BBB</AGD>
<NSN>CCC</NSN>
</Table>
</Root>
How do I obtain the following xml fil?
<Root?
<Table>
<ID>1</ID>
<NSC>AAA</NSC>
<AGDC>BBB</AGD>
<NSN>CCC</NSN>
</Table>
</Root>
The actual xml file contains more data element and more records.
Thanks,
Victor
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
Thank you, will try it and get back to you.
@DUSTIN... you are the man! :) :) :) Thanks heavens for LINQ! Check this out without LINQ:
Dim doc As New XmlDocument
doc.Load("c:\temp\duplicates.xml")
Dim duplicates As New NameValueCollection, markedDuplicates As New list(Of String)
For Each node As XmlNode In doc.SelectNodes("//Table")
Try
Dim id As String = node.SelectSingleNode("ID").InnerText
If markedDuplicates.Contains(id) Then Continue For
Dim dupsXPath As String = String.Empty
'We loop through all the nodes under the Table element and build an Xpath to find duplicates nodes
'This loop avoids hard coding the child element names. I see a few new ones from previous questions, AGDC
For Each el As XmlNode In node.ChildNodes
If el.Name = "ID" Then
dupsXPath &= Iif(String.IsNullOrWhiteSpace(dupsXPath), String.Empty, " and ") & String.Format("{0} != ""{1}""", el.Name, el.InnerText)
Else
dupsXPath &= Iif(String.IsNullOrWhiteSpace(dupsXPath), String.Empty, " and ") & String.Format("{0} = ""{1}""", el.Name, el.InnerText)
End If
Next
dupsXPath = String.Format("//Table[{0}]", dupsXPath)
'The XPath returns some nodes. We keep track of which ones they are!
For Each dup As XmlNode In doc.SelectNodes(dupsXPath)
Dim dup_id As String = dup.SelectSingleNode("ID").InnerText
duplicates.Add(id, dup_id)
markedDuplicates.Add(dup_id)
Next
Catch ex As Exception
ex.Dump
End Try
Next
'We remove the identified nodes
Dim duplicate_nodes = duplicates.AllKeys.SelectMany(AddressOf duplicates.GetValues, Function(k, v) New With {.NodeToKeepId = k, .DuplicateNodeId = v}).ToList()
duplicate_nodes.ForEach(Function(node) doc.DocumentElement.RemoveChild(doc.SelectSingleNode(String.Format("//Table[ID=""{0}""]", node.DuplicateNodeId))))
I created a more complete sample input file, which has nodes without duplicates, and a second set of duplicates apart from what Author provided. Both approaches work. But yours is really nice and concise!
<Root>
<Table>
<ID>1</ID>
<NSC>AAA</NSC>
<AGDC>BBB</AGDC>
<NSN>CCC</NSN>
</Table>
<Table>
<ID>2</ID>
<NSC>AAA</NSC>
<AGDC>BBB</AGDC>
<NSN>CCC</NSN>
</Table>
<Table>
<ID>3</ID>
<NSC>AAA</NSC>
<AGDC>BBB</AGDC>
<NSN>CCC</NSN>
</Table>
<Table>
<ID>4</ID>
<NSC>AAA</NSC>
<AGDC>XXX</AGDC>
<NSN>CCC</NSN>
</Table>
<Table>
<ID>5</ID>
<NSC>AAA</NSC>
<AGDC>XXX</AGDC>
<NSN>CCC</NSN>
</Table>
<Table>
<ID>6</ID>
<NSC>XXX</NSC>
<AGDC>YYY</AGDC>
<NSN>ZZZ</NSN>
</Table>
</Root>
<Root>
<Table>
<ID>1</ID>
<NSC>AAA</NSC>
<AGDC>BBB</AGDC>
<NSN>CCC</NSN>
</Table>
<Table>
<ID>4</ID>
<NSC>AAA</NSC>
<AGDC>XXX</AGDC>
<NSN>CCC</NSN>
</Table>
<Table>
<ID>6</ID>
<NSC>XXX</NSC>
<AGDC>YYY</AGDC>
<NSN>ZZZ</NSN>
</Table>
</Root>
ASKER
Thank You.
ASKER
I need to modify my question, how do I remove duplicate records if NSC, AGD and NSN have the same values? The file contains other data elements. For example,
<Root?
<Table>
<ID>1</ID>
<NSC>AAA</NSC>
<AGDC>BBB</AGD>
<NSN>CCC</NSN>
<FIF>DDD</FIF>
</Table>
<Table>
<ID>2</ID>
<NSC>AAA</NSC>
<AGDC>BBB</AGD>
<NSN>CCC</NSN>
<FIF>DDD</FIF>
</Table>
<Table>
<ID>3</ID>
<NSC>AAA</NSC>
<AGDC>BBB</AGD>
<NSN>CCC</NSN>
<FIF>DDD</FIF>
</Table>
</Root>
Victor