Link to home
Start Free TrialLog in
Avatar of Victor  Charles
Victor CharlesFlag for United States of America

asked on

Help with removing duplicate records in an xml file using VB.NET

Hi,

How do you remove duplicate records in an xml file? For example if I have an xml file with the following data:

<Root?
<Table>
<ID>1</ID>
<NSC>AAA</NSC>
<AGDC>BBB</AGD>
<NSN>CCC</NSN>
</Table>
<Table>
<ID>2</ID>
<NSC>AAA</NSC>
<AGDC>BBB</AGD>
<NSN>CCC</NSN>
</Table>
<Table>
<ID>3</ID>
<NSC>AAA</NSC>
<AGDC>BBB</AGD>
<NSN>CCC</NSN>
</Table>
</Root>

How do I obtain the following xml fil?

<Root?
<Table>
<ID>1</ID>
<NSC>AAA</NSC>
<AGDC>BBB</AGD>
<NSN>CCC</NSN>
</Table>
</Root>
The actual xml file contains more data element and more records.

Thanks,

Victor
Avatar of Victor  Charles
Victor Charles
Flag of United States of America image

ASKER

Hi,

I need to modify my question, how do I remove duplicate records if NSC, AGD and NSN have the same values? The file contains other data elements. For example,

<Root?
 <Table>
 <ID>1</ID>
 <NSC>AAA</NSC>
 <AGDC>BBB</AGD>
 <NSN>CCC</NSN>
 <FIF>DDD</FIF>
 </Table>
 <Table>
 <ID>2</ID>
 <NSC>AAA</NSC>
 <AGDC>BBB</AGD>
 <NSN>CCC</NSN>
 <FIF>DDD</FIF>
 </Table>
 <Table>
 <ID>3</ID>
 <NSC>AAA</NSC>
 <AGDC>BBB</AGD>
 <NSN>CCC</NSN>
 <FIF>DDD</FIF>
 </Table>
 </Root>


Victor
ASKER CERTIFIED SOLUTION
Avatar of Dustin Hopkins
Dustin Hopkins
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Thank you, will try it and get back to you.
@DUSTIN... you are the man! :) :) :) Thanks heavens for LINQ! Check this out without LINQ:
Dim doc As New XmlDocument
doc.Load("c:\temp\duplicates.xml")

Dim duplicates As New NameValueCollection, markedDuplicates As New list(Of String)
For Each node As XmlNode In doc.SelectNodes("//Table")

	Try
		
		Dim id As String = node.SelectSingleNode("ID").InnerText
		If markedDuplicates.Contains(id) Then Continue For
		
		Dim dupsXPath As String = String.Empty
		'We loop through all the nodes under the Table element and build an Xpath to find duplicates nodes
		'This loop avoids hard coding the child element names. I see a few new ones from previous questions, AGDC
		For Each el As XmlNode In node.ChildNodes
			
			If el.Name = "ID" Then
				dupsXPath &= Iif(String.IsNullOrWhiteSpace(dupsXPath), String.Empty, " and ") & String.Format("{0} != ""{1}""", el.Name, el.InnerText)
			Else
				dupsXPath &= Iif(String.IsNullOrWhiteSpace(dupsXPath), String.Empty, " and ") & String.Format("{0} = ""{1}""", el.Name, el.InnerText)
			End If
			
		Next
		dupsXPath = String.Format("//Table[{0}]", dupsXPath)

		'The XPath returns some nodes. We keep track of which ones they are!
		For Each dup As XmlNode In doc.SelectNodes(dupsXPath)
		
			Dim dup_id As String = dup.SelectSingleNode("ID").InnerText
			duplicates.Add(id, dup_id)
			markedDuplicates.Add(dup_id)
		
		Next
		
	Catch ex As Exception
		ex.Dump
	End Try

Next

'We remove the identified nodes
Dim duplicate_nodes = duplicates.AllKeys.SelectMany(AddressOf duplicates.GetValues, Function(k, v) New With {.NodeToKeepId = k, .DuplicateNodeId = v}).ToList()
duplicate_nodes.ForEach(Function(node) doc.DocumentElement.RemoveChild(doc.SelectSingleNode(String.Format("//Table[ID=""{0}""]", node.DuplicateNodeId))))

Open in new window

I created a more complete sample input file, which has nodes without duplicates, and a second set of duplicates apart from what Author provided. Both approaches work. But yours is really nice and concise!
<Root>
  <Table>
    <ID>1</ID>
    <NSC>AAA</NSC>
    <AGDC>BBB</AGDC>
    <NSN>CCC</NSN>
  </Table>
  <Table>
    <ID>2</ID>
    <NSC>AAA</NSC>
    <AGDC>BBB</AGDC>
    <NSN>CCC</NSN>
  </Table>
  <Table>
    <ID>3</ID>
    <NSC>AAA</NSC>
    <AGDC>BBB</AGDC>
    <NSN>CCC</NSN>
  </Table>
  <Table>
    <ID>4</ID>
    <NSC>AAA</NSC>
    <AGDC>XXX</AGDC>
    <NSN>CCC</NSN>
  </Table>
  <Table>
    <ID>5</ID>
    <NSC>AAA</NSC>
    <AGDC>XXX</AGDC>
    <NSN>CCC</NSN>
  </Table>
  <Table>
    <ID>6</ID>
    <NSC>XXX</NSC>
    <AGDC>YYY</AGDC>
    <NSN>ZZZ</NSN>
  </Table>
</Root>

<Root>
  <Table>
    <ID>1</ID>
    <NSC>AAA</NSC>
    <AGDC>BBB</AGDC>
    <NSN>CCC</NSN>
  </Table>
  <Table>
    <ID>4</ID>
    <NSC>AAA</NSC>
    <AGDC>XXX</AGDC>
    <NSN>CCC</NSN>
  </Table>
  <Table>
    <ID>6</ID>
    <NSC>XXX</NSC>
    <AGDC>YYY</AGDC>
    <NSN>ZZZ</NSN>
  </Table>
</Root>

Open in new window

Thank You.