Help with removing duplicate records in an xml file using VB.NET

Hi,

How do you remove duplicate records in an xml file? For example if I have an xml file with the following data:

<Root?
<Table>
<ID>1</ID>
<NSC>AAA</NSC>
<AGDC>BBB</AGD>
<NSN>CCC</NSN>
</Table>
<Table>
<ID>2</ID>
<NSC>AAA</NSC>
<AGDC>BBB</AGD>
<NSN>CCC</NSN>
</Table>
<Table>
<ID>3</ID>
<NSC>AAA</NSC>
<AGDC>BBB</AGD>
<NSN>CCC</NSN>
</Table>
</Root>

How do I obtain the following xml fil?

<Root?
<Table>
<ID>1</ID>
<NSC>AAA</NSC>
<AGDC>BBB</AGD>
<NSN>CCC</NSN>
</Table>
</Root>
The actual xml file contains more data element and more records.

Thanks,

Victor
vcharlesAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

vcharlesAuthor Commented:
Hi,

I need to modify my question, how do I remove duplicate records if NSC, AGD and NSN have the same values? The file contains other data elements. For example,

<Root?
 <Table>
 <ID>1</ID>
 <NSC>AAA</NSC>
 <AGDC>BBB</AGD>
 <NSN>CCC</NSN>
 <FIF>DDD</FIF>
 </Table>
 <Table>
 <ID>2</ID>
 <NSC>AAA</NSC>
 <AGDC>BBB</AGD>
 <NSN>CCC</NSN>
 <FIF>DDD</FIF>
 </Table>
 <Table>
 <ID>3</ID>
 <NSC>AAA</NSC>
 <AGDC>BBB</AGD>
 <NSN>CCC</NSN>
 <FIF>DDD</FIF>
 </Table>
 </Root>


Victor
0
Dustin HopkinsSenior Web DeveloperCommented:
Hi Victor,
This should do what you want.
Basically we are reading the file into an xDocument,
then using linq to group like elements,
then checking if the grouped items are greater than 1 meaning a duplicate exists,
then skipping the first result (we still want a single copy in our new list)
Then we delete the ones that are left.
Finally we save to a new file, or you could save to the last one.

VB:
Dim fileName = "input.xml"
        Dim xDoc As XDocument = XDocument.Load(fileName)
        Dim dups = (From n In xDoc.Descendants("Table")
                    Group n By g = new with {.Nsc = n.Element("NSC").Value.ToLower()}.Nsc, 
                                   new with {.Agd = n.Element("AGD").Value.ToLower()}.Agd,
                                   new with {.Nsn = n.Element("NSN").Value.ToLower()}.Nsn
                        Into nscGroup = Group
                    Where nscGroup.Count > 1
                    From i In nscGroup.Skip(1)
                    Select i).ToList()
        dups.ForEach(Sub(i) i.Remove())

        xDoc.Save("output.xml")

Open in new window


C#
string fileName = "input.xml";
            XDocument xDoc =XDocument.Load(fileName);
            (from n in xDoc.Descendants("Table")
                group n by new {nsc = n.Element("NSC").Value.ToLower(), agd = n.Element("AGD").Value.ToLower(), nsn = n.Element("NSN").Value.ToLower() }
                into nscGroup
                where nscGroup.Count() > 1
                from i in nscGroup.Skip(1)
                select i).Remove();

                xDoc.Save("output.xml");

Open in new window

Hope this helps,
Dustin
2

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
vcharlesAuthor Commented:
Thank you, will try it and get back to you.
0
MlandaTCommented:
@DUSTIN... you are the man! :) :) :) Thanks heavens for LINQ! Check this out without LINQ:
Dim doc As New XmlDocument
doc.Load("c:\temp\duplicates.xml")

Dim duplicates As New NameValueCollection, markedDuplicates As New list(Of String)
For Each node As XmlNode In doc.SelectNodes("//Table")

	Try
		
		Dim id As String = node.SelectSingleNode("ID").InnerText
		If markedDuplicates.Contains(id) Then Continue For
		
		Dim dupsXPath As String = String.Empty
		'We loop through all the nodes under the Table element and build an Xpath to find duplicates nodes
		'This loop avoids hard coding the child element names. I see a few new ones from previous questions, AGDC
		For Each el As XmlNode In node.ChildNodes
			
			If el.Name = "ID" Then
				dupsXPath &= Iif(String.IsNullOrWhiteSpace(dupsXPath), String.Empty, " and ") & String.Format("{0} != ""{1}""", el.Name, el.InnerText)
			Else
				dupsXPath &= Iif(String.IsNullOrWhiteSpace(dupsXPath), String.Empty, " and ") & String.Format("{0} = ""{1}""", el.Name, el.InnerText)
			End If
			
		Next
		dupsXPath = String.Format("//Table[{0}]", dupsXPath)

		'The XPath returns some nodes. We keep track of which ones they are!
		For Each dup As XmlNode In doc.SelectNodes(dupsXPath)
		
			Dim dup_id As String = dup.SelectSingleNode("ID").InnerText
			duplicates.Add(id, dup_id)
			markedDuplicates.Add(dup_id)
		
		Next
		
	Catch ex As Exception
		ex.Dump
	End Try

Next

'We remove the identified nodes
Dim duplicate_nodes = duplicates.AllKeys.SelectMany(AddressOf duplicates.GetValues, Function(k, v) New With {.NodeToKeepId = k, .DuplicateNodeId = v}).ToList()
duplicate_nodes.ForEach(Function(node) doc.DocumentElement.RemoveChild(doc.SelectSingleNode(String.Format("//Table[ID=""{0}""]", node.DuplicateNodeId))))

Open in new window

I created a more complete sample input file, which has nodes without duplicates, and a second set of duplicates apart from what Author provided. Both approaches work. But yours is really nice and concise!
<Root>
  <Table>
    <ID>1</ID>
    <NSC>AAA</NSC>
    <AGDC>BBB</AGDC>
    <NSN>CCC</NSN>
  </Table>
  <Table>
    <ID>2</ID>
    <NSC>AAA</NSC>
    <AGDC>BBB</AGDC>
    <NSN>CCC</NSN>
  </Table>
  <Table>
    <ID>3</ID>
    <NSC>AAA</NSC>
    <AGDC>BBB</AGDC>
    <NSN>CCC</NSN>
  </Table>
  <Table>
    <ID>4</ID>
    <NSC>AAA</NSC>
    <AGDC>XXX</AGDC>
    <NSN>CCC</NSN>
  </Table>
  <Table>
    <ID>5</ID>
    <NSC>AAA</NSC>
    <AGDC>XXX</AGDC>
    <NSN>CCC</NSN>
  </Table>
  <Table>
    <ID>6</ID>
    <NSC>XXX</NSC>
    <AGDC>YYY</AGDC>
    <NSN>ZZZ</NSN>
  </Table>
</Root>

<Root>
  <Table>
    <ID>1</ID>
    <NSC>AAA</NSC>
    <AGDC>BBB</AGDC>
    <NSN>CCC</NSN>
  </Table>
  <Table>
    <ID>4</ID>
    <NSC>AAA</NSC>
    <AGDC>XXX</AGDC>
    <NSN>CCC</NSN>
  </Table>
  <Table>
    <ID>6</ID>
    <NSC>XXX</NSC>
    <AGDC>YYY</AGDC>
    <NSN>ZZZ</NSN>
  </Table>
</Root>

Open in new window

0
vcharlesAuthor Commented:
Thank You.
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Visual Basic.NET

From novice to tech pro — start learning today.

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.