Solved

De-Duping a Text file

Posted on 2011-10-13
Last Modified: 2012-06-21
I have a text file of 100,000 lines, each 100 characters long. I need to de-dupe this file, and the de-duped output file must be in the same order as the input. What is a good way to process the file?
I have worked with StreamReader/StreamWriter. It is de-duping when the duplicate lines are scattered through the file in random order that I can't see how to do.
Could you include an example with the solution?
Question by:garyinmiami2003
9 Comments
 
LVL 28

Expert Comment

by:strickdd
ID: 36964163
// requires: using System.Collections.Generic; using System.IO;
using (StreamReader sr = new StreamReader("TestFile.txt")) 
{
	string line;
	List<string> lineList = new List<string>();
	List<int> deleteIndexes = new List<int>();
	
	//read every line of the file into memory
	while ((line = sr.ReadLine()) != null) 
	{
	    lineList.Add(line);	    
	}
	
	//Find duplicate indexes
	for(int i = 0; i<lineList.Count - 1; i++)
	{
		for(int j = i+1; j<lineList.Count; j++)
		{
			if(lineList[i] == lineList[j])
			{
				deleteIndexes.Add(j); //remove 2nd duplication of line
			}
		}
	}
	
	deleteIndexes.Reverse(); //start deletion from the last index to prevent shifting index issues
	
	foreach(int index in deleteIndexes)
	{
		lineList.RemoveAt(index);
	}
	
	//write lineList to streamwriter
	//(sw is assumed to be a StreamWriter that has already been declared and opened)
	foreach(string newLine in lineList)
	{
		sw.WriteLine(newLine);
	}
}

 

Author Comment

by:garyinmiami2003
ID: 36964190
Same code in vb.net?
 
LVL 28

Assisted Solution

by:strickdd
strickdd earned 200 total points
ID: 36964225
This should be close; I just put it through a converter:

Using sr As New StreamReader("TestFile.txt")
	Dim line As String
	Dim lineList As New List(Of String)()
	Dim deleteIndexes As New List(Of Integer)()

	'InlineAssignHelper is generated by the C#-to-VB converter; it is not built in and must be declared separately
	While (InlineAssignHelper(line, sr.ReadLine())) IsNot Nothing
		lineList.Add(line)
	End While

	'Find duplicate indexes
	For i As Integer = 0 To lineList.Count - 2
		For j As Integer = i + 1 To lineList.Count - 1
			If lineList(i) = lineList(j) Then
				'remove 2nd duplication of line
				deleteIndexes.Add(j)
			End If
		Next
	Next

	'start deletion from the last index to prevent shifting index issues
	deleteIndexes.Reverse()
	For Each index As Integer In deleteIndexes
		lineList.RemoveAt(index)
	Next

	'write lineList to streamwriter
	'(sw is assumed to be a StreamWriter that has already been declared and opened)
	For Each newLine As String In lineList
		sw.WriteLine(newLine)
	Next
End Using


 

Author Comment

by:garyinmiami2003
ID: 36964376
strickdd:

a couple of errors:
sw.WriteLine(newLine)
Not a member of StreamReader

InlineAssignHelper not declared


 
 
LVL 28

Expert Comment

by:strickdd
ID: 36964421
When you said "I have worked with streamreader/writer." I figured you had a stream writer variable declared and opened already. Add the declaration and put the variable name in place of the "sw" and it should work.
 
LVL 4

Assisted Solution

by:Ambusy
Ambusy earned 600 total points
ID: 36964448
The previous code is missing an If Not Contains test: if a line occurs three times, things go wrong, because the same index is deleted more than once.
 
        'requires Imports System.IO and Imports System.Collections.Generic
        Dim lineList As New List(Of String)
        Dim deleteIndexes As New List(Of Integer)
        Using sr As StreamReader = New StreamReader("a.txt")
            Dim line As String = sr.ReadLine()
            While Not (line Is Nothing)
                lineList.Add(line)
                line = sr.ReadLine()
            End While
        End Using
        'Find duplicate indexes
        For i As Integer = 0 To lineList.Count - 1
            For j As Integer = i + 1 To lineList.Count - 1
                If lineList(i) = lineList(j) Then
                    If Not deleteIndexes.Contains(j) Then
                        deleteIndexes.Add(j) 'remove 2nd duplication of line
                    End If
                End If
            Next
        Next
        'sort first: the indexes are not always found in ascending order (e.g. lines A, B, B, A give 3, then 2)
        deleteIndexes.Sort()
        deleteIndexes.Reverse() 'start deletion from the last index to prevent shifting index issues
        For Each index As Integer In deleteIndexes
            lineList.RemoveAt(index)
        Next
        'write lineList to streamwriter
        Using sw As New StreamWriter("a.txt")
            For Each newLine As String In lineList
                sw.WriteLine(newLine)
            Next
        End Using

 
LVL 86

Accepted Solution

by:
Mike Tomlinson earned 1200 total points
ID: 36964634
Here's another one:
Dim FileName As String = "C:\Users\Mike\Documents\SomeFile.txt"

        Dim lines As New List(Of String)
        lines.AddRange(System.IO.File.ReadAllLines(FileName))

        Dim index As Integer = 0
        Dim keys As New Dictionary(Of String, String)
        While index < lines.Count
            If Not keys.ContainsKey(lines(index)) Then
                keys.Add(lines(index), Nothing)
                index = index + 1
            Else
                lines.RemoveAt(index)
            End If
        End While

        System.IO.File.WriteAllLines(FileName, lines.ToArray)
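
A streaming variant of the same first-occurrence-wins idea might look like the sketch below. It is only an illustration under stated assumptions, not code from this thread: the module name DedupeSketch, the DedupeFile routine and both file names are hypothetical, and the output is assumed to go to a different file than the input, since the input is still being read while the output is written. It relies on HashSet(Of String).Add returning False for a value that is already in the set.

```vbnet
Imports System.Collections.Generic
Imports System.IO

Module DedupeSketch
    ' Copies inputPath to outputPath, keeping only the first occurrence of
    ' each line and preserving the original order.
    ' Both paths are hypothetical and must refer to different files.
    Sub DedupeFile(inputPath As String, outputPath As String)
        Dim seen As New HashSet(Of String)()
        Using sr As New StreamReader(inputPath), sw As New StreamWriter(outputPath)
            Dim line As String = sr.ReadLine()
            While line IsNot Nothing
                ' Add returns False when the line has been seen before
                If seen.Add(line) Then
                    sw.WriteLine(line)
                End If
                line = sr.ReadLine()
            End While
        End Using
    End Sub

    Sub Main()
        ' Hypothetical file names, for illustration only
        DedupeFile("TestFile.txt", "TestFile.deduped.txt")
    End Sub
End Module
```

Each line is looked up once in the set, so the work grows roughly in proportion to the number of lines, whereas the nested loops in the earlier replies compare every pair of lines. Only the distinct lines are held in memory.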

 

Author Closing Comment

by:garyinmiami2003
ID: 36965095
strickdd - I never got his to work, but I'm sure it is close.

Ambusy - I think that would work. I had a (hidden) double extension on the test file; that was my problem.

I wound up using idle mind's and got it to work.
My thanks to you all, and I hope you feel the points went out fairly.
 
LVL 4

Expert Comment

by:Ambusy
ID: 36967729
You sure did well.