• Status: Solved

Remove duplicate lines quickly

Hi, I have a huge text file with millions of lines. I want to delete duplicate lines, if any exist.
Each line is formatted like this:
2 rs12345 4567 A G T C C T

The fields are separated by spaces.
Thanks
Asked:
zhshqzyc
 
zhshqzyc (Author) Commented:
rs12345 is duplicated here.
 
wdosanjos Commented:
You can try something like this.  It assumes the 2nd field (rs12345, in your example) is the key.

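using System;
using System.Collections.Generic;
using System.IO;

// Keys (2nd field) that have already been written to the output.
var keys = new HashSet<string>();

// 'using' closes both files even if an exception is thrown.
using (var input = new StreamReader(@"C:\temp\inputfile.txt"))
using (var output = new StreamWriter(@"C:\temp\outputfile.txt"))
{
	for (var line = input.ReadLine(); line != null; line = input.ReadLine())
	{
		var fields = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

		// Skip blank or malformed lines instead of throwing on fields[1].
		if (fields.Length < 2) continue;

		// HashSet.Add returns false when the key was already present.
		if (keys.Add(fields[1]))
		{
			output.WriteLine(line);
		}
	}
}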

 
zhshqzyc (Author) Commented:
Can I use LINQ?
Can you check my code?
var dataNoDups = (from line in data
                  let elements = line.Split(new char[] {' '}, StringSplitOptions.RemoveEmptyEntries)
                  group line by elements[1] into lineGroup
                  from ele in lineGroup
                  select ele).Distinct().ToArray();

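A note on the LINQ query above: it groups the lines by key and then flattens every group back out, so Distinct() ends up comparing whole lines. Two lines that share the key rs12345 but differ in any other field would both survive. If the goal is one line per key, as in the HashSet answer, taking the first line of each group is one way to do it. The following is a minimal sketch under that assumption; the class and method names (Dedup, FirstPerKey) and the second and third sample lines are made up for illustration and are not from the thread.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class Dedup
{
    // Hypothetical helper: keeps the first line seen for each key,
    // where the key is assumed to be the 2nd space-separated field.
    public static IEnumerable<string> FirstPerKey(IEnumerable<string> lines)
    {
        return lines
            .Select(line => new
            {
                Line = line,
                Fields = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            })
            .Where(x => x.Fields.Length > 1)   // skip blank or malformed lines
            .GroupBy(x => x.Fields[1])         // groups come out in first-seen key order
            .Select(g => g.First().Line);      // keep only the first line of each group
    }

    static void Main()
    {
        // Made-up sample data; only the first line appears in the question.
        var data = new[]
        {
            "2 rs12345 4567 A G T C C T",
            "2 rs12345 9999 A A T C C T",   // same key, different content
            "3 rs67890 1234 G G T C C T"
        };

        foreach (var line in FirstPerKey(data))
            Console.WriteLine(line);
    }
}
```

With this sample input the first and third lines are printed. For a real file, something like File.WriteAllLines(outputPath, Dedup.FirstPerKey(File.ReadLines(inputPath))) should work, with both path variables being placeholders. Keep in mind that GroupBy buffers every line before yielding anything, so for a file with millions of lines the HashSet loop, which stores only the keys, is likely to use much less memory.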
Question has a verified solution.
