Remove duplicate lines with C# (LINQ preferred)

Dear all,

I want to remove the duplicate lines, as in the following example.
POF014	C	C	A	C	AG	T	G	AG	G	GA	T
POF014	C	C	A	C	AG	T	G	AG	G		
C10084	C	C	A	C	G	TC	GA	G	G	A	T
C10083	CA	CT	AG	CT	G	T	G	G	G	A	GT
C10076	A	T	G	C	G	C	GA	G	G	A	G
C10075	A	T	G	C	G	T	G	G	G	A	G
C10068	CA	CT	AG	C	G	TC	GA	G	G	GA	GT
C10067	CA	CT	AG	C	G	T	GA	G	G	A	GT
C10059	CA	CT	AG	C	AG	TC	A	AG	G	GA	GT
C10051	C	C	A	C		TC	GA	A	G	G	T
C10042	CA	CT	AG	C	G	C	GA	G	G	A	
C10034	A	T	G	CT	G	T	GA	G	G	A	G
C10024	A	T	G	CT	G	T	GA	G	G	A	G
C10018	CA	CT	AG	CT	G	TC	A	G	G	A	GT


The items in the first column are keys.
The first and second rows are duplicates. Usually duplicates are exactly the same, but sometimes one of them is missing information: in the second row, for example, the last two columns are empty. In that case, I would like to keep the row that has more information and delete the one that has less.

Thus the final result will be
POF014	C	C	A	C	AG	T	G	AG	G	GA	T
C10084	C	C	A	C	G	TC	GA	G	G	A	T
C10083	CA	CT	AG	CT	G	T	G	G	G	A	GT
C10076	A	T	G	C	G	C	GA	G	G	A	G
C10075	A	T	G	C	G	T	G	G	G	A	G
C10068	CA	CT	AG	C	G	TC	GA	G	G	GA	GT
C10067	CA	CT	AG	C	G	T	GA	G	G	A	GT
C10059	CA	CT	AG	C	AG	TC	A	AG	G	GA	GT
C10051	C	C	A	C		TC	GA	A	G	G	T
C10042	CA	CT	AG	C	G	C	GA	G	G	A	
C10034	A	T	G	CT	G	T	GA	G	G	A	G
C10024	A	T	G	CT	G	T	GA	G	G	A	G
C10018	CA	CT	AG	CT	G	TC	A	G	G	A	GT


Thanks.
Asked by: zhshqzyc
 
Fernando Soto commented:
Hi zhshqzyc;

What type of data structure is the data in: a class with properties, a DataTable, a database, ...?

Fernando
 
zhshqzyc (author) commented:
Just a text file. The delimiter is a tab.
 
Fernando Soto commented:
Hi zhshqzyc;

This should do what you need.

using System;
using System.IO;
using System.Linq;

// Load your file into memory
string[] data = File.ReadAllLines(@"Path to file .txt");

// Remove the dups: group the lines by key (first column) and keep
// the line(s) with the greatest character length in each group
var dataNoDups = (from line in data
                  let elements = line.Split(new char[] {'\t'}, StringSplitOptions.RemoveEmptyEntries)
                  group line by elements[0] into lineGroup
                  from ele in lineGroup
                  where ele.Length == lineGroup.Max(e => e.Length)
                  select ele).ToArray();

// dataNoDups is now an array of strings; each element is a line of data from the original file


Fernando
 
zhshqzyc (author) commented:
It works for deleting the lines that are missing information. However, when lines are exactly the same, I want to keep only one of them. I tested the code, and it seems to lack this behavior.
Could you please double-check it?
 
Fernando Soto commented:
Hi zhshqzyc;

Try it like this, with Distinct() added to drop identical lines:

var dataNoDups = (from line in data
                  let elements = line.Split(new char[] {'\t'}, StringSplitOptions.RemoveEmptyEntries)
                  group line by elements[0] into lineGroup
                  from ele in lineGroup
                  where ele.Length == lineGroup.Max (e => e.Length)
                  select ele).Distinct().ToArray();
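
A note on the query above: it ranks lines by character length, which matches the sample data but may mis-rank lines whose fields differ in width (for example "AG" versus "G"), and Distinct() still keeps two different lines for one key if they tie on length. Below is a minimal sketch of an alternative, not the code from this thread, that counts non-empty tab-separated fields and keeps exactly one line per key. The names TabFileDeduper, KeepMostCompleteLinePerKey and CountNonEmptyFields are hypothetical, and breaking ties in favour of the earliest line is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper class, introduced only for illustration.
public static class TabFileDeduper
{
    // Keeps one line per key, where the key is the first tab-separated field.
    // Within each key, the line with the most non-empty fields is kept.
    public static string[] KeepMostCompleteLinePerKey(IEnumerable<string> lines)
    {
        return lines
            // Skip blank lines so they do not form a group of their own.
            .Where(line => !string.IsNullOrWhiteSpace(line))
            // Group by key; LINQ to Objects yields groups in first-seen key order.
            .GroupBy(line => line.Split('\t')[0])
            // OrderByDescending is a stable sort, so on a tie the earliest
            // line in the input comes first (assumed tie-breaking rule).
            .Select(group => group
                .OrderByDescending(line => CountNonEmptyFields(line))
                .First())
            .ToArray();
    }

    // Counts the tab-separated fields that contain something other than whitespace.
    private static int CountNonEmptyFields(string line)
    {
        return line.Split('\t').Count(field => field.Trim().Length > 0);
    }
}
```

With the file loaded as in the earlier post, this could be called as TabFileDeduper.KeepMostCompleteLinePerKey(data), and the result written back with File.WriteAllLines if a file is wanted. Because only one line survives per key, no separate Distinct() call is needed.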

 
zhshqzyc (author) commented:
Great! Thanks.
 
Fernando Soto commented:
Not a problem, glad I was able to help.