Quickly finding the match to extract the expected rows

Hi,

I have a huge text file (50 GB), say xyz.txt. The format looks like this:
21 rs885550 0 9887804 C C C C C C C C C C C C C ......
21 rs169757 0 9928594 0 0 0 0 0 0 0 0 0 0 0 0 0 .......
21 rs210498 0 9928860 0 0 0 0 0 0 0 0 0 0 0 0 0 .......
21 rs210499 0 9929079 0 0 0 0 0 0 0 0 0 0 0 0 0 .......
21 rs303304 0 9941889 0 0 0 0 0 0 0 0 0 0 0 0 0 .......
21 rs4913553 0 9941912 0 0 0 0 0 0 0 0 0 0 0 0 ..... 


Please note the second column. Its format is unique: it should always begin with "rs". The first column is a number, the third column is 0, and the fourth column is a large integer. After the fourth column, each element is a single character. The separator is a single space.
I also have a list of strings, abc:
rs10000092
rs10000100
rs10000101
rs1000012
rs10000121
rs10000124
rs1000013
rs10000136
rs10000139


Now I want to extract the lines of xyz.txt whose second column appears in the list abc.
My code:
                StreamReader sr = new StreamReader("xyz.txt");
                List<string> list = new List<string>();
                while (!sr.EndOfStream)
                {
                    string line = sr.ReadLine ();
                    var fields = line.Split(' ');
                    if (!abc.Contains(fields[1]))
                        continue;
                    else
                        list.Add(line);
                }


My problem is that, because of the file's size, it is very slow. I still have no result after running the program for 30 hours. I guess that splitting each line into an array costs too much time.
So I want to improve the efficiency.
Thanks for your help.
zhshqzyc Asked:
 
wdosanjos Commented:
Please try the following code. It tracks whether each key has already been copied to the output file, so only the first line found for each key is written.

// Requires: using System.Collections.Generic; using System.IO;
Dictionary<string, bool> abc = new Dictionary<string, bool>();
// populate abc with the list of keys for selection;
// the value records whether the key has already been written
// 	abc.Add("rs10000092", false);
// 	abc.Add("rs10000100", false);
//	etc...

using (StreamReader sr = new StreamReader("xyz.txt"))
using (StreamWriter sw = new StreamWriter("selected.xyz.txt"))
{
	for (var line = sr.ReadLine(); line != null; line = sr.ReadLine())
	{
		// extract the second column without splitting the whole line
		int start = line.IndexOf(' ') + 1;
		int len = line.IndexOf(' ', start) - start;
		if (len <= 0)
			continue; // skip lines that lack a second column
		var key = line.Substring(start, len);
		bool isCopied;

		// write only the first line seen for each selected key
		if (abc.TryGetValue(key, out isCopied) && !isCopied)
		{
			sw.WriteLine(line);
			abc[key] = true;
		}
	}
}

 
wdosanjos Commented:
Please try the following. It looks up the keys in a HashSet and writes the selected lines to a file instead of keeping them in memory:

HashSet<string> abc = new HashSet<string>();
// populate abc with the list of keys for selection

StreamReader sr = new StreamReader("xyz.txt");
StreamWriter sw = new StreamWriter("selected.xyz.txt");

for (var line = sr.ReadLine(); line != null; line = sr.ReadLine())
{
	int start = line.IndexOf(' ') + 1;
	int len = line.IndexOf(' ', start) - start;
	var key = line.Substring(start, len);
	
	if (abc.Contains(key))
	{
		sw.WriteLine(line);
	}
}

sw.Close();
sr.Close();

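The "populate abc" step is left open in the snippet above. Here is a minimal sketch of one way to fill the set, assuming the keys are stored one per line in a text file. The names KeyLoader, LoadKeys and abc.txt are hypothetical and do not come from the thread. The set matters because HashSet<string>.Contains takes roughly constant time on average, while List<string>.Contains scans the whole list on every call. That scan, rather than the Split, is probably what made the original loop so slow.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class KeyLoader
{
    // Reads one key per line from any TextReader.
    // Blank lines are skipped and surrounding whitespace is trimmed.
    public static HashSet<string> LoadKeys(TextReader reader)
    {
        var keys = new HashSet<string>(StringComparer.Ordinal);
        for (var line = reader.ReadLine(); line != null; line = reader.ReadLine())
        {
            var key = line.Trim();
            if (key.Length > 0)
                keys.Add(key); // adding a key that is already present is a no-op
        }
        return keys;
    }
}
```

It could be used as: using (var r = new StreamReader("abc.txt")) { abc = KeyLoader.LoadKeys(r); }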

 
zhshqzyc (Author) Commented:
It's fast: it took only one hour.
By the way, if I want to skip duplicate keys, as you did before, do we need to change the code?

Thanks.
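
One possible way to skip duplicate keys with the HashSet version is to keep a second set of keys that have already been written. This is only a sketch: the names DedupFilter, CopyFirstMatches and written are hypothetical. It relies on HashSet<string>.Add returning false when the item is already in the set, and it takes a TextReader and TextWriter so it can be tried on small in-memory samples before running on the 50 GB file.

```csharp
using System.Collections.Generic;
using System.IO;

static class DedupFilter
{
    // Copies the first line found for each wanted key and returns the number of lines written.
    public static int CopyFirstMatches(TextReader input, TextWriter output, HashSet<string> wanted)
    {
        var written = new HashSet<string>(); // keys already copied to the output
        int count = 0;
        for (var line = input.ReadLine(); line != null; line = input.ReadLine())
        {
            // locate the second space-separated column without splitting the line
            int start = line.IndexOf(' ') + 1;
            int end = line.IndexOf(' ', start);
            if (start == 0 || end < 0)
                continue; // fewer than two spaces: no usable second column

            var key = line.Substring(start, end - start);

            // Add returns false if the key was written before
            if (wanted.Contains(key) && written.Add(key))
            {
                output.WriteLine(line);
                count++;
            }
        }
        return count;
    }
}
```

With real files, input would be a StreamReader over xyz.txt and output a StreamWriter over selected.xyz.txt, as in the code above.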
 
zhshqzyc (Author) Commented:
Thanks.
Question has a verified solution.
