Solved

Quickly finding the match to extract the expected rows

Posted on 2011-05-10
Last Modified: 2012-05-11
Hi,

I have a huge text file (50 GB), say xyz.txt. Its format looks like this:
21 rs885550 0 9887804 C C C C C C C C C C C C C ......
21 rs169757 0 9928594 0 0 0 0 0 0 0 0 0 0 0 0 0 .......
21 rs210498 0 9928860 0 0 0 0 0 0 0 0 0 0 0 0 0 .......
21 rs210499 0 9929079 0 0 0 0 0 0 0 0 0 0 0 0 0 .......
21 rs303304 0 9941889 0 0 0 0 0 0 0 0 0 0 0 0 0 .......
21 rs4913553 0 9941912 0 0 0 0 0 0 0 0 0 0 0 0 ..... 

Please note the second column. It has a distinctive format and should always begin with "rs". The first column is a number, the third column is 0, and the fourth column is a large integer. After the fourth column, each element is a single character. The separator is a single space.
I also have a list of strings, abc:
rs10000092
rs10000100
rs10000101
rs1000012
rs10000121
rs10000124
rs1000013
rs10000136
rs10000139

Now I want to extract the lines of xyz.txt whose second column is contained in the list abc.
My code:
                StreamReader sr = new StreamReader("xyz.txt");
                List<string> list = new List<string>();
                while (!sr.EndOfStream)
                {
                    string line = sr.ReadLine ();
                    var fields = line.Split(' ');
                    if (!abc.Contains(fields[1]))
                        continue;
                    else
                        list.Add(line);
                }

My problem is that, because of the file's size, this is very slow: I still had no result after running the program for 30 hours. I guess that splitting each line into an array costs too much time.
So I want to improve the efficiency.
Thanks for your help.
Question by:zhshqzyc
4 Comments
 
LVL 23

Expert Comment

by:wdosanjos
ID: 35734796
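Please try the following. It looks the keys up in a HashSet, extracts only the second column instead of splitting the whole line, and writes the selected lines to a file instead of keeping them in memory: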

HashSet<string> abc = new HashSet<string>();
// populate abc with the list of keys for selection

StreamReader sr = new StreamReader("xyz.txt");
StreamWriter sw = new StreamWriter("selected.xyz.txt");

for (var line = sr.ReadLine(); line != null; line = sr.ReadLine())
{
	int start = line.IndexOf(' ') + 1;
	int len = line.IndexOf(' ', start) - start;
	var key = line.Substring(start, len);
	
	if (abc.Contains(key))
	{
		sw.WriteLine(line);
	}
}

sw.Close();
sr.Close();
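
The "populate abc" step is left open in the code above. The following is a minimal sketch of one way to fill the set, assuming the keys are stored one per line in a text file. The class and method names (`KeyFilter`, `LoadKeys`, `ExtractKey`) and the file name `abc.txt` are hypothetical and do not come from the original post. The sketch also guards against lines that do not contain two spaces, for which the `Substring` call above would throw.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class KeyFilter
{
    // Reads one key per line into a HashSet, so each lookup is O(1) on average
    // instead of a linear scan through a List<string>.
    public static HashSet<string> LoadKeys(TextReader reader)
    {
        var keys = new HashSet<string>(StringComparer.Ordinal);
        for (var line = reader.ReadLine(); line != null; line = reader.ReadLine())
        {
            var key = line.Trim();
            if (key.Length > 0)
                keys.Add(key); // duplicates in the key list are ignored by the set
        }
        return keys;
    }

    // Returns the text between the first and second space of the line,
    // or null if the line does not contain two spaces.
    public static string ExtractKey(string line)
    {
        int first = line.IndexOf(' ');
        if (first < 0) return null;
        int start = first + 1;
        int end = line.IndexOf(' ', start);
        if (end < 0) return null;
        return line.Substring(start, end - start);
    }
}
```

With these helpers, the set could be built with something like `using (var r = new StreamReader("abc.txt")) { abc = KeyFilter.LoadKeys(r); }`, and the loop body could skip any line for which `ExtractKey` returns null.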

 

Author Comment

by:zhshqzyc
ID: 35737286
It's fast, it took only one hour.
BTW, if I want to remove duplicate keys, like you did before, do we need to change the code?

Thanks.
 
LVL 23

Accepted Solution

by:wdosanjos (earned 2000 total points)
ID: 35739275
Please try the following code. It keeps track of whether each key has already been copied to the output file, so only the first line found for a key is written.

Dictionary<string, bool> abc = new Dictionary<string, bool>();
// populate abc with the list of keys for selection
// 	abc.Add("rs10000092", false);
// 	abc.Add("rs10000100", false);
//	etc...

StreamReader sr = new StreamReader("xyz.txt");
StreamWriter sw = new StreamWriter("selected.xyz.txt");

for (var line = sr.ReadLine(); line != null; line = sr.ReadLine())
{
	int start = line.IndexOf(' ') + 1;
	int len = line.IndexOf(' ', start) - start;
	var key = line.Substring(start, len);
	bool isCopied;
	
	if (abc.TryGetValue(key, out isCopied))
	{
		if (!isCopied)
		{
			sw.WriteLine(line);
			
			abc[key] = true;
		}
	}
}

sw.Close();
sr.Close();
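
A possible variation on the code above, offered as a sketch rather than as the accepted method: instead of a `Dictionary<string, bool>`, keep a `HashSet<string>` of keys still to be found and remove each key when its first line is written. `HashSet<T>.Remove` returns true only when the key was present, so later duplicates are skipped, and the loop can stop as soon as the set is empty instead of reading the rest of the file. The names `FirstMatchCopier` and `CopyFirstMatches` are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;

static class FirstMatchCopier
{
    // Copies the first line found for each key and returns the number of lines written.
    // The 'remaining' set is modified: every matched key is removed from it.
    public static int CopyFirstMatches(TextReader input, TextWriter output, HashSet<string> remaining)
    {
        int written = 0;
        string line;
        // Stop at end of input, or earlier once every key has been found.
        while (remaining.Count > 0 && (line = input.ReadLine()) != null)
        {
            int first = line.IndexOf(' ');
            if (first < 0) continue;            // no separator: skip the line
            int start = first + 1;
            int end = line.IndexOf(' ', start);
            if (end < 0) continue;              // no second separator: skip the line

            // Remove succeeds only the first time a key is seen.
            if (remaining.Remove(line.Substring(start, end - start)))
            {
                output.WriteLine(line);
                written++;
            }
        }
        return written;
    }
}
```

With files, the reader and writer would typically be wrapped in `using` blocks so that both are closed even if an exception occurs. Whatever is left in `remaining` afterwards is the list of keys that never appeared in the input.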

 

Author Closing Comment

by:zhshqzyc
ID: 35739938
Thanks.