
# Drop specific columns

Posted on 2011-10-19
Last Modified: 2012-05-12
Hello, I have a flat file which has many columns, say 200000 columns.
The file is big, say 200MB~2GB.

I want to create a new file but exclude some specific columns.
For example:

    List<int> list = new List<int>();
    list.Add(5);
    list.Add(34);
    list.Add(222);

...

I want to remove the columns in this list.
Is there a fast way? I used to parse the text file and split each line into an array...
Question by: zhshqzyc

LVL 74

Expert Comment

> The file is big, say 200MB~2GB.

That sounds like you'd want to use a StreamReader.

> I want to remove these columns in list.

What delimits a "column"?

Author Comment

Tab-delimited.
Can I use DataTable?

LVL 74

Expert Comment

> Can I use DataTable?

You can, but I don't think you'll see a speed benefit.

For your purposes, do column indexes start at 0 or 1? For example, you have list.Add(5);. Is this the 5th or the 6th column in your file?

Author Comment

It does not matter. The index can be zero-based.

LVL 17

Accepted Solution

Hello zhshqzyc, I have put together this example. I think you can use it:

    string myOriginalFile = @"C:\Temp\OriginalFile.txt";
    string myNewFile = @"C:\Temp\NewFile.txt";

    // Zero-based indexes of the columns to drop. A HashSet gives fast lookups.
    System.Collections.Generic.HashSet<int> columnToRemove = new System.Collections.Generic.HashSet<int>();
    columnToRemove.Add(5);
    columnToRemove.Add(34);
    columnToRemove.Add(222);

    // Stream the file line by line so the whole file never has to fit in memory.
    using (System.IO.StreamReader sr = new System.IO.StreamReader(myOriginalFile))
    using (System.IO.StreamWriter sw = new System.IO.StreamWriter(myNewFile))
    {
        string textLine;
        while ((textLine = sr.ReadLine()) != null)
        {
            string[] values = textLine.Split('\t');
            bool appendTab = false; // becomes true once the first kept column is written

            for (int i = 0; i < values.Length; i++)
            {
                if (columnToRemove.Contains(i))
                    continue;

                if (appendTab)
                    sw.Write('\t');

                sw.Write(values[i]);
                appendTab = true;
            }
            sw.WriteLine();
        }
    }


Please tell me how it performs with your BIG file.

Author Closing Comment

OK, but if LINQ could work on it, that would be great.

LVL 17

Expert Comment

Hello, in what way do you wish to use this with LINQ?

Author Comment

Something like Skip() or Take(). But I think that it is almost impossible.

LVL 17

Expert Comment

For example, Skip(5) will skip the first 5 columns?

Author Comment

No, Skip(5) means skip the 5th column (0-based index) in LINQ.

LVL 17

Expert Comment

Are you sure you know how LINQ works? What you have said does not make sense to me.
You can find some LINQ examples here:
http://msdn.microsoft.com/en-us/vcsharp/aa336746
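
For reference, `Skip` and `Take` operate on a count of leading elements rather than on a single position. A minimal sketch with a made-up seven-column line (the sample data and the class name `SkipTakeDemo` are illustrative and not from the thread):

```csharp
using System;
using System.Linq;

public static class SkipTakeDemo
{
    public static void Main()
    {
        // Seven columns, indexes 0 through 6.
        string[] columns = "a\tb\tc\td\te\tf\tg".Split('\t');

        // Skip(n) drops the first n elements, not the element at index n.
        Console.WriteLine(string.Join(",", columns.Skip(5)));   // f,g

        // Take(n) keeps only the first n elements.
        Console.WriteLine(string.Join(",", columns.Take(2)));   // a,b

        // To drop only the element at index 5, filter on the index instead.
        Console.WriteLine(string.Join(",", columns.Where((value, index) => index != 5)));   // a,b,c,d,e,g
    }
}
```

The `Where` overload that passes the element index to the predicate is the piece that maps most directly onto "remove column N".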

LVL 74

Expert Comment

Are you after something like this?

    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    namespace _27405384
    {
        class Program
        {
            static void Main(string[] args)
            {
                const char DELIMITER = '\t';

                using (StreamReader reader = new StreamReader(@"C:\path\to\file.txt"))
                {
                    using (StreamWriter writer = new StreamWriter("output.txt"))
                    {
                        List<int> list = new List<int>()
                        {
                            5,
                            34,
                            222,
                        };

                        list.Sort();

                        while (!reader.EndOfStream)
                        {
                            var columns = reader.ReadLine().Split('\t');
                            int curCol = 1; // column numbers are 1-based in this version
                            string line = columns.Aggregate(string.Empty, (accumulator, current) => accumulator += (!list.Contains(curCol++) ? current + '\t' : string.Empty));

                            writer.WriteLine(line); // every kept value is followed by a tab, so each line ends with one
                        }
                    }
                }
            }
        }
    }
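
A possible alternative that stays within LINQ is to filter on the column index with `Where` and let `string.Join` place the delimiters. The sketch below is not from the thread: the `ColumnFilter` class and `DropColumns` method are hypothetical names, indexes are assumed to be zero-based, and it takes a `TextReader`/`TextWriter` pair so it can be tried on in-memory strings as well as on files.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ColumnFilter
{
    // Hypothetical helper: copies delimited lines from reader to writer,
    // omitting the zero-based column indexes listed in columnsToRemove.
    public static void DropColumns(TextReader reader, TextWriter writer,
                                   ISet<int> columnsToRemove, char delimiter = '\t')
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // The (value, index) overload of Where exposes each column's position.
            IEnumerable<string> kept = line.Split(delimiter)
                                           .Where((value, index) => !columnsToRemove.Contains(index));

            // string.Join puts the delimiter only between values, so no trailing tab.
            writer.WriteLine(string.Join(delimiter.ToString(), kept));
        }
    }

    public static void Main()
    {
        var remove = new HashSet<int> { 1, 3 };
        using (var reader = new StringReader("a\tb\tc\td\ne\tf\tg\th"))
        using (var writer = new StringWriter())
        {
            DropColumns(reader, writer, remove);
            Console.Write(writer.ToString());   // two lines: a<TAB>c and e<TAB>g
        }
    }
}
```

For real files, a `StreamReader` and `StreamWriter` can be passed in place of the string-based reader and writer. Joining once per line also avoids the repeated string concatenation that gets expensive when a line has hundreds of thousands of columns.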


Author Comment

kaufmed: You are great. Several MVPs couldn't figure it out, but you did. I hate that I could not give you 10000 points.

LVL 74

Expert Comment

It's cool. Your appreciation is enough for me = )

Glad it worked for you.

LVL 74

Expert Comment

P.S.

I forgot to use my DELIMITER constant. You can either remove it or substitute it for the two literal tab characters: the one in the Split call and the one inside the Aggregate lambda (lines 28 and 30 of the listing). Also, the call to Sort (line 24) serves no purpose. I was originally writing something a bit more low-level, where a sorted list would have made the code simpler. Since you were after LINQ, the code got simpler anyway, and sorting brought no tangible benefit. You can safely remove the call to Sort or leave it in; it isn't crucial to the overall logic.

LVL 17

Expert Comment

Oh, now I see your point. You can give the points to kaufmed if his solution works best for you; I have no problem with that.

LVL 17

Expert Comment

Hey zhshqzyc, just one thing, please: could you test both versions with your file and tell me how long each one takes? I only want to know how big the LINQ performance hit is, if there is one.
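
One simple way to make such a comparison is `System.Diagnostics.Stopwatch`. The sketch below is only a harness: `Timing` and `Measure` are hypothetical names, and the two lambda bodies are placeholders to be filled with calls to the implementations being compared.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Hypothetical helper: runs an action once and returns the elapsed wall-clock time.
    public static TimeSpan Measure(Action action)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    public static void Main()
    {
        // Placeholders: put the plain-loop and LINQ versions inside the lambdas.
        TimeSpan loopTime = Measure(() => { /* plain-loop version goes here */ });
        TimeSpan linqTime = Measure(() => { /* LINQ version goes here */ });

        Console.WriteLine("Loop: {0} ms", loopTime.TotalMilliseconds);
        Console.WriteLine("LINQ: {0} ms", linqTime.TotalMilliseconds);
    }
}
```

With files this large, disk caching can favor whichever version runs second, so it is worth running each version more than once and alternating the order.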

LVL 74

Expert Comment

This (paraphrased) change seems to run a tad faster than my previous offering:

...

    while (!reader.EndOfStream)
    {
        var columns = reader.ReadLine().Split(DELIMITER);
        int curCol = 1;
        System.Text.StringBuilder line = new System.Text.StringBuilder();

        columns.Aggregate(string.Empty, (accumulator, current) => line.Append((!list.Contains(curCol++) ? current + DELIMITER : string.Empty)).Length.ToString());

        writer.WriteLine(line);
    }

...


It's a total misuse of the language, but it does seem to work = )

LVL 17

Expert Comment

@kaufmed, also you can write directly to the writer without the need of a StringBuilder instance.

LVL 74

Expert Comment

> also you can write directly to the writer without the need of a StringBuilder instance.

That wouldn't work for my example because, inside the Aggregate body, whatever operation you perform has to return a string. The Write* family of calls are all void-returning methods, so there isn't an immediate way that I can see to return a string.

I could just be missing it, though  = )
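
One detail that may help here: the lambda passed to `Aggregate` has to return the accumulator type, and that type is fixed by the seed. It is `string` in the example only because the seed is `string.Empty`. A sketch with the writer itself as the seed (the names `AggregateDemo` and `WriteAll` are hypothetical, and this is still arguably a misuse of `Aggregate`, since the lambda exists only for its side effect):

```csharp
using System;
using System.IO;
using System.Linq;

public static class AggregateDemo
{
    // Hypothetical helper: writes every value followed by a delimiter and returns the writer.
    public static TextWriter WriteAll(string[] values, TextWriter writer, char delimiter)
    {
        // The seed is a TextWriter, so the lambda must return a TextWriter.
        return values.Aggregate(writer, (w, current) =>
        {
            w.Write(current);
            w.Write(delimiter);
            return w;
        });
    }

    public static void Main()
    {
        var writer = new StringWriter();
        WriteAll(new[] { "a", "b", "c" }, writer, '\t');
        Console.WriteLine(writer.ToString().Replace('\t', '|'));   // a|b|c|
    }
}
```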

LVL 17

Expert Comment

Ah ok, not immediate, but you can use a helper method to do that :)

LVL 17

Expert Comment

I don't know if I explained that well. I mean:

    columns.Aggregate(string.Empty, (accumulator, current) => MyMethod(writer, (!list.Contains(curCol++) ? current : null)));
    ...
    // DELIMITER must be declared at class level to be visible in this method.
    static string MyMethod(StreamWriter writer, string current)
    {
        if (current != null)
        {
            writer.Write(current);
            writer.Write(DELIMITER);
        }
        return String.Empty;
    }

But in the end, I think we are losing the purpose of LINQ here.
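
A middle ground, sketched here under the assumption of zero-based indexes, is to let LINQ do only the filtering and keep the writing in an ordinary loop. No `StringBuilder`, helper method, or `Aggregate` call is needed. The names `DirectWrite` and `WriteKeptColumns` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class DirectWrite
{
    // Hypothetical helper: writes the kept columns of one line straight to the writer.
    public static void WriteKeptColumns(string line, TextWriter writer,
                                        ISet<int> columnsToRemove, char delimiter)
    {
        bool first = true;

        // LINQ performs the filtering; the loop handles the side effect of writing.
        foreach (string value in line.Split(delimiter)
                                     .Where((v, index) => !columnsToRemove.Contains(index)))
        {
            if (!first)
                writer.Write(delimiter);

            writer.Write(value);
            first = false;
        }
        writer.WriteLine();
    }

    public static void Main()
    {
        var writer = new StringWriter();
        WriteKeptColumns("a\tb\tc\td", writer, new HashSet<int> { 0, 2 }, '\t');
        Console.Write(writer.ToString().Replace('\t', '|'));   // b|d
    }
}
```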