zhshqzyc


Drop specific columns

Hello, I have a flat file which has many columns, say 200000 columns.
The file is big, say 200MB~2GB.

I want to create a new file but exclude some specific columns.
For example:
List<int> list = new List<int>();
list.Add(5);
list.Add(34);
list.Add(222);

...


I want to remove the columns in this list.
Is there a fast way? I used to parse the text file and then split each line into an array...
kaufmed

The file is big, say 200MB~2GB.
That sounds like you'd want to use a StreamReader.

I want to remove these columns in list.
What delimits a "column"?
zhshqzyc

ASKER

The columns are tab-delimited.
Can I use DataTable?
Can I use DataTable?
You can, but I don't think you'll see a speed benefit.

For your purposes, do column indexes start with 0 or 1? For example, you have list.Add(5);. Is this the 5th or 6th column in your file?
It does not matter. Index can be zero based.
ASKER CERTIFIED SOLUTION
Carlos Villegas

This solution is only available to members.
OK, but if LINQ could work on it, it would be great.
Hello, in what way do you wish to use this with LINQ?
Something like Skip() or Take(). But I think that it is almost impossible.
For example, Skip(5) will skip the first 5 columns?
No, Skip(5) means skip the 5th column (0-based index) in LINQ.
Are you sure you know how LINQ works? What you have said does not make sense to me...
You can find some LINQ examples here:
http://msdn.microsoft.com/en-us/vcsharp/aa336746
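To make the Skip() question concrete, here is a small sketch. Skip(5) discards the first five elements of a sequence; it does not remove the element at position 5. Dropping specific positions is usually done with the Where overload that also passes the element's index. The class name SkipVersusWhere and the helper DropAt are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SkipVersusWhere
{
    // Hypothetical helper: keeps every element whose zero-based
    // position is NOT in 'drop'.
    public static IEnumerable<string> DropAt(IEnumerable<string> columns, ISet<int> drop)
    {
        return columns.Where((value, index) => !drop.Contains(index));
    }

    public static void Demo()
    {
        string[] columns = { "c0", "c1", "c2", "c3", "c4", "c5", "c6" };

        // Skip(5) throws away the first five elements (c0..c4).
        Console.WriteLine(string.Join(",", columns.Skip(5)));
        // prints: c5,c6

        // The indexed Where overload removes only position 5.
        Console.WriteLine(string.Join(",", DropAt(columns, new HashSet<int> { 5 })));
        // prints: c0,c1,c2,c3,c4,c6
    }
}
```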
Are you after something like this?

using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace _27405384
{
    class Program
    {
        static void Main(string[] args)
        {
            const char DELIMITER = '\t';

            using (StreamReader reader = new StreamReader(@"C:\path\to\file.txt"))
            {
                using (StreamWriter writer = new StreamWriter("output.txt"))
                {
                    List<int> list = new List<int>()
                    {
                        5,
                        34,
                        222,
                    };

                    list.Sort();

                    while (!reader.EndOfStream)
                    {
                        var columns = reader.ReadLine().Split('\t');
                        int curCol = 1;
                        string line = columns.Aggregate(string.Empty, (accumulator, current) => accumulator += (!list.Contains(curCol++) ? current + '\t' : string.Empty));

                        writer.WriteLine(line);
                    }
                }
            }
        }
    }
}


kaufmed: You are great. Several MVPs couldn't figure it out but you did. I hate that I could not give you 10000 points.
It's cool. Your appreciation is enough for me  = )

Glad it worked for you.
P.S.

I forgot to use my DELIMITER variable; you can either remove it or substitute its name for the two occurrences of the tab character in lines 28 & 30 (the Split call and the Aggregate lambda). Also, the call to Sort in line 24 is of no use. I was originally writing something a bit more low-level, where a sorted list would have made for easier code. Since you said you were after LINQ, the code got simpler still, and sorting the list brought no tangible benefit. You can safely remove the call to Sort, or you can leave it in; it shouldn't hurt either way. I just wanted to mention that it isn't crucial to the overall logic.
Oh, now I see your point. Hey, you can give the points to kaufmed if his solution works best for you; I have no problem with that.
Hey zhshqzyc, just one thing please: could you test both versions of the code with your file and tell me how much time each one takes? I only want to know how big the LINQ performance hit is, if there is one.
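For reference, a tidied variant of the same idea might look like the sketch below. It is not the code posted above: it assumes a HashSet<int> instead of the List<int> (so each lookup does not scan the list), uses the indexed Where overload instead of Aggregate, and joins the kept fields with string.Join, which means no trailing tab is written. Positions are treated as one-based to match curCol starting at 1. The names ColumnFilter, DropColumns and FilterFile are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class ColumnFilter
{
    const char Delimiter = '\t';

    // Returns 'line' without the columns whose one-based position is in 'drop'.
    public static string DropColumns(string line, ISet<int> drop)
    {
        var kept = line.Split(Delimiter)
                       .Where((value, index) => !drop.Contains(index + 1));
        return string.Join(Delimiter.ToString(), kept);
    }

    // Streams inputPath to outputPath one line at a time, so the whole
    // file never has to fit in memory.
    public static void FilterFile(string inputPath, string outputPath, ISet<int> drop)
    {
        using (var reader = new StreamReader(inputPath))
        using (var writer = new StreamWriter(outputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                writer.WriteLine(DropColumns(line, drop));
            }
        }
    }
}
```

A call such as ColumnFilter.FilterFile(@"C:\path\to\file.txt", "output.txt", new HashSet<int> { 5, 34, 222 }) would mirror the original example.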
This (paraphrased) change seems to run a tad faster than my previous offering:

...

while (!reader.EndOfStream)
{
    var columns = reader.ReadLine().Split(DELIMITER);
    int curCol = 1;
    System.Text.StringBuilder line = new System.Text.StringBuilder();

    columns.Aggregate(string.Empty, (accumulator, current) => line.Append((!list.Contains(curCol++) ? current + DELIMITER : string.Empty)).Length.ToString());

    writer.WriteLine(line);
}

...



It's a total misuse of the language, but it does seem to work   = )
@kaufmed, also you can write directly to the writer without the need of a StringBuilder instance.
also you can write directly to the writer without the need of a StringBuilder instance.
That wouldn't work for my example because inside of the Aggregate's body, whatever operation you do has to return a string. The Write* family of calls are all void returning functions, so there isn't an immediate way that I can see to return a string.

I could just be missing it, though  = )
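Ah OK, not immediately, but you can use a helper method to do that :)
I don't know if I explained it well; I mean:
columns.Aggregate(string.Empty, (accumulator, current) => MyMethod(writer, (!list.Contains(curCol++) ? current : null)));
...
// Note: DELIMITER must be declared at class level (not inside Main) for this to compile.
static string MyMethod(StreamWriter writer, string current)
{
    // Write the column only when it was not filtered out (null marks a dropped column).
    if (current != null)
    {
        writer.Write(current);
        writer.Write(DELIMITER);
    }
    // Aggregate needs a string result; the accumulated value itself is never used.
    return String.Empty;
}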


But in the end, I think we are losing the purpose of LINQ here.
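If LINQ is not a requirement, the "write directly to the writer" idea can be sketched with a plain loop. This is only an illustration under stated assumptions, not code from either poster: the names DirectWriter and WriteKept are hypothetical, positions are one-based as in the examples above, and the delimiter is written between kept fields rather than after each one.

```csharp
using System.Collections.Generic;
using System.IO;

static class DirectWriter
{
    // Writes the fields of 'line' straight to 'writer', skipping the
    // one-based positions listed in 'drop'. No intermediate string is built.
    public static void WriteKept(TextWriter writer, string line, ISet<int> drop, char delimiter)
    {
        string[] columns = line.Split(delimiter);
        bool first = true;

        for (int i = 0; i < columns.Length; i++)
        {
            if (drop.Contains(i + 1))
            {
                continue; // this column is being removed
            }

            if (!first)
            {
                writer.Write(delimiter); // separator only between kept fields
            }

            writer.Write(columns[i]);
            first = false;
        }

        writer.WriteLine();
    }
}
```

Inside the existing while loop this would be called as DirectWriter.WriteKept(writer, reader.ReadLine(), drop, '\t'), with drop being a HashSet<int> built from the list.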