Drop specific columns

Hello, I have a flat file which has many columns, say 200000 columns.
The file is big, say 200MB~2GB.

I want to create a new file but exclude some specific columns.
For example:
List<int> list = new List<int>();
list.Add(5);
list.Add(34);
list.Add(222);

...

Open in new window

I want to remove these columns in list.
Is there a fast way? I used to use parse the text file then split each line to an array...
zhshqzycAsked:
Who is Participating?
 
Carlos VillegasConnect With a Mentor Full Stack .NET DeveloperCommented:
Hello zhshqzyc, I have made this example, I think you can use it:
string myOriginalFile = @"C:\Temp\OriginalFile.txt";
string myNewFile = @"C:\Temp\NewFile.txt";

System.Collections.Generic.HashSet<int> columnToRemove = new System.Collections.Generic.HashSet<int>();
// Index is zero based.
columnToRemove.Add(5);
columnToRemove.Add(34);
columnToRemove.Add(222);

using (System.IO.StreamReader sr = new System.IO.StreamReader(myOriginalFile))
{
    using (System.IO.StreamWriter sw = new System.IO.StreamWriter(myNewFile))
    {
        string textLine = sr.ReadLine();
        object[] values = null;
        bool appendTab = false;
        while (textLine != null)
        {
            values = textLine.Split('\t');
            appendTab = false;
            for (int i = 0; i < values.Length; i++)
            {
                if (!columnToRemove.Contains(i))
                {
                    if (appendTab)
                        sw.Write('\t');

                    sw.Write(values[i]);

                    if (!appendTab)
                        appendTab = true;
                }
            }
            sw.WriteLine();
            textLine = sr.ReadLine();
        }
    }
}

Open in new window


Please, tell me how it performs with your BIG file.
0
 
käµfm³d 👽Commented:
The file is big, say 200MB~2GB.
That sounds like you'd want to use a StreamReader.

I want to remove these columns in list.
What delimits a "column"?
0
 
zhshqzycAuthor Commented:
tab delimiter.
Can I use DataTable?
0
Get expert help—faster!

Need expert help—fast? Use the Help Bell for personalized assistance getting answers to your important questions.

 
käµfm³d 👽Commented:
Can I use DataTable?
You can, but I don't think you'll see a speed benefit.

For your purposes, do columns "index" start with 0 or 1? For example, you have list.Add(5);. Is this the 5th or 6th column in your file?
0
 
zhshqzycAuthor Commented:
It does not matter. Index can be zero based.
0
 
zhshqzycAuthor Commented:
Ok, but if linq could works on it, it would be great.
0
 
Carlos VillegasFull Stack .NET DeveloperCommented:
Hello, in what way do you wish to use this with LINQ?
0
 
zhshqzycAuthor Commented:
Like Skip(), Take() something. But I think that it is almost impossible.
0
 
Carlos VillegasFull Stack .NET DeveloperCommented:
For example, Skip(5) will skip the first 5 columns?
0
 
zhshqzycAuthor Commented:
No, Skip(5) means skip the 5th column(0 based index) in Linq.
0
 
Carlos VillegasFull Stack .NET DeveloperCommented:
Are you sure how to work with LINQ? what you have said does not makes sense to me...
You can find some LINQ examples here:
http://msdn.microsoft.com/en-us/vcsharp/aa336746
0
 
käµfm³d 👽Commented:
Are you after something like this?

using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace _27405384
{
    class Program
    {
        static void Main(string[] args)
        {
            const char DELIMITER = '\t';

            using (StreamReader reader = new StreamReader(@"C:\path\to\file.txt"))
            {
                using (StreamWriter writer = new StreamWriter("output.txt"))
                {
                    List<int> list = new List<int>()
                    {
                        5,
                        34,
                        222,
                    };

                    list.Sort();

                    while (!reader.EndOfStream)
                    {
                        var columns = reader.ReadLine().Split('\t');
                        int curCol = 1;
                        string line = columns.Aggregate(string.Empty, (accumulator, current) => accumulator += (!list.Contains(curCol++) ? current + '\t' : string.Empty));

                        writer.WriteLine(line);
                    }
                }
            }
        }
    }
}

Open in new window

0
 
zhshqzycAuthor Commented:
Kaumed: You are great. Several MVP couldn't figure it out but you did. I hate that I could not give you 10000  points.
0
 
käµfm³d 👽Commented:
It's cool. Your appreciation is enough for me  = )

Glad it worked for you.
0
 
käµfm³d 👽Commented:
P.S.

I forgot to use my DELIMITER variable; you can either remove it or substitute its name for the two occurrences of the tab character in lines 28 & 30. Also, the call to Sort in line 24 is of no use. I was originally writing something a bit more low-level and sorting the list would have meant easier code. Since you said you were after Linq, the "ease" was even easier, and sorting the list was of no tangible benefit. You can safely remove the call to Sort, or you can leave it in. It shouldn't hurt either way. I just wanted to mention that it isn't crucial to the overall logic.
0
 
Carlos VillegasFull Stack .NET DeveloperCommented:
Oh now I see your point, hey you can give the points to kaufmed, if he solution works best for you, I have no problem with that.
0
 
Carlos VillegasFull Stack .NET DeveloperCommented:
Hey zhshqzyc only one thing please, I want to know if you can test both codes with your file and tell me how much time requires each one, I only want to know how much is the LINQ performance hit, if it exists.
0
 
käµfm³d 👽Commented:
This (paraphrased) change seems to run a tad faster than my previous offering:

...

while (!reader.EndOfStream)
{
    var columns = reader.ReadLine().Split(DELIMITER);
    int curCol = 1;
    System.Text.StringBuilder line = new System.Text.StringBuilder();

    columns.Aggregate(string.Empty, (accumulator, current) => line.Append((!list.Contains(curCol++) ? current + DELIMITER : string.Empty)).Length.ToString());

    writer.WriteLine(line);
}

...

Open in new window


It's a total misuse of the language, but it does seem to work   = )
0
 
Carlos VillegasFull Stack .NET DeveloperCommented:
@kaufmed, also you can write directly to the writer without the need of a StringBuilder instance.
0
 
käµfm³d 👽Commented:
also you can write directly to the writer without the need of a StringBuilder instance.
That wouldn't work for my example because inside of the Aggregate's body, whatever operation you do has to return a string. The Write* family of calls are all void returning functions, so there isn't an immediate way that I can see to return a string.

I could just be missing it, though  = )
0
 
Carlos VillegasFull Stack .NET DeveloperCommented:
Ah ok, not immediate, but you can use a helper method to do that :)
0
 
Carlos VillegasFull Stack .NET DeveloperCommented:
I dont know if I have explained well, I mean:
columns.Aggregate(string.Empty, (accumulator, current) => MyMethod(writer, (!list.Contains(curCol++) ? current : null));
...
static string MyMethod(StreamWriter writer, string current)
{
if (current != null)
  {
     writer.Write(current);
     writer.Write(DELIMITER);
  }
 return String.Empty;
}

Open in new window

But at the end I think we are losing the LINQ purpose here
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.