[Okta Webinar] Learn how to a build a cloud-first strategyRegister Now

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 256
  • Last Modified:

Drop specific columns

Hello, I have a flat file which has many columns, say 200000 columns.
The file is big, say 200MB~2GB.

I want to create a new file but exclude some specific columns.
For example:
List<int> list = new List<int>();
list.Add(5);
list.Add(34);
list.Add(222);

...

Open in new window

I want to remove these columns in list.
Is there a fast way? I used to use parse the text file then split each line to an array...
0
zhshqzyc
Asked:
zhshqzyc
  • 9
  • 7
  • 6
1 Solution
 
käµfm³d 👽Commented:
The file is big, say 200MB~2GB.
That sounds like you'd want to use a StreamReader.

I want to remove these columns in list.
What delimits a "column"?
0
 
zhshqzycAuthor Commented:
tab delimiter.
Can I use DataTable?
0
 
käµfm³d 👽Commented:
Can I use DataTable?
You can, but I don't think you'll see a speed benefit.

For your purposes, do columns "index" start with 0 or 1? For example, you have list.Add(5);. Is this the 5th or 6th column in your file?
0
Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
zhshqzycAuthor Commented:
It does not matter. Index can be zero based.
0
 
Carlos VillegasFull Stack .NET DeveloperCommented:
Hello zhshqzyc, I have made this example, I think you can use it:
string myOriginalFile = @"C:\Temp\OriginalFile.txt";
string myNewFile = @"C:\Temp\NewFile.txt";

System.Collections.Generic.HashSet<int> columnToRemove = new System.Collections.Generic.HashSet<int>();
// Index is zero based.
columnToRemove.Add(5);
columnToRemove.Add(34);
columnToRemove.Add(222);

using (System.IO.StreamReader sr = new System.IO.StreamReader(myOriginalFile))
{
    using (System.IO.StreamWriter sw = new System.IO.StreamWriter(myNewFile))
    {
        string textLine = sr.ReadLine();
        object[] values = null;
        bool appendTab = false;
        while (textLine != null)
        {
            values = textLine.Split('\t');
            appendTab = false;
            for (int i = 0; i < values.Length; i++)
            {
                if (!columnToRemove.Contains(i))
                {
                    if (appendTab)
                        sw.Write('\t');

                    sw.Write(values[i]);

                    if (!appendTab)
                        appendTab = true;
                }
            }
            sw.WriteLine();
            textLine = sr.ReadLine();
        }
    }
}

Open in new window


Please, tell me how it performs with your BIG file.
0
 
zhshqzycAuthor Commented:
Ok, but if linq could works on it, it would be great.
0
 
Carlos VillegasFull Stack .NET DeveloperCommented:
Hello, in what way do you wish to use this with LINQ?
0
 
zhshqzycAuthor Commented:
Like Skip(), Take() something. But I think that it is almost impossible.
0
 
Carlos VillegasFull Stack .NET DeveloperCommented:
For example, Skip(5) will skip the first 5 columns?
0
 
zhshqzycAuthor Commented:
No, Skip(5) means skip the 5th column(0 based index) in Linq.
0
 
Carlos VillegasFull Stack .NET DeveloperCommented:
Are you sure how to work with LINQ? what you have said does not makes sense to me...
You can find some LINQ examples here:
http://msdn.microsoft.com/en-us/vcsharp/aa336746
0
 
käµfm³d 👽Commented:
Are you after something like this?

using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace _27405384
{
    class Program
    {
        static void Main(string[] args)
        {
            const char DELIMITER = '\t';

            using (StreamReader reader = new StreamReader(@"C:\path\to\file.txt"))
            {
                using (StreamWriter writer = new StreamWriter("output.txt"))
                {
                    List<int> list = new List<int>()
                    {
                        5,
                        34,
                        222,
                    };

                    list.Sort();

                    while (!reader.EndOfStream)
                    {
                        var columns = reader.ReadLine().Split('\t');
                        int curCol = 1;
                        string line = columns.Aggregate(string.Empty, (accumulator, current) => accumulator += (!list.Contains(curCol++) ? current + '\t' : string.Empty));

                        writer.WriteLine(line);
                    }
                }
            }
        }
    }
}

Open in new window

0
 
zhshqzycAuthor Commented:
Kaumed: You are great. Several MVP couldn't figure it out but you did. I hate that I could not give you 10000  points.
0
 
käµfm³d 👽Commented:
It's cool. Your appreciation is enough for me  = )

Glad it worked for you.
0
 
käµfm³d 👽Commented:
P.S.

I forgot to use my DELIMITER variable; you can either remove it or substitute its name for the two occurrences of the tab character in lines 28 & 30. Also, the call to Sort in line 24 is of no use. I was originally writing something a bit more low-level and sorting the list would have meant easier code. Since you said you were after Linq, the "ease" was even easier, and sorting the list was of no tangible benefit. You can safely remove the call to Sort, or you can leave it in. It shouldn't hurt either way. I just wanted to mention that it isn't crucial to the overall logic.
0
 
Carlos VillegasFull Stack .NET DeveloperCommented:
Oh now I see your point, hey you can give the points to kaufmed, if he solution works best for you, I have no problem with that.
0
 
Carlos VillegasFull Stack .NET DeveloperCommented:
Hey zhshqzyc only one thing please, I want to know if you can test both codes with your file and tell me how much time requires each one, I only want to know how much is the LINQ performance hit, if it exists.
0
 
käµfm³d 👽Commented:
This (paraphrased) change seems to run a tad faster than my previous offering:

...

while (!reader.EndOfStream)
{
    var columns = reader.ReadLine().Split(DELIMITER);
    int curCol = 1;
    System.Text.StringBuilder line = new System.Text.StringBuilder();

    columns.Aggregate(string.Empty, (accumulator, current) => line.Append((!list.Contains(curCol++) ? current + DELIMITER : string.Empty)).Length.ToString());

    writer.WriteLine(line);
}

...

Open in new window


It's a total misuse of the language, but it does seem to work   = )
0
 
Carlos VillegasFull Stack .NET DeveloperCommented:
@kaufmed, also you can write directly to the writer without the need of a StringBuilder instance.
0
 
käµfm³d 👽Commented:
also you can write directly to the writer without the need of a StringBuilder instance.
That wouldn't work for my example because inside of the Aggregate's body, whatever operation you do has to return a string. The Write* family of calls are all void returning functions, so there isn't an immediate way that I can see to return a string.

I could just be missing it, though  = )
0
 
Carlos VillegasFull Stack .NET DeveloperCommented:
Ah ok, not immediate, but you can use a helper method to do that :)
0
 
Carlos VillegasFull Stack .NET DeveloperCommented:
I dont know if I have explained well, I mean:
columns.Aggregate(string.Empty, (accumulator, current) => MyMethod(writer, (!list.Contains(curCol++) ? current : null));
...
static string MyMethod(StreamWriter writer, string current)
{
if (current != null)
  {
     writer.Write(current);
     writer.Write(DELIMITER);
  }
 return String.Empty;
}

Open in new window

But at the end I think we are losing the LINQ purpose here
0

Featured Post

Keep up with what's happening at Experts Exchange!

Sign up to receive Decoded, a new monthly digest with product updates, feature release info, continuing education opportunities, and more.

  • 9
  • 7
  • 6
Tackle projects and never again get stuck behind a technical roadblock.
Join Now