[Okta Webinar] Learn how to a build a cloud-first strategyRegister Now

x
• Status: Solved
• Priority: Medium
• Security: Public
• Views: 256
• Last Modified:

Drop specific columns

Hello, I have a flat file which has many columns, say 200000 columns.
The file is big, say 200MB~2GB.

I want to create a new file but exclude some specific columns.
For example:
List<int> list = new List<int>();
list.Add(5);
list.Add(34);
list.Add(222);

...

I want to remove these columns in list.
Is there a fast way? I used to use parse the text file then split each line to an array...
0
zhshqzyc
Asked:
• 9
• 7
• 6
1 Solution

Commented:
The file is big, say 200MB~2GB.
That sounds like you'd want to use a StreamReader.

I want to remove these columns in list.
What delimits a "column"?
0

Author Commented:
tab delimiter.
Can I use DataTable?
0

Commented:
Can I use DataTable?
You can, but I don't think you'll see a speed benefit.

For your purposes, do columns "index" start with 0 or 1? For example, you have list.Add(5);. Is this the 5th or 6th column in your file?
0

Author Commented:
It does not matter. Index can be zero based.
0

Full Stack .NET DeveloperCommented:
Hello zhshqzyc, I have made this example, I think you can use it:
string myOriginalFile = @"C:\Temp\OriginalFile.txt";
string myNewFile = @"C:\Temp\NewFile.txt";

System.Collections.Generic.HashSet<int> columnToRemove = new System.Collections.Generic.HashSet<int>();
// Index is zero based.
columnToRemove.Add(5);
columnToRemove.Add(34);
columnToRemove.Add(222);

using (System.IO.StreamReader sr = new System.IO.StreamReader(myOriginalFile))
{
using (System.IO.StreamWriter sw = new System.IO.StreamWriter(myNewFile))
{
string textLine = sr.ReadLine();
object[] values = null;
bool appendTab = false;
while (textLine != null)
{
values = textLine.Split('\t');
appendTab = false;
for (int i = 0; i < values.Length; i++)
{
if (!columnToRemove.Contains(i))
{
if (appendTab)
sw.Write('\t');

sw.Write(values[i]);

if (!appendTab)
appendTab = true;
}
}
sw.WriteLine();
textLine = sr.ReadLine();
}
}
}


Please, tell me how it performs with your BIG file.
0

Author Commented:
Ok, but if linq could works on it, it would be great.
0

Full Stack .NET DeveloperCommented:
Hello, in what way do you wish to use this with LINQ?
0

Author Commented:
Like Skip(), Take() something. But I think that it is almost impossible.
0

Full Stack .NET DeveloperCommented:
For example, Skip(5) will skip the first 5 columns?
0

Author Commented:
No, Skip(5) means skip the 5th column(0 based index) in Linq.
0

Full Stack .NET DeveloperCommented:
Are you sure how to work with LINQ? what you have said does not makes sense to me...
You can find some LINQ examples here:
http://msdn.microsoft.com/en-us/vcsharp/aa336746
0

Commented:
Are you after something like this?

using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace _27405384
{
class Program
{
static void Main(string[] args)
{
const char DELIMITER = '\t';

using (StreamReader reader = new StreamReader(@"C:\path\to\file.txt"))
{
using (StreamWriter writer = new StreamWriter("output.txt"))
{
List<int> list = new List<int>()
{
5,
34,
222,
};

list.Sort();

while (!reader.EndOfStream)
{
var columns = reader.ReadLine().Split('\t');
int curCol = 1;
string line = columns.Aggregate(string.Empty, (accumulator, current) => accumulator += (!list.Contains(curCol++) ? current + '\t' : string.Empty));

writer.WriteLine(line);
}
}
}
}
}
}

0

Author Commented:
Kaumed: You are great. Several MVP couldn't figure it out but you did. I hate that I could not give you 10000  points.
0

Commented:
It's cool. Your appreciation is enough for me  = )

Glad it worked for you.
0

Commented:
P.S.

I forgot to use my DELIMITER variable; you can either remove it or substitute its name for the two occurrences of the tab character in lines 28 & 30. Also, the call to Sort in line 24 is of no use. I was originally writing something a bit more low-level and sorting the list would have meant easier code. Since you said you were after Linq, the "ease" was even easier, and sorting the list was of no tangible benefit. You can safely remove the call to Sort, or you can leave it in. It shouldn't hurt either way. I just wanted to mention that it isn't crucial to the overall logic.
0

Full Stack .NET DeveloperCommented:
Oh now I see your point, hey you can give the points to kaufmed, if he solution works best for you, I have no problem with that.
0

Full Stack .NET DeveloperCommented:
Hey zhshqzyc only one thing please, I want to know if you can test both codes with your file and tell me how much time requires each one, I only want to know how much is the LINQ performance hit, if it exists.
0

Commented:
This (paraphrased) change seems to run a tad faster than my previous offering:

...

while (!reader.EndOfStream)
{
var columns = reader.ReadLine().Split(DELIMITER);
int curCol = 1;
System.Text.StringBuilder line = new System.Text.StringBuilder();

columns.Aggregate(string.Empty, (accumulator, current) => line.Append((!list.Contains(curCol++) ? current + DELIMITER : string.Empty)).Length.ToString());

writer.WriteLine(line);
}

...


It's a total misuse of the language, but it does seem to work   = )
0

Full Stack .NET DeveloperCommented:
@kaufmed, also you can write directly to the writer without the need of a StringBuilder instance.
0

Commented:
also you can write directly to the writer without the need of a StringBuilder instance.
That wouldn't work for my example because inside of the Aggregate's body, whatever operation you do has to return a string. The Write* family of calls are all void returning functions, so there isn't an immediate way that I can see to return a string.

I could just be missing it, though  = )
0

Full Stack .NET DeveloperCommented:
Ah ok, not immediate, but you can use a helper method to do that :)
0

Full Stack .NET DeveloperCommented:
I dont know if I have explained well, I mean:
columns.Aggregate(string.Empty, (accumulator, current) => MyMethod(writer, (!list.Contains(curCol++) ? current : null));
...
static string MyMethod(StreamWriter writer, string current)
{
if (current != null)
{
writer.Write(current);
writer.Write(DELIMITER);
}
return String.Empty;
}

But at the end I think we are losing the LINQ purpose here
0

Featured Post

• 9
• 7
• 6
Tackle projects and never again get stuck behind a technical roadblock.