asked on
Need help parsing a CSV file with Memo Fields,
Hello, I'm using C# in Visual Studio 2022. I've come across an interesting situation in the parsing of a CSV file and would like to seek some advice on how to approach this matter.
My file contains multiple rows of comma delimited text as would any CSV file. The values in each row are not hard to split out using the .Split(" , ") method. Thus, I can split the values at each comma.
However, some of my rows contain "values" which are Memo fields. These Memo fields are encapsulated by quotes and some of the Memos contains commas within the quotes. So, it looks something like this:
"Item purchased on 01/23/2021 by employee: Smith, John" , $142.45 , "Jacksonville, FL" , "Item will be shipped to Dallas, TX" , "Status - Complete"
OK, so if I use the .Split(" , ") method, I would split several of the memo fields, which is not what I want to do. I'd like to split all the comma delimited values that are NOT enclosed or encapsulated in quotes.
In the end, you'd have the following values (for this example). I'm showing a "V#" for value number for explanations purposes only.
V1 = "Item purchased on 01/23/2021 by employee: Smith, John"
V2 = $142.45
V3 = "Jacksonville, FL"
V4 = "Item will be shipped to Dallas, TX"
V5 = "Status - Complete"
Where values 1, 3, 4 and 5 would be Memo fields. -- There are others, and its not always these values. So, there is no pattern to follow. Some rows have Meno fields, some do not.
So, in sudo-code, it would be something like this...
Separate all the values delimited by commas, unless the comma is contained within quotes.
Can anyone suggest a solution? This one is a bit tricky, but I'm actually looking forward to seeing some of the innovative suggestions for this one.
Thank you!
work on the 1st to get the values delimited
The 2nd is the memo
ASKER
1 Take a complete line. -- DONE
2 Check if there are quote marks. -- DONE (Most, maybe not all, but most of the lines will have at least 1 set of quotes).
3 If none you can shortcut with the split. -- These are few and far between.
4 If yes you need to take anything up to the first quote and split that part on commas. Then up to the next quote is kept complete. Then go back to step 2. <<<< OK, but this is my questions... how would you do that using my one line example?
ASKER
https://www.nuget.org/packages/CsvHelper/
Don't try to take shortcuts by splitting on characters. You'll end up with more problems. For example, handling an escaped quote inside of a quoted string means you have to ignore the escape character and read that following quote in as a part of the value.
ASKER
I had thought of maybe a RegEx expression, but not sure if that would work.
So, how does a "true CSV parser" really work.... its definitely not by splitting on characters. I'd like to learn more about those. I would be cool to write my own (given I could learn enough about them).
Challenges like this are fun. I once had to write a custom spell-checker for a program we were developing at work, and that was quote the challenge. Finally did it in my own clunky way, but heck... it worked. I also learned a lot and it was fun. Maybe this will be one of those challenges.
Stay well.
ASKER
Here is a link to the webpage that discussed the DLL reference.
https://stackoverflow.com/questions/2081418/parsing-csv-files-in-c-with-header#20523165
The code works great. It parses the quoted sections with leaving the comma in place. Excellent DLL.
Thank you all and I think I have a solution to my question.
Thanks again!
ASKER
ASKER
I do think however, it VB will be around for a few more years, so using the VB.DLL isn't totally a situation where my code is in great jeopardy.
Thanks for the input and great advice!
ASKER
Wrong. VBA is used in the MS products like MS Excel
VBA is not VB (which are not VB.NET - blame MS for the confusion)
and Gr8gonzo is talking about a dll that uses VB compatability. MS have not developed VB for quite a long time (over 15 years ???)
ASKER
Now, VB... all of in itself, I'm not too sure what that is. If its not VBA or VB.NET, I don't know that language.
Interesting...
VB is the original Visual Basic language and went up to version 6 and then stopped being developed but the runtime libraries still come pre-installed with Windows and technically they are still supported for now. Regular VB has nothing to do with .NET.
VBA is simply a scripting engine that uses VB syntax and some of the same mechanisms. So VBA is like VB "light" without the separate compiling step.
VB.NET is a way for developers who liked VB to write .NET code. So it's a way to write .NET code using VB syntax.
.NET is it's own language / framework. C# and VB.NET are simply two different syntaxes to write .NET applications. When you compile a VB.NET or C# project, they both compile to the same intermediate .NET language.
Fun fact: VB3 was my first "real" programming language (I don't count BASIC). It's where I learned object-oriented programming with my Dummy's Guide in hand almost 30 years ago.
ASKER
My first programming language was JCL (punch cards in college) then Fortran I think it was Fortran_4 and then Fortran_77, if I recall correctly. Then I moved onto C, VB.NET, C# a little bit of C++ and then Objective C when the iPhones hit the market, then Swift. Along the way I've dabbled in other languages and scripting tools, but I actually like C# a lot. It does what I need and its not to difficult to master.
I had thought of maybe a RegEx expression, but not sure if that would work.Sure, it would work.
class CsvReader
{
string separator;
public string Separator
{
get => separator;
[MemberNotNull(nameof(separator),nameof(splitRegex))]
set
{
separator = value;
splitRegex = new(@$"\G(?:""(?<quoted>(?:[^""]|"""")*)""|(?<unquoted>[^{Regex.Escape(separator)}]*))(?:[{Regex.Escape(separator)}]|$)");
}
}
Regex splitRegex;
public CsvReader() : this(",") { }
public CsvReader(string separator) => Separator = separator;
public IEnumerable<string[]> Read(string path)
{
foreach (var line in File.ReadAllLines(path))
{
yield return splitRegex.Matches(line)
.Select(m => m.Groups["quoted"].Success
? m.Groups["quoted"].Value.Replace("\"\"", "\"")
: m.Groups["unquoted"].Value)
.ToArray();
}
}
}
ASKER
Sure, [regex] would work.Not on any complex data. That code will break if you have fields that span onto multiple lines, or if you have several escaped quotes inside the value, or escaped escape chars at the end of a field.
It also performs slightly slower in general than a regular parser because of the overhead of the regex (accounting for a single loop through the resulting data - not just the immediate return due to the yield).
I compared the speed of the CsvReader class, calling Read and then calling ToArray() on a 91k record CSV file (a browscap.csv file) to the CSV class I suggested, reading AND populating a DataTable.
The result:
CsvReader: 91k records in 2.82 seconds (ending in an array of string arrays)
CSV: 91k records in 2.63 seconds (ending in a DataTable)
ASKER
The big catch is proper timing - the "yield" mechanism can make it seem like it runs extremely fast, because it's not -really- running at that point, but rather saying, "When I do run it, it'll run like this." So if you time the Read() method on the CsvReader class directly, it'll seem like it runs a huge file in milliseconds, but it's simply setting up an execution "placeholder" of sorts, so you have to loop through the result in order to see the accurate results.
Regexes are great tools, but are usually not the right option when it comes to parsing complex data structures that involves parsing "states" (CSV, HTML, XML, etc). .
if you have several escaped quotes inside the valueI didn't take a lot of time to think about it, so it can probably be improved, but I'm sure that regex handles escaped quotes.
I didn't think about fields spreading over multiple lines.
I didn't intend it as a competition over your csv class. Of course a dedicated parser is better than a regex. But thank you for your performance test; I'm happy it loses only by .2 seconds.
calling Read and then calling ToArray()In real code, I wouldn't call ToArray(), I would work with the lines one after the other. Creating and filling the array is not good for performance.
reading AND populating a DataTableI'm not sure populating a DataTable is the best thing to do with the data.
ASKER
In the end, both or either solutions would work for my particular case, because my CSV file is simple. One line, per record. My CSV file doesn't span multiple lines, but Gr8gonzo has a good point. He is looking at a more longterm solution that if a CSV file with multiple records were to be used, the "working" application would suddenly break... "but, why... it was working just a minute ago...." And there in lies the point of the matter. That break may come 3 years into using the application, when you as a programmer are long gone and in a different part of the company or in a different company all together. So, although it wasn't a competition, Gr8gonzo brought out a good point and one we should all be thinking about. I'm not a programmer by profession, but have worked on some of the most technical projects most people would only dream of working on, and I know "bad things happen" - avoiding those bad things by thinking outside the box is a talent we should all aspire to.
Now, having said that, both solutions were excellent work and I thank you both for your help with my matter. You are both gifted programmers and I always welcome your input.
Thank you both.
I apologize - I wasn't trying to suggest a competition, and I can certainly see how my comment would come across as if I was saying that. My primary point was to say that regexes weren't a good way to approach CSV parsing, for the sake of generic accuracy and performance. It wasn't to try and say my code was special/better or anything (mine will run at about the same speed as any other CSV parser).
In real code, I wouldn't call ToArray(), I would work with the lines one after the other.I understand and the point of ToArray() was simply to trigger the evaluation of the yield-ed expression so that instead of timing the performance of the yield, it was timing the performance of the parsing.
In any event, my apologies for sounding competitive - that wasn't my intention at all. I was simply trying to provide a commentary on the regex parsing method.
1 Take a complete line.
2 Check if there are quote marks.
3 If none you can shortcut with the split.
4 If yes you need to take anything up to the first quote and split that part on commas. Then up to the next quote is kept complete. Then go back to step 2