Link to home
Start Free TrialLog in
Avatar of Mr_Fulano
Mr_FulanoFlag for United States of America

asked on

Need help parsing a CSV file with Memo Fields,

Hello, I'm using C# in Visual Studio 2022.  I've come across an interesting situation in the parsing of a CSV file and would like to seek some advice on how to approach this matter. 


My file contains multiple rows of comma delimited text as would any CSV file.  The values in each row are not hard to split out using the .Split(" , ") method. Thus, I can split the values at each comma.


However, some of my rows contain "values" which are Memo fields. These Memo fields are encapsulated by quotes and some of the Memos contains commas within the quotes. So, it looks something like this:


"Item purchased on 01/23/2021 by employee: Smith, John" , $142.45 , "Jacksonville, FL" , "Item will be shipped to Dallas, TX" , "Status - Complete"

Open in new window

OK, so if I use the .Split(" , ") method,  I would split several of the memo fields, which is not what I want to do.  I'd like to split all the comma delimited values that are NOT enclosed or encapsulated in quotes. 


In the end, you'd have the following values (for this example). I'm showing a "V#" for value number for explanations purposes only. 

V1 = "Item purchased on 01/23/2021 by employee: Smith, John" 
V2 = $142.45 
V3 = "Jacksonville, FL" 
V4 = "Item will be shipped to Dallas, TX" 
V5 = "Status - Complete"

Open in new window

Where values 1, 3, 4 and 5 would be Memo fields.  -- There are others, and its not always these values.  So, there is no pattern to follow. Some rows have Meno fields, some do not. 


So, in sudo-code, it would be something like this... 

Separate all the values delimited by commas, unless the comma is contained within quotes. 


Can anyone suggest a solution?  This one is a bit tricky, but I'm actually looking forward to seeing some of the innovative suggestions for this one.


Thank you! 



Avatar of AndyAinscow
AndyAinscow
Flag of Switzerland image

You might have to do it the hard way.
1 Take a complete line.
2 Check if there are quote marks.
3 If none you can shortcut with the split.
4 If yes you need to take anything up to the first quote and split that part on commas.  Then up to the next quote is kept complete.  Then go back to step 2
Since you have the quotes you could split on the quotes .....you should get 2 parts
work on the 1st to get the values delimited
The 2nd is the memo
Avatar of Mr_Fulano

ASKER

Hi Andy, I've done some of the things you suggested.

1 Take a complete line.  -- DONE
2 Check if there are quote marks.  -- DONE (Most, maybe not all, but most of the lines will have at least 1 set of quotes).
3 If none you can shortcut with the split.  -- These are few and far between.
4 If yes you need to take anything up to the first quote and split that part on commas.  Then up to the next quote is kept complete.  Then go back to step 2. <<<< OK, but this is my questions... how would you do that using my one line example?
Hi John, so to split on the quotes, I would need to get the entire encapsulated quote, not just on 1 quotation mark. I would need to split on something like ---  Split( ' "any characters between" ' ), but I'm not too sure how to do that. Could you give me an example?
You need a true CSV parser. I wrote one but I'm not in front of my PC at the moment. Alternatively, there are many others, including this one available from NuGet which should make it trivial to install into your project:

https://www.nuget.org/packages/CsvHelper/

Don't try to take shortcuts by splitting on characters. You'll end up with more problems. For example, handling an escaped quote inside of a quoted string means you have to ignore the escape character and read that following quote in as a part of the value.
Hi Gr8Gonzo, hope all is well. Yes, splitting on specific characters can become a challenge, given there is no telling what will be contained in the strings for each "value."  

I had thought of maybe a RegEx expression, but not sure if that would work.

So, how does a "true CSV parser" really work.... its definitely not by splitting on characters. I'd like to learn more about those. I would be cool to write my own (given I could learn enough about them).

Challenges like this are fun. I once had to write a custom spell-checker for a program we were developing at work, and that was quote the challenge. Finally did it in my own clunky way, but heck... it worked.  I also learned a lot and it was fun.  Maybe this will be one of those challenges.

Stay well. 
So, the solution was easier than I ever expected.... Wow, chalk one up for Microsoft.  MS has included a CSV parser into VS that works great for C#.  Please see below...

User generated image
Here is a link to the webpage that discussed the DLL reference.  
https://stackoverflow.com/questions/2081418/parsing-csv-files-in-c-with-header#20523165

The code works great. It parses the quoted sections with leaving the comma in place. Excellent DLL.

Thank you all and I think I have a solution to my question.

Thanks again!
ASKER CERTIFIED SOLUTION
Avatar of Mr_Fulano
Mr_Fulano
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Just realized that code is a little old (there's a newer version with TPL-style async syntax), but it still works
Hi Gr8gonzo, I took a look at your code... WOW. Very nice!!! Very elegant coding.  Thank you so much for sharing. I learn a lot from looking at how truly skilled people do their work. Excellent learning example - Thank you!
Hi Gr8gonzo, I agree with you on the VB.NET issue. I used to code in VB.NET many, man-years ago. It was easy to learn and since I am self taught, it wasn't too hard to get to the point I could build simple application. That was about 15 years ago, and I moved onto C# since then. I do however, think that MS will eventually phase VB out. It's not a bad language, but I don't see the same level of progress moving it forward as I do in C# and other languages.

I do think however, it VB will be around for a few more years, so using the VB.DLL isn't totally a situation where my code is in great jeopardy.

Thanks for the input and great advice!  
VB and VB.NET are two very different languages.  It sounds like you are confusing the two.
HI Andy, .. I just abbreviated the name in my post. I used to code in VB.NET.  I do know that VB and VB.NET are two different languages. VB is used in the MS products like MS Excel and such. VB.NET is a VS language. Is that not what Gr8gonzo was speaking about. I would assume so, since its a DLL used in the DOT.NET environment. 
>>VB is used in the MS products like MS Excel
Wrong.  VBA is used in the MS products like MS Excel

VBA is not VB (which are not VB.NET - blame MS for the confusion)

and Gr8gonzo is talking about a dll that uses VB compatability.  MS have not developed VB for quite a long time (over 15 years ???)
OK, so it seems there is yet another "VB" language. I've used VBA, which I dislike almost as much as I do PowerShell. I've also coded a lot in VB.NET, which I used to use about 15 years ago.

Now, VB... all of in itself, I'm not too sure what that is. If its not VBA or VB.NET, I don't know that language.

Interesting... 
I think most of this is a moot point now but just to weigh in on VBA vs VB vs VB.NET...

VB is the original Visual Basic language and went up to version 6 and then stopped being developed but the runtime libraries still come pre-installed with Windows and technically they are still supported for now. Regular VB has nothing to do with .NET.

VBA is simply a scripting engine that uses VB syntax and some of the same mechanisms. So VBA is like VB "light" without the separate compiling step.

VB.NET is a way for developers who liked VB to write .NET code. So it's a way to write .NET code using VB syntax.

.NET is it's own language / framework. C# and VB.NET are simply two different syntaxes to write .NET applications. When you compile a VB.NET or C# project, they both compile to the same intermediate .NET language.

Fun fact: VB3 was my first "real" programming language (I don't count BASIC). It's where I learned object-oriented programming with my Dummy's Guide in hand almost 30 years ago.
Thank you Gr8gonzo, you ALWAYS explain things in such a well paced and educated way. Clearly, your technical knowledge of these topics are evident, but you're never ever condescending, as are some who think they "just simply know more than most"...  I always look forward to reading your posts and I've learned a HUGE amount from you. Thank you for your patience with questions like mine and thank you for your consistent help.

My first programming language was JCL (punch cards in college) then Fortran I think it was Fortran_4 and then Fortran_77, if I recall correctly. Then I moved onto C, VB.NET, C# a little bit of C++ and then Objective C when the iPhones hit the market, then Swift. Along the way I've dabbled in other languages and scripting tools, but I actually like C# a lot. It does what I need and its not to difficult to master. 
Avatar of louisfr
louisfr

I had thought of maybe a RegEx expression, but not sure if that would work.
Sure, it would work.
class CsvReader
{
    string separator;
    public string Separator
    {
        get => separator;
        [MemberNotNull(nameof(separator),nameof(splitRegex))]
        set
        {
            separator = value;
            splitRegex = new(@$"\G(?:""(?<quoted>(?:[^""]|"""")*)""|(?<unquoted>[^{Regex.Escape(separator)}]*))(?:[{Regex.Escape(separator)}]|$)");
        }
    }

    Regex splitRegex;

    public CsvReader() : this(",") { }
    public CsvReader(string separator) => Separator = separator;

    public IEnumerable<string[]> Read(string path)
    {
        foreach (var line in File.ReadAllLines(path))
        {
            yield return splitRegex.Matches(line)
                .Select(m => m.Groups["quoted"].Success
                                ? m.Groups["quoted"].Value.Replace("\"\"", "\"")
                                : m.Groups["unquoted"].Value)
                .ToArray();
        }
    }
}

Open in new window

HI Louisfr.... Hmmmm very interesting. I'm going to give this a try. Thanks!
Sure, [regex] would work.
Not on any complex data. That code will break if you have fields that span onto multiple lines, or if you have several escaped quotes inside the value, or escaped escape chars at the end of a field.

It also performs slightly slower in general than a regular parser because of the overhead of the regex (accounting for a single loop through the resulting data - not just the immediate return due to the yield).

I compared the speed of the CsvReader class, calling Read and then calling ToArray() on a 91k record CSV file (a browscap.csv file) to the CSV class I suggested, reading AND populating a DataTable.

The result:

CsvReader: 91k records in 2.82 seconds (ending in an array of string arrays)
CSV: 91k records in 2.63 seconds (ending in a DataTable)
Hi Gr8gonzo... wow, interesting. I was just curious if it would work. I've used RegEx a bit in the past, but wasn't sure it would or would not work here... guess we now have that answer. 
It would work on simple CSV data, where there's only one line per record, and there aren't excessive embedded quotes.

The big catch is proper timing - the "yield" mechanism can make it seem like it runs extremely fast, because it's not -really- running at that point, but rather saying, "When I do run it, it'll run like this." So if you time the Read() method  on the CsvReader class directly, it'll seem like it runs a huge file in milliseconds, but it's simply setting up an execution "placeholder" of sorts, so you have to loop through the result in order to see the accurate results.

Regexes are great tools, but are usually not the right option when it comes to parsing complex data structures that involves parsing "states" (CSV, HTML, XML, etc). .
if you have several escaped quotes inside the value
I didn't take a lot of time to think about it, so it can probably be improved, but I'm sure that regex handles escaped quotes.
I didn't think about fields spreading over multiple lines.

I didn't intend it as a competition over your csv class. Of course a dedicated parser is better than a regex. But thank you for your performance test; I'm happy it loses only by .2 seconds.

calling Read and then calling ToArray()
In real code, I wouldn't call ToArray(), I would work with the lines one after the other. Creating and filling the array is not good for performance.
reading AND populating a DataTable
I'm not sure populating a DataTable is the best thing to do with the data.
Gentlemen, thank you both for your wealth of knowledge.  Clearly, there are multiple good ways to do any given task. You two gentlemen have found different ways and that's great, because it builds innovation and adds to one's knowledge base.

In the end, both or either solutions would work for my particular case, because my CSV file is simple. One line, per record. My CSV file doesn't span multiple lines, but Gr8gonzo has a good point. He is looking at a more longterm solution that if a CSV file with multiple records were to be used, the "working" application would suddenly break... "but, why... it was working just a minute ago...."  And there in lies the point of the matter. That break may come 3 years into using the application, when you as a programmer are long gone and in a different part of the company or in a different company all together. So, although it wasn't a competition, Gr8gonzo brought out a good point and one we should all be thinking about. I'm not a programmer by profession, but have worked on some of the most technical projects most people would only dream of working on, and I know "bad things happen" - avoiding those bad things by thinking outside the box is a talent we should all aspire to.

Now, having said that, both solutions were excellent work and I thank you both for your help with my matter. You are both gifted programmers and I always welcome your input.

Thank you both.  
@louisfr
I apologize - I wasn't trying to suggest a competition, and I can certainly see how my comment would come across as if I was saying that. My primary point was to say that regexes weren't a good way to approach CSV parsing, for the sake of generic accuracy and performance. It wasn't to try and say my code was special/better or anything (mine will run at about the same speed as any other CSV parser).

In real code, I wouldn't call ToArray(), I would work with the lines one after the other.
I understand and the point of ToArray() was simply to trigger the evaluation of the yield-ed expression so that instead of timing the performance of the yield, it was timing the performance of the parsing.

In any event, my apologies for sounding competitive - that wasn't my intention at all. I was simply trying to provide a commentary on the regex parsing method.