Link to home
Start Free TrialLog in
Avatar of rwheeler23
rwheeler23Flag for United States of America

asked on

VS C# Strings with quote help

I have a troublesome text file I am trying to import into SQL using SSIS. I may try to write a VS C# application to correct the fields before I insert them in to SQL but here is the problem.

"The rain in Spain stays mainly in the "Plains"" or
"The rain in Spain stays mainly in the "Plains" instead of just
"The rain in Spain stays mainly in the Plains"

Out of 50,000 + lines there are about 100 where the number of quotation marks is messing up the import. Can anyone point to some code or provide hints as to how I can insure that every string field starts and ends with a quotation mark and has no other quotation marks in between the beginning and the end?
Avatar of Dmitry G
Dmitry G
Flag of New Zealand image

Something like this:

        private void button1_Click(object sender, EventArgs e)
        {
            string s1 = "The rain in Spain stays mainly in the \"Plains\"";
            string s2 = "The rain in Spain stays mainly in the \"Plains";

            string s = "";

            s = "\"" + s1.Replace("\"", string.Empty) + "\"";
            System.Diagnostics.Debug.WriteLine(s);
            s = "\"" + s2.Replace("\"", string.Empty) + "\"";
            System.Diagnostics.Debug.WriteLine(s);

        }

Open in new window


The problem is that performance could be not too good . Some tricks might be required to speed up the algorithm.

E.g., first - check if quotation character exists starting from index 1 to length-2 including. If yes - do above. But still  might be slow.
Avatar of rwheeler23

ASKER

Performance is not going to be an issue here. I am doing a one time clean up of old data and have to mop it up before transfer into a new database. I thing my biggest challenge will be setting up the streamreader so that it can read each field separately so your code has the correct string value upon which to work.
Again, it's not clear what the source is? Is it a plain text file. Is each "field" a separate line? Etc. For example, when I deal with my import csv files, I read and process text line by line. But I don't know what structure your source file has...


BTW, I usually use text reader class.
What comes out of this old software is a comma delimited text. The problem is the embedded quotation marks. There will be values like:

"Mr. Mark O"Reilly's rental care fee" or
"Move office to new "Location""

Where the number of quotations marks within the quotation marks varies. The CR/LF combination is at the end of each line of this plain text file.

Here is an actual sample line. This line is fine. It is the ones with the embedded quotation marks that cause grief.

Customer ID,Customer Name,Invoice/CM #,Apply to Invoice Number,Date,Customer PO,Ship Via,Date Due,Discount Amount,Discount Date,Sales Tax ID,Quantity,Item ID,Description,Unit Price,Amount
5013,"UltraGroup Consutling, LLC",111,,6/15/99,,Courier,7/15/99,0.00,6/25/99,,1.00,3700,Merger and Acquistion Service,150.00,-150.00
There are no big challenges  in splitting lines to fields if you have right number of quotation marks. Should be pretty straightforward. But there are some issues.
This is csv. Some fields may contain commas that are not field separators. As a rule such commas sit between quotation marks, as in your case:
"UltraGroup Consutling, LLC"
These commas are to be excluded and we can use string.Split.

On the other hand, if your field has not even number of quotation marks it is pretty distinguish between such commas.
If we have
... ,"UltraGroup Consutling, LLC,

we are in trouble. In our application we discard such records as not valid - just "bad data".
Well I have seen things like:

"Bill O"Reilly prefers you not bloviate" - Would this be a problem?
"Fee for tent "assembly"" - Would this be a problem?
"Fee for tent "assembly" - This looks like a problem?
"Los Angeles, CA 91245"  -  This should be fine

In the file I have never seen just one quotation mark. It always begins and ends with one. It is what happens in between that is the problem.
ASKER CERTIFIED SOLUTION
Avatar of Dmitry G
Dmitry G
Flag of New Zealand image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Is there anyway to adjust the code I have attached to conquer this problem? Would switching to string.split make any difference? If not, does this mean I need to resort to reading each line in in its entirety and then parse each line character by character and proceed as previously mentioned?
TextFieldParser.txt
Hmmm... Honestly, I never used this parser. I don't like to use VB namespace despite I work in VB! This parser is quite primitive. At least, it does not handle cases like:

"Fee for tent "assembly""


For me it is valid case. I'd implement parser myself, it's not too hard.

String.split does not work here...
I made a minor change to the code by inserting a try/catch combination. I just ran this on the largest csv file I have and the one with the most embedded quotation marks. Out of almost 100,000 records there were less than two dozen of these. For now, what I will do is just log the record numbers of the bad ones and then use NOTEPAD to fix then until they all pass. Then when time permitting I will follow your suggestion to just read the entire line and parse it looking for errant quotation marks. The catch caught all the ones with the emdedded quotation marks.
TextFieldParser.txt
This makes sense, well done!
Thank you for all your help!