Erik Hauser (United States) asked:
Restoring Missing Column Delimiters in SSIS

Using a script task or by using a regular expression, is there a way in SSIS that I can parse a flat file, have it find places where there are missing column delimiters and fix them?

For example:

A file with - column delimiter | (pipe) and text qualifier " (double-quote) may come across a value like

|"1"2.5"| which should be |"1"|"2.5"|  (an |" was added).

I also realize that I could use the REPLACE function within a Derived Column Task to do the same thing but I am unable to figure out the correct function that I would use.

Can someone get me started?

Thanks,
Erik
SOLUTION
Megan Brooks (United States)
ASKER
I apologize... that is correct, I did add a pipe and a double quote. I originally learned the C suite but it's been a while.
SOLUTION
Jim Horn (United States)
ASKER CERTIFIED SOLUTION
@Jim, thank you for your comment and I do agree. If the data didn't derive from BLOB data, I would be all over the programmer for outputting it this way.

@Megan, thank you again as you make very good points. I guess I was assuming this might be easier than what I was making it out to be.  As for the files, there are 20-30 of them and they may not always have the same issue(s). I have not looked through them all but I was trying to do my research to give my boss(es) options when they schedule a meeting about how we are to process this data.

As for the answer to your question, right now the only issue that I have repeatedly run into is situations like the one in the question, where an |" is missing, causing an SSIS error. I would need to search for |"(n)"(n)"|, where (n) could be any value.
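As a sketch of what that search could look like in a script task (this is illustrative only, not code posted by any expert here), a .NET regex with two capture groups can restore the missing |" for exactly this pattern. Note that the SSIS Derived Column expression language has no regex support, so something like this would have to live in a Script task or component:

```csharp
using System;
using System.Text.RegularExpressions;

class DelimiterRestore
{
    // Restores a missing |" in the pattern |"(n)"(n)"| described above,
    // e.g. |"1"2.5"| -> |"1"|"2.5"|.
    // Assumes (n) contains no pipes or embedded double quotes; two broken
    // fields sharing one delimiter would need lookarounds instead.
    public static string Fix(string line)
    {
        return Regex.Replace(line, "\\|\"([^\"|]*)\"([^\"|]*)\"\\|", "|\"$1\"|\"$2\"|");
    }

    static void Main()
    {
        Console.WriteLine(Fix("|\"1\"2.5\"|")); // |"1"|"2.5"|
    }
}
```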
I think I see what the omission is (and by the way, I have worked with worse data than this, always from government sources, where it was hand-corrected after machine generation, with added typos). The question I am wondering about, however, is whether you would be able to unambiguously fix all the omissions using a simple search and replace, whether more complex rules would be needed, or perhaps whether there is no way to do it automatically at all.
You may very well be correct, Megan. This data is not going to be clean and I may have to deal with different things on a case-by-case basis. I just know that I can't import the files without either redirecting erroneous rows to a different output or fixing the bad rows before import. I just have a feeling that my bosses won't like the idea of potentially "ignoring" those bad rows.
SOLUTION
This is hospital data and they know that it's not the best but we will need to try and use everything that we can.  

Thank you to both of you, as I am trying to research as much as I can so I have answers/solutions for things they might bring up.
Yes, I figured. Mine was state community health center data. I see a pattern there too. :-)
Sorry if I closed the question prematurely, I just figured that I couldn't be the only one who has issues like this.
We all have issues.  Some of us have an entire subscription.

Thanks for the split.  Good luck with your project.
Would anyone happen to have an example of C# code they used to search for and fix missing column delimiters?

If not that, would you have an example of a C# script that finds columns (strings) that have "" in them and replaces them with just a single " ?

Thanks again!
^^^ I'd ask that as a question in the C# zone, as what I know about C# can be written in large letters on the back of a matchbook cover using a grease pencil, and I suspect many other SQL Server experts are the same way.
I could probably write something. Is the search-and-replace pattern in #a42115215 what you need? Are you sure? I think it would just be a regex operation.
Megan,

Yes it would be.  I have another case where I have a column value like

|""some text here." Some more text here."|

I have been thinking about RegEx myself but have no idea where to begin with it.

Thank you again!
That's why I asked "are you sure?" :-)

So far, you know of two patterns. One involves two properly formed quoted strings that need a pipe inserted between them. The other involves seriously malformed quoted text that needs repair before insertion (unless that was a typo above).

Are there escapes to deal with? Can the data itself contain double quotes or pipes? Can a row contain more pipes than there are columns, and if so what should be done with the surplus?
Thanks.
Every column should be delimited with a pipe; there are cases where some are missing.

Every text string in the file should begin and end with a double quote. There are cases, as shown above, where there are multiple quotes in the same string (column). In other words, that is one column but it has multiple sets of double quotes.
Do you happen to have a sample data file available already?
Good morning,

I currently have files that I am working with. I would have to scrub the data to share a sample file with you. Is that what you are asking?
Just a few scrubbed lines, some good, some bad, would be perfect. That can save us a long string of questions and answers.

I wrote a stub last night, an SSIS script transformation component, and I am almost ready to plug in the regex. I have several other things going on, but regexes are something I lose if I don't use them often and I am always glad to take another one on.
I have attached a small text file (UTF-8) with a header row. Obviously, column names and underlying data have been changed, but the format stayed the same.

  • Row 1: Header row
  • Rows 2-6: Good data rows
  • Row 7: Missing column delimiter in columns 9 and 10
  • Rows 8-10: Multiple columns have double the text qualifiers in them

Text qualifier: " (double quote)
Column delimiter: | (pipe)

Thank you.
regex-data.txt
This sample looks suspiciously like yet another instance of improperly escaped delimited text data. I located a C# utility, CSVFixer, on GitHub for fixing bad CSV files, and I might just be able to adapt it as a couple of class files added to the script component. If not, back to the regex.

Improperly escaped delimited files are a problem that comes up for me often enough, and I'm willing to spend some time on finding a solution for that reason alone.
Thank you! I will certainly take a look at the CSV Fixer and see what I can do.

As always, thank you for your time, and if there's anything I can ever help you with, I'm more than happy to do so!
This happens, by the way, when somebody takes delimited data containing characters that require escaping, such as ("), and simply wraps the column with double quote text qualifiers.

Common Format and MIME Type for Comma-Separated Values (CSV) Files
Thank you for the link. I will also try to see if there is a way that I can marry those cs files with my SSIS Script project so I can implement them.
I'm working on that now. I don't know if you will want to retain the embedded double quotes or not, but that is what the data looks like to me, especially when the line begins and ends with "" (which should be """"). I'm not sure why there would be other double quotes embedded in the middle of the line, but (") is a common symbol that can represent things like inches or seconds or even a double prime.
Oh, thank you!

The extra sets of "" inside of a column are where someone entered a quote that somebody said during data entry. If it is easier, we can strip all " and just leave the import file column-delimited.
When using code like CSVFixer it might be best to let it do its thing on the original data and then do any stripping later. Or don't strip at all and preserve the data as closely as possible to what was provided.

I'm working on the adaptation as a library (with a constructor parameter for the delimiter) now, although I have a meeting coming up in a half hour. (It's weird -- I have more meetings and phone calls now that I am retired than I did as a full-time developer. I don't fully understand that.)
I have something that compiles and runs, but it seems to be stripping all quotes. I didn't expect that. Here's some tentative output - I'll have a further look after my meeting.
CleanData.txt
CSVFixer simply strips the unexpected double quotes. That works for rows 8-10 but not row 7. Your data is unusual in that the column delimiter is actually missing, twice, in row 7, and we can't blame the 'fixer' logic for that. What I can blame it for is using a mess of conditional logic instead of a state pattern, making it difficult to modify, and leaving me wondering what it actually does in other special cases. (It reminds me of the good old days of spaghetti code, but written as structured code.) I'm inclined to refactor it to use an FSM rather than try to simply modify it.

Now that I have a deeper feeling for the problem, I don't think regexes are a particularly good solution. I would probably have to pre-process the pipe delimiters anyway. A state-driven parser, on the other hand, would be extensible to allow for even more peculiarities, should they turn up later on.

Not every bad piece of data can be fixed automatically, and manual cleanup could be required on occasion. What would you want to do if the number of columns in the data exceeded the number of columns in the header? If it is less, we could add pipes at the end, or not (since it might signal a different case of a missing pipe in the middle). Just leaving both cases could allow a subsequent data flow to divert those rows to the error output. Actually, I like that last option.
I ended up writing my own state-driven delimited text parser that I can tweak as needed. My version will pivot a single column of potentially dirty delimited text into separate clean fields using a script transformation component (I could turn that into a custom component, even). I can tweak another copy to do what you need to do. It doesn't look like I am going to finish tonight, though, and I am booked for most of the next two days.
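For readers following along, here is a heavily simplified sketch of what a state-driven parser of this kind might look like. This is illustrative only, not the DelimitedTextCleaner source; the state names and the "treat stray data after a closing quote as a missing delimiter" repair rule are assumptions made for the example:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Toy state-driven parser for quote-qualified, pipe-delimited text.
// Illustrative sketch only -- not the actual DelimitedTextCleaner code.
class MiniParser
{
    enum State { StartOfField, InQuoted, AfterQuote }

    public static List<string> ParseLine(string line)
    {
        var fields = new List<string>();
        var sb = new StringBuilder();
        var state = State.StartOfField;

        foreach (char c in line)
        {
            switch (state)
            {
                case State.StartOfField:
                    if (c == '"') state = State.InQuoted;      // opening qualifier
                    else if (c == '|') fields.Add(Flush(sb));  // empty unquoted field
                    else sb.Append(c);                         // unquoted data
                    break;
                case State.InQuoted:
                    if (c == '"') state = State.AfterQuote;    // possible closing quote
                    else sb.Append(c);
                    break;
                case State.AfterQuote:
                    if (c == '|') { fields.Add(Flush(sb)); state = State.StartOfField; }
                    else if (c == '"') { sb.Append('"'); state = State.InQuoted; } // escaped ""
                    else
                    {
                        // Data right after a closing quote: infer a missing |"
                        fields.Add(Flush(sb));
                        sb.Append(c);
                        state = State.InQuoted;
                    }
                    break;
            }
        }
        fields.Add(Flush(sb)); // final field
        return fields;
    }

    static string Flush(StringBuilder sb)
    {
        string s = sb.ToString();
        sb.Clear();
        return s;
    }

    static void Main()
    {
        // "1"2.5" splits into two fields once the missing delimiter is inferred
        Console.WriteLine(string.Join(",", ParseLine("\"1\"2.5\"|\"ok\""))); // 1,2.5,ok
    }
}
```

The advantage over a regex, as noted above, is that each new peculiarity in the data becomes one more transition in the state table rather than an ever more fragile pattern.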
Megan,

Thank you so much for all of your help and time. You are awesome. So I can continue learning, I will use that CSVFixer file(s) and see if I can come up with something

Thank you!
Megan, I didn't see your comment with the "spaghetti code" in it until now. I would agree with your last statement about writing those rows out to an error file.
What I am writing is going to be flexible. There is a class for doing the parsing/cleaning, and then the script transformation main code can decide what to do with errors, and whether or not to pivot.
Thank you so much!
I'm testing a "straight" version this morning. When that works I will create a custom version that does the delimiter restore.
Awesome. Thank you again!
Having a quoted field that ends in ""| is especially messy. It is a valid sequence and normally it would be interpreted as two characters, an embedded double quote followed by an embedded delimiter character, and it would not end the field!

I can flag the fact that the field opened with ""<text>, which is not valid, and I can discard the 2nd opening quote and later treat ""| as simply "| (end of field) in that particular case. Does that seem like the right idea?
OK, next problem, and I should have noticed this earlier. Line 7 is the one with the missing delimiter, but it already contains the correct number of delimiters, 17 (18 columns), and re-inserting the missing ones gives it too many columns (20). What would you like to do about that? Or is this a problem with the test sample? It wouldn't happen if the "1"2.5" didn't appear twice.
Megan, per your first comment, around 50 minutes ago, I think that sounds fine as long as the column would come out with |"<text>"|.

As per your latest comment, you are absolutely right and I apparently can't count. I'm honestly not sure what to do with that row. I guess for that row, if we can remove the extra " in between the numbers, we can go on from there. Or, if there is a way, we can flag that row and send it to an error file. The actual value might be 12.5. I will have to research that at work tomorrow.

In the actual data set, I have also spotted an empty row: no delimiters, no text qualifiers, nothing except an {LF} at the end. I can work on removing that row manually or possibly having SSIS flag it for the error file.
I will add a return value to the parser that indicates if formatting exceptions were encountered, so that the script component can redirect those rows.

The state machine logic could also be adjusted to remove a single double-quote found in the middle of a quoted field. I think the best thing, however, would be to redirect the original rows to an error output, since in general there isn't any one best way to correct errors automatically.
Awesome! To be honest, we want to try to fix things automatically, but if certain rows need to be flagged and imported manually, that is certainly acceptable.

Thank you again!
The custom version of the code can fix certain "known" errors and not flag them as errors. This could lead to unexpected results, though, if the data turns out to be dirtier than you thought.

Are there any "known" errors right now to which that would apply? What should I do with the double-double quote case that I mentioned above (#42136122)?
I don't believe the other files are any dirtier, but they very well could be. Pretty much the known errors right now are what I sent: double-double-quoted strings, which should just be a single double-quoted string "<text>", and an errant " in a field (like "1"2.5"). The only other one that I know of is a completely blank row, other than a line feed character at the end.

As for double quoted fields, if it is a blank (null) field |""|, leave it as is. If it is a field that has data in it and it has more than one set of double quotes, remove the inside set, to leave |"<data>"|.

Also, if it is easier to completely strip out all ", I think that would work.
My code removes unnecessary double quotes when it parses. There is an option to enclose all fields in quotes when retrieving the result string from the parser; otherwise enclosing quotes are only used when required (Excel-style).

Alternatively, the individual fields can be retrieved as an array of strings, and then you can do anything you want with them. In array form, there are no enclosing quotes at all, and no doubling of quotes to escape them -- it's just the actual data. The field names from the first row can also be retrieved this way.

The internal representation is collections of fields, and I use LINQ queries to format the returns. It can be made to do whatever is needed.

We do need to recognize the double-double quotes at both ends, then? Is it an error if a field starts with one and doesn't end with one?
I have briefly looked into LINQ and will look more into it.

If we are going to use a " as a text qualifier, then the data in each column would need to be "<data>". As for two sets of double quotes in the same field, we would want to retain the outermost set.

|"Data starts, "here and continues...""|  becomes |"Data starts, here and continues..."|

As far as a field starting with one and not ending with one, yes that would be an error if the text delimiter in the package is set to a ".
I am generally following RFC 4180, as is the SSIS delimited flat file parser, and there is no concept of nesting of double quotes. A double-double quote in a quoted field is the escaped form of a double quote. That, in turn, and anything else, is escaped by enclosing the entire field in double quotes. So if you want field delimiters as data in a field, enclose the field in double quotes.

A lone double quote anywhere within a quoted field is not valid. What I can readily do is remove all of those. Within a quoted field (not counting the enclosing quotes), only even numbers of double quotes are permitted, and each such pair translates to a single double-quote.
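To make those two rules concrete, here is a small hedged example. The method names are mine for illustration, not part of any code posted in this thread, and the second method implements the repair option discussed below (promoting a lone invalid quote to a proper escaped pair):

```csharp
using System;
using System.Text.RegularExpressions;

class QuoteRules
{
    // RFC 4180: inside a quoted field, "" is an escaped double quote.
    // Strip the enclosing qualifiers, then collapse each pair.
    public static string Unescape(string quotedField)
    {
        string inner = quotedField.Substring(1, quotedField.Length - 2);
        return inner.Replace("\"\"", "\"");
    }

    // Promote a lone (invalid) quote inside field data to an escaped
    // pair, leaving already-valid pairs untouched.
    public static string DoubleLoneQuotes(string fieldData)
    {
        return Regex.Replace(fieldData, "(?<!\")\"(?!\")", "\"\"");
    }

    static void Main()
    {
        Console.WriteLine(Unescape("\"say \"\"hi\"\"\"")); // say "hi"
        Console.WriteLine(DoubleLoneQuotes("1\"2.5"));     // 1""2.5
    }
}
```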
I will go with you here, as you know more about standards and such than I do. All I know is that the SSIS flat file parser sees a " within an already quoted field as other text, or it just throws an error. I just assumed it would be best/easier to remove any double quotes inside the string of data that don't belong.
Another thing I can do with lone double quotes within a quoted field is turn them into proper double quotes, so that they appear as double double-quotes. This might be the best default behavior for the code, now that I think about it. It would not then be possible, however, for an actual embedded quote to be distinguished from a generated one. But this is a problem, in general, with trying to fix corrupted data.
Whatever you think is best. I lean on your knowledge.  Thank you!
Turning the invalid quotes into valid double quotes leaves them visible, and the text in which they appear remains readable, if not always grammatically correct.

When you have a numeric value, however, that happens to have a double quote embedded in it, that's a problem! Have you figured out what those were supposed to be?
No, I haven't figured out what those are supposed to be yet, as I assumed they were supposed to be different columns. I won't be able to figure out what they might be until tomorrow, when I get into work and ask some colleagues.

I'm assuming the extra quote in the middle just needs to be stripped out of both of them, or, if it is easier, we can just flag rows like that for the error disposition. I don't believe there are many of those in the data, whereas I know that in just one of the files, ""Yes"" occurs around 3,300-ish times.
Converting lone quotes to valid quotes is relatively benign, since it doesn't remove information, and it should be OK to not flag that as an error. I would think that numerical data containing quote marks (that don't represent units such as seconds or inches) ought to be treated as an error unless you can determine exactly what happened to it and how to fix it (if it can be fixed).

I will wrap up the code I have and then upload a preliminary copy here. I am planning to publish it on github, but I am still working on the unit tests, and I have a paying job waiting so that may be a while. All the source will be there and it's not too hard to change.
Thank you so much for your hard, dedicated work! I will be looking forward to it. Once you publish the code, I will obviously unit test on my end as well.

Again, if there ever is anything that I can assist you with, now or in the future, please don't hesitate to let me know.
My unit tests are finding bugs. I don't know whether to be sad or glad. I will still post the code shortly. You can help with additional testing.
I most certainly will help any way that I can!

Thank you!
Here it is on github: https://github.com/mcb2k3/DelimitedTextCleaner

I decided to go ahead and push it, rather than drop a temporary copy here. I made so many changes that I needed to write unit tests to find all the new bugs, and after that there was no reason not to push. I haven't written tests for everything yet, but this is a good start, and I found quite a few bugs.

You can use the return value from the CleanText method to determine if there were any problems with quotes. You can then use the ReturnFields method to examine specific fields to determine if they are unusable, such as when numeric fields contain quote characters.

The ReturnText() method returns the entire cleaned line that was presented to CleanText. It can include quotes around all fields or only those that require quotes. It would be used in an SSIS script component to write clean text back out to a flat file.

The ReturnHeaders() and ReturnFields() methods can be used in a script component to pivot the field data out to individual data flow columns created by the script (running in asynchronous mode).

I am planning to add another method that will return the field data along with flags to indicate if quoting is required and/or if a given field contains invalid quotes, but I didn't get to that today. I can add it this week if it would be useful right away.
And oh yes, any issues with the code can go straight to GitHub.
Megan,

Thank you so much for all of your hard work.  It is GREATLY appreciated! I will download and add the code at work tomorrow and test it.  Issues I run into will certainly get posted to GitHub.

As for the other two methods, I think they would be good to have but only if you have time this week and if you didn't I completely understand.

Thank you again and I wish I could give you like a million points for this!
You are welcome. I racked up a lot of "learning points." These are use-it-or-lose-it skills, and I really don't want to lose my C#. Besides, I plan to use this in a project I am working on.
Great, I was hoping you had another reason for coding this other than just me, though I do certainly appreciate it. My first programming classes were in the C suite and I have since not used it very much, and I have been looking to get back into it myself.

Now, I just need to remember how to attach your "libraries" into my project and then integrate your methods into the script task....

You have taught me a lot, as I have peeked at your code on GitHub and I'm sure I'll learn more.  Thanks again and I am happy to help you any way that I can, now and in the future.
The simplest way I have found to use a class file like this is to download Cleaner.cs and copy the file to the clipboard and then, with the SSIS VSTA script project open, paste it into the project. Add a "using Sql2Go.DelimitedFileCleaner;" (I think that's right) to main.cs and the class will be available. To update, copy the text of the new version to the clipboard and paste it into the Cleaner.cs file while it is open in the VSTA project.

You can compile the DLL and add it to the GAC (after adding an SNK file to the project), but I think that is more trouble than it is worth, and it makes it more difficult to debug.
Thank you again. I will stay away from DLLs, as none of our projects use them except for one add-on used to parse zip files.