Link to home
Start Free TrialLog in
Avatar of Paula DiTallo
Paula DiTalloFlag for United States of America

asked on

How Do I Approach Importing *.CSV Files with Known Issues Using Microsoft SSIS?

Fellow Techies,
I have vendor files arriving in *.csv format. Inevitably extra commas or line feeds, etc. will corrupt the file as in the one attached (bidsMade.csv).

Our shop uses SSIS for file imports/small ETLs.  How do I approach programmatically recognizing the corruption, then removing it, and finally, reporting on it?

Avatar of ste5an
ste5an
Flag of Germany image

Well, it is pretty simple, but most probably not the answer you look for: Use XML or another format which allows validation.

CSV requires that you escape column data, if it contains the column or row delimiter. The common approach is to quote the content in this case. The quote itself is then, when part of the content escaped by doubling it. So the CSV generator is buggy.

For your problem:  There is no general solution.

You must know the format, expected content and further constraints.
E.g. does the CSV represent a relation? Then you can simple load it and check the number of columns per row and whether the columns data type is always the same, you may even test the domain of the data.
Avatar of Bill Prew
Bill Prew

The only way you can approach this is to identify all the "bad data" scenarios you can, and for each come up with (1) a way to identify that case, and (2) a repeatable way to repair that bad data.

I suspect that may be fairly hard if not impossible for some cases, where a human looking at the data evaluates many data points in trying to determine if the data is valid, and if not what the correct data should be.

For example (simple example), if the import was supposed to contain 3 fields (columns) and the data came in as:

A1,B1,C1
A2,B2,C2,D2
A3,B3,C3


Clearly the second line has an error of some sort since there are too many columns in it.  So one of the data fields perhaps contained two values separated by a comma, and should have been quoted but wasn't.  But how do you know which of the following might be correct?

"A2,B2",C2,D2
A2,"B2,C2",D2
A2,B2,"C2,D2"


So some cases of bad input may not be solvable, others may be.  But work through the first thing I mentioned, which is trying to identify all the "edit checks" that could be performed to make sure the data coming in is valid, and what criteria can be used to detect each problem.  Once you have that move on to figuring out which ones might be correctable in code, rather than needing human review.

Once you have more details feel free to post more info so we can comment with more detail.


»bp
Using Bill's example from above:

A1,B1,C1
A2,B2,C2,D2
A3,B3,C3

Open in new window


You cannot say that the above CSV has an issue or not. Cause CSV means only comma delimited values per row. CSV does not impose that the entire file is a single relation, a table. But even if it is a table, it does not mean that columns at the row end are mandatory. The last column can be optional.

In general, without knowing anything about the CSV content and its constraints, you just cannot say that a CSV is "corrupted".

Which brings me back to your question

How do I approach programmatically recognizing the corruption [..]?
XML and XML Schema where exactly created to do this.

[..] then removing it, and finally, reporting on it?
I would only report it. The possibilities to "repair" it to wrong data is not neglectable, And the responsibility for correctness would shift from the data source to you the data consumer.
ASKER CERTIFIED SOLUTION
Avatar of Paula DiTallo
Paula DiTallo
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
I would strongly insist that the vendor delivers the data as XML with an appropriate XML schema.

And to be clear, the data as it is delivered is most probably an export bug, thus a problem of the vendor. But an honest advice: Don't ask for a specific solution. At least here in Germany, this would imply that problems arising from this special request are your problem, not the vendors.

So I would ask for XML and XML schema or a fixed CSV export (just a screenshot, not details about the solution).