• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 470

How can I best ensure CSV format consistency prior to DB import?

Hello - I'm a program manager for a software dev company that deals with very data-intensive software. We import massive amounts of data daily from a CSV file via an NT service.

Typically, nothing ever changes in the file structure from day to day. However, every once in a while, an extra column gets added accidentally or a new IT guy will put his hands on it and screw things up.

When this does occur, it creates a trickle-down effect that results in substantial database issues. These issues can easily take our support staff anywhere from 30 minutes to an entire day (or more) to resolve.

I know there are ways safeguards can be put in place to help validate the consistency of the data. What are some of the best practices & techniques used to do something like this?

I know that we could look at column data and compare it against the last successful import for things like "are they both numeric columns?", etc. That type of thing isn't foolproof, though: if position 3 is a zip code and position 4 is a numeric CustomerID field, what happens when a new column gets inserted at the front of the data? A numeric-only check won't cut it. Beyond that, I'm drawing a blank. There has to be a much more proven and tested approach out there than just that.

Keep in mind, this data can be ordered in any manner depending on the setup of the export file. I'll list a few examples of the fields that could be included: CustomerID, FirstName, LastName, Address, City, State, Zip, Phone, etc.

If anyone could provide ANY insight or suggestions on this, it would be greatly appreciated!! We lose hundreds of hours in labor yearly because of it.

I know it's a very difficult question since it's just string comparison, but someone has got to have come up with a process to help ensure data consistency (or at least detect when something has gone haywire).

3 Solutions
slightwv (Netminder) Commented:
Great question!

Off the top of my head, I would look at some sort of customer-defined data definition for the fields.

For example, in XML there is a DTD or Schema that can be applied to the XML that validates the contents.

In a database world, you have the data dictionary, constraints, etc...

I would look at implementing some sort of way for the users to define their own feed layout.

The problem here, using your example of a Zip Code, is what makes a 'valid' zip code. Does it have to be an actual zip code, or just a set of numbers that follows a set of default formats (11111, 111112222, 11111-2222)?
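
Here is a minimal sketch of the kind of customer-defined feed definition described above, assuming Python; the column names are borrowed from the question and the patterns (including the three ZIP formats) are illustrative assumptions, not a real spec:

```python
import csv
import re

# Hypothetical feed layout: column name -> pattern its values must match.
# The ZIP pattern accepts the three formats mentioned above:
# 11111, 111112222, and 11111-2222.
FEED_LAYOUT = {
    "CustomerID": re.compile(r"^\d+$"),
    "FirstName":  re.compile(r"^.+$"),
    "LastName":   re.compile(r"^.+$"),
    "Zip":        re.compile(r"^\d{5}(-?\d{4})?$"),
}

def validate_feed(path):
    """Return a list of problems: header drift first, then per-value failures."""
    errors = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        expected = list(FEED_LAYOUT)
        if reader.fieldnames != expected:
            # An inserted, dropped, or reordered column is caught here,
            # before any row-level checks run.
            return [f"Header mismatch: expected {expected}, got {reader.fieldnames}"]
        for lineno, row in enumerate(reader, start=2):
            for col, pattern in FEED_LAYOUT.items():
                if not pattern.match(row[col] or ""):
                    errors.append(f"Line {lineno}: bad {col}: {row[col]!r}")
    return errors

if __name__ == "__main__":
    for problem in validate_feed("daily_feed.csv"):
        print(problem)
```

The point of keeping the layout as plain data is that the rules live somewhere the customer can edit, which is the same role a DTD or Schema plays for XML.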
Surendra Nath, Technology Lead, Commented:
If this were a problem I had to solve, my initial approach would be as follows.

Option 1:

1) Use SSIS to load data into the flat table
2) Use script tasks and define the custom validators for each field
3) If the script task returns false, redirect the record to a bad-record flat table; otherwise send it to the good one.

Option 2:

If your team doesn't have enough SSIS experience:

1) Bulk load the file. (There is a chance bcp itself will fail at times if a data type does not match, so it is good to make the data types of the flat table character at first.)

2) After the bulk load, write a stored procedure that performs the custom validation.

Although Option 2 still solves the issue, it does not have the superior capabilities of SSIS, so I recommend you go with Option 1.
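
A rough sketch of the validate-and-route step common to both options, assuming Python in place of an SSIS script task or a T-SQL stored procedure; the validators and file name are stand-ins:

```python
import csv

# Hypothetical per-column validators; the real rules would come from the
# business spec. Every value is treated as text first, as in Option 2.
VALIDATORS = {
    "CustomerID": str.isdigit,
    "State":      lambda v: len(v) == 2 and v.isalpha(),
    "Zip":        lambda v: v.replace("-", "").isdigit()
                            and len(v.replace("-", "")) in (5, 9),
}

def route_records(path):
    """Split staged rows into good and bad sets, mirroring the two flat tables."""
    good, bad = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            failures = [col for col, check in VALIDATORS.items()
                        if col in row and not check(row[col])]
            if failures:
                bad.append((row, failures))   # would go to the bad-record table
            else:
                good.append(row)              # would go to the good table
    return good, bad

if __name__ == "__main__":
    good, bad = route_records("staged_feed.csv")
    print(f"{len(good)} good rows, {len(bad)} bad rows")
```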
>>this data can be ordered in any manner
Does this mean the columns can come in any sequence?

>>depending on the setup of the export file.
is "setup" manual?
please provide more on this aspect

what are the source systems? (e.g. Excel)

are you already using staging tables?

what part is played by each of these: Oracle? MS SQL Server? MS Access?

David, Senior Oracle Database Administrator, Commented:
Books and articles galore on data cleansing, of course. Since you included Oracle, I'll speak to the general ETL practice of (1) getting from the customer the EXPLICIT business rules of what is acceptable; (2) configuring the load tool (for example, Oracle Data Pump) to handle outlying data (for example, bad data moved to a BAD file for manual cleansing, count limits to force a job kill); (3) configuring the database so that the target tables and columns have pre-defined constraints.

Example: date fields must begin with a four-digit year within two years of the current calendar year, concatenated to a two-digit month between 01 and 12, then concatenated to a two-digit day between 01 and 31. Oracle has many date-conversion functions, so a field value of 20130231 would fall out as bad.
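
The same date rule as a quick sketch, assuming Python rather than Oracle's date-conversion functions; the function name is a stand-in:

```python
from datetime import date, datetime

def is_valid_load_date(value: str) -> bool:
    """YYYYMMDD: must parse as a real calendar date whose year is within
    two years of the current calendar year."""
    try:
        parsed = datetime.strptime(value, "%Y%m%d").date()
    except ValueError:
        # Catches impossible dates such as 20130231 (February has no 31st).
        return False
    return abs(parsed.year - date.today().year) <= 2

print(is_valid_load_date("20130231"))  # False -- falls out as bad
```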

In a recent contract, the customer was attempting to consolidate several hundred inventory sources into the Oracle E-Business Suite (ERP). For a certain group of sources, the data coming in had to have a certain level of cleanliness -- say, five percent bad data was tolerable. Another group, perhaps old archive records, could be accepted without validation. It all depends upon the business requirements.
Make whoever damages the data do the cleaning until the next problem happens, when the mantle moves on :)

Seriously, though, it really rather depends on the complexity of your data, but you can guard against gross violations - like adding a column - with a small program that reads the first (or second, or whatever) line in as just straight text and compares it against a set of rules. If it fails then call a human.

That will cope with the extra columns, etc. As for the problem of the "new IT guy", this is a process problem: set permissions on the CSV file so it can't be touched between source and consuming program.
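
A minimal sketch of that "read the first line as straight text and check it against a set of rules" guard, assuming Python; the expected header and the way a human gets called are stand-ins:

```python
# Known-good header from the last successful import; hard-coded here for the
# sketch, but it could just as easily live in a config file.
EXPECTED_HEADER = "CustomerID,FirstName,LastName,Address,City,State,Zip,Phone"

def header_looks_right(path: str) -> bool:
    """Read the first line as plain text and compare it to the known-good header."""
    with open(path, newline="") as f:
        return f.readline().strip() == EXPECTED_HEADER

if __name__ == "__main__":
    if not header_looks_right("daily_feed.csv"):
        # "Call a human": here that just means stop before the import runs.
        raise SystemExit("Feed header changed -- halting import and alerting support.")
```

Because the comparison is on the raw header text rather than on column types, a column inserted at the front fails immediately, which covers the zip-code/CustomerID shift described in the question.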


trs28, Author, Commented:
There are some great ideas here... some of them we have in place already; some of them we've never explored. We're very database-savvy, but unfortunately we all wear many hats here, so no one has the ability to focus even 60% of their time on tuning database processes like these.

Thanks for all the input everyone!
trs28, Author, Commented:
Thanks again, everyone! The info gave me a bunch of ideas about how we can tighten the safeguards we already have in place.