chokka

asked on

SSIS Package to import / validate around half million records

We bulk insert around half a million records into a staging table and then into the production table.

Frequently, a column data type or size validation failure breaks the bulk insert from the staging table to the production table.

I will have to redesign the process as an SSIS package that inserts all the good records from staging into production and loads the bad / errored-out rows into an error table, each with a proper validation message.

The validation message is based on the first failing column. If multiple columns in a row fail, we can report just the first column's error and reload the data after fixing it.

Example Error Message: Column PurchaseDate has invalid data

How do I build this complete workflow as an SSIS package? Which controls do I need to start with?

Please :)
HainKurt

Maybe you can create a staging table where all columns are nvarchar/varchar, then upload into it as-is...
then write a query to filter out the bad rows...
then export from this staging table to be imported into prod...
chokka

ASKER

My staging table has nvarchar columns and accepts data in any format. To give an example, if you send characters in the date field, my staging table accepts them.

The current process bulk inserts from a flat file into the staging table and then into the prod table.

So when I get any invalid data, the staging-to-prod load fails. I am having a hard time finding out which row, and then which column, is causing the failure.

A better option for me would be an SSIS package with a Conditional Split, but I need advice on handling the errored-out records while processing from staging to prod.
HainKurt

What is the error message?

Just write a query with a bunch of ISNUMERIC(), ISDATE(), ISNULL() conditions to filter out the bad data...
then use that view to get the good data to import into production.
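A minimal sketch of such a filter view. The column names (PurchaseDate, Amount) are hypothetical examples, since the actual feed layout isn't shown in the thread:

```sql
-- Staging holds everything as nvarchar, so nothing fails on load.
-- Column names below are examples, not the actual feed columns.
CREATE VIEW dbo.vw_Staging_GoodRows
AS
SELECT PurchaseDate, Amount
FROM dbo.ssis_Staging
WHERE ISDATE(PurchaseDate) = 1    -- parses as a date
  AND ISNUMERIC(Amount)    = 1;   -- parses as a number
```

Production would then load from this view with CAST()s to the real data types; rows failing either test stay behind in staging for review.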
Jim Horn
In the past I've done this by adding two columns to your staging table: Error Count and Error Description.

Then write a stored procedure that queries the staging table for whatever you are trying to validate: dates are dates, numbers are numbers, state codes are valid, fuzzy squirrels are before smiling mooses, whatever.  UPDATE queries increment Error Count and write an Error Description for any rows that fail validation.

Then in your SSIS package ONLY pump from staging to prod those rows where Error Count = 0, then come up with a graceful way to handle rows where Error Count > 0.
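A sketch of that pattern in T-SQL, using a hypothetical PurchaseDate column; COALESCE keeps only the first error written per row, which matches the "first failing column" rule from the question:

```sql
-- One UPDATE per validation rule; repeat for each column to check.
UPDATE dbo.ssis_Staging
SET ErrorCount = ErrorCount + 1,
    ErrorDescription = COALESCE(ErrorDescription,
                                'Column PurchaseDate has invalid data')
WHERE ISDATE(PurchaseDate) = 0;

-- Pump only clean rows to production; bad rows stay behind with a reason.
INSERT INTO dbo.Production (PurchaseDate)
SELECT CAST(PurchaseDate AS date)
FROM dbo.ssis_Staging
WHERE ErrorCount = 0;
```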
chokka

ASKER

@Lvl66, I am sorry, I'm a little confused. I like the idea of having the two columns.

But you are suggesting doing the validation on the staging table.

The staging table is designed to import the data from the flat file. In the staging table, we accept any form of source data.

Do you think it's a good idea to do validation on the staging table?
Jim Horn

For starters, my name is Jim, not Level 66.

In my experience, and your mileage may vary, validation is usually done on a staging table.  Most of my work has been dumping files into a table with all varchar columns, which guarantees that every row makes it into the table; once it's there you can do all your data validation, then INSERT the good rows into production.

The idea is that data is ONLY inserted into the final destination table once it is deemed to have passed all validation rules.

It is possible to validate in your SSIS package.  The benefit is the load runs faster, but the costs are that it's a little riskier and you lose some control.
chokka

ASKER

@Jim, sorry for the delay.  Can the staging table still accept the raw source data? Or are you saying to validate the source data even before inserting it into the staging table?
The intent of a staging table, using my reasoning above, is to have a separate place to do all of your data validations BEFORE inserting the data into the ultimate destination.   Usually I prefix this table with etl_ or ssis_, and (controversial stuff follows) I leave the data in that staging table in case there is a prod support issue to deal with; when the feed runs again, the first thing I do is TRUNCATE TABLE on the staging table to start over.

>Or are you saying to validate the source data even before inserting it into the staging table?
Think of it this way..

ETL - Extract, Transform, Load - do all your validations in the SSIS package and INSERT into the final SQL Server table, which is defined with the correct data types, with the understanding that all rows pass validation.

ELT - Extract, Load, Transform (also known as 'Dear Source Data ... I don't trust you') - load the data into a table defined as all varchars, maybe even one column for the entire row, then execute T-SQL to do all of your validation, such as dates are dates, numbers are numbers, state codes are valid, then insert the data into your final destination.

ELT solutions cost more to build in hours and dollars, as it's more code to make the feed bullet-proof, but they also reduce the chance of a failure or a partial load when there are errors.
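A sketch of the ELT-style staging table described above. The feed columns are hypothetical examples, with the two error columns from the earlier suggestion included:

```sql
-- Every feed column is nvarchar, so no row can fail the bulk insert.
CREATE TABLE dbo.ssis_Staging (
    RowId            int IDENTITY(1,1) PRIMARY KEY,
    PurchaseDate     nvarchar(50) NULL,    -- example feed column
    Amount           nvarchar(50) NULL,    -- example feed column
    ErrorCount       int NOT NULL DEFAULT 0,
    ErrorDescription nvarchar(200) NULL
);

-- Start each feed run clean.
TRUNCATE TABLE dbo.ssis_Staging;
```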
chokka

ASKER

@Jim, I completely agree with you. In our case, when we get the feed / source data, there might be some bad data around the 100-thousandth row or somewhere in the middle. We need to pick out that bad data and send it back to the vendor.

We currently truncate the ssis_ stage table on every load process.
ASKER CERTIFIED SOLUTION
Jim Horn
chokka

ASKER

Great, I hadn't thought of populating those bad rows in an error-out table.  Based on those error-out table records, I can validate and send out the details.

Thank you so much !!!!
chokka

ASKER

Thank you so much !!! Very helpful