First of all, I am using SQL Server 2000.
I need to develop a complex ETL system that impors data daily from text files we receive from one of our external partners. While I have done daily imports and transformations before, this is adding a level of complexity in that we are now processing 'delta' data which represents data that has changed in some way (added or deleted or column values changed).
I am staging the data into tables from which I will process into the 'base' tables. The data we receive can represent a new account that has been added by the partner, the lack of a previously sent account record which represents the account is no longer active, data that has changed (e.g. name or adddress change), or data that is exactly like what is already in our system.
I initially do a match on all columns between the staging tables and base tables and delete rows from the staging tables that are identical, so what I am left with is a true 'delta.'
The base table contains an effective date and expiration date which is indirectly a part of the original data that is in the staging table (it is contained in a header record of the text file and I parse it into a variable up front.
No records are ever deleted from the base table. I just set the expiration date to indicate it is no longer valid.
When a record is added to the base table, the effective date is set based on the header record date.
If a record is in the base table and not in the staging table, that means it has been deleted from the primary source. In that case, I want to set the expiration date to the file header date - 1, but not physically delete the record from the base table.
If the record in the staging table is also in the base table, then that indicates that one or more column values changed. In that case, I want to set the expiration date of the existing record as above, and then add a new record with all the data from the staging table and set the effective date as above.
As such, there should always only be one record for each active account with a null expiration date and there will be no records with a null expiration date if the account is not active.
We can receive data that represents 65,000 or more accounts per day. On top of that, the text file (fixed length) contains 75 repeating groups which represent 'service codes'. I am parsing them to a separate table. I have to handle service codes the same way I do the primary account data (effective date/expiration date logic).
I am trying to determine the best way to proceed with this. It appears that I am going to need to do some inserts (new accounts, or accounts that have changed) and updates (expiration date of original record for accounts that have changed data or have been 'deleted').
I am having some trouble trying to conceptualize the order of processing, what should be done in what order to make sure I handle all possibilities, etc. I am sure one or more of the experts have done something similar to this (without the use of a third-party tool, which is not a possibility here).
One thing I am thinking of doing is using INSTEAD OF triggers with an update. I can check in the trigger and see if I am updating only the expiration date, in which case, I do just that (which represents the record being 'deleted'). If any other columns are being updated, then I would capture the expiration date in the inserted table and actually update the original record with that value and then insert a new record with a null expiration date and the new effective date. This way, updates and 'deletes' would be handled through the trigger code and any new accounts can just be inserted into the table.
Does this sound like a proper approach? Any other thoughts on how it should be tackled?
Start Free Trial