Mass Imports + De-dupe + Efficiency
Posted on 2011-03-22
This is a question more about practices/methods, than code itself.
I have used SQL for a long time now and can perform a variety of tasks. However one task has been bothering me now, and that is... "How best to import data?"
I've read up on BCP and know that Bulk Inserts are best for performance, I have tried SSIS and DTS too to import data. However I am failing still to find a harmonious balance of my requirements.
My current process is something like this:
1) 3rd Party calls my WebService (nightly) and passes me records in XML (Person Data)
- This data can contain a contact, and may also contain multiple address records too - As child nodes in XML.
2) I clean up some of the records in VB.net first (i.e. fix invalid dates)
3) I then offload each record node to a Stored Procedure in the database.
Person1 -> ImportContactRecord
Person1 Address1 -> ImportAddressRecord
Person1 Address2 -> ImportAddressRecord
Within each Stored Procedure are a number of de-duplication rules. For example
#1 Reference Number
#2 Firstname, Surname, Date of Birth
#3 Firstname, Surname, Email Address
If I find one single exact match in the existing Table, assume the record is the same and do an UPDATE.
If after trying each of the 3 rules above, I do not find a match (or I find multiple matches), then I perform an INSERT.
This is working well and it's keeping our database clean.
However... this is very slow. Especially given that there are triggers on the tables that perform data updates too (I can't remove these just yet... that is under discussion with software company)
For 100 or maybe 5000 records. It's okay and works.
However, as the first dataload they passed us over 600,000 records and this has taken nearly a week to run through. (I personally believe the de-dupe rules are the biggest hinderence).
I have come to the conclusion that at the point of adding de-dupe rules to any mass import radically complicates the process (and code) and has a dramatic impact on performance.
Therefore I am asking for help and advice on how to import such data into SQL (2005 / 2008) databases, maintaining multiple de-dupe rules.
Surely there has to be a better way. I know that processing one record at a time is not optimal, but I don't know how else I can de-dupe (against itself, and against existing data) with multiple rules.
I appreciate any advice given in this thread. I hope my explanation is clear and detailed enough.
Edit: I tried set this to a higher point value, considering the complexity. However EE have imposed a limit due to SQL tags.