Narusegawa

asked on

Mass Imports + De-dupe + Efficiency

This is a question about practices and methods rather than code itself.

I have used SQL for a long time now and can perform a variety of tasks. However, one task has been bothering me: how best to import data?

I've read up on BCP and know that bulk inserts are best for performance, and I have also tried SSIS and DTS for importing data. However, I am still failing to find a balance that meets all of my requirements.

My current process is something like this:
      1) A 3rd party calls my web service (nightly) and passes me records in XML (person data).
            - This data can contain a contact, and may also contain multiple address records, as child nodes in the XML.
            
      2) I clean up some of the records in VB.net first (e.g. fix invalid dates).
      
      3) I then offload each record node to a Stored Procedure in the database.
            Person1 -> ImportContactRecord
            Person1 Address1 -> ImportAddressRecord
            Person1 Address2 -> ImportAddressRecord
            
      Within each Stored Procedure are a number of de-duplication rules. For example
            #1 Reference Number
            #2 Firstname, Surname, Date of Birth
            #3 Firstname, Surname, Email Address
            
      If I find exactly one match in the existing table, I assume the record is the same and do an UPDATE.
      
      If, after trying each of the 3 rules above, I do not find a match (or I find multiple matches), then I perform an INSERT.
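
To give an idea of the shape of these procedures, the matching logic inside ImportContactRecord is roughly as follows (a simplified sketch only; the table and column names here are illustrative, not the real schema):

-- Simplified sketch of the rule-by-rule matching inside ImportContactRecord.
-- Table and column names are illustrative only.
CREATE PROCEDURE dbo.ImportContactRecord
	@ReferenceNumber VARCHAR(20),
	@Firstname       VARCHAR(50),
	@Surname         VARCHAR(50),
	@DateOfBirth     DATETIME,
	@EmailAddress    VARCHAR(100)
AS
BEGIN
	DECLARE @Matches INT, @MatchID INT;

	-- Rule #1: Reference Number
	SELECT @Matches = COUNT(*), @MatchID = MIN(ContactID)
	FROM dbo.Contact
	WHERE ReferenceNumber = @ReferenceNumber;

	-- Rule #2: Firstname, Surname, Date of Birth
	IF @Matches <> 1
		SELECT @Matches = COUNT(*), @MatchID = MIN(ContactID)
		FROM dbo.Contact
		WHERE Firstname = @Firstname AND Surname = @Surname AND DateOfBirth = @DateOfBirth;

	-- Rule #3: Firstname, Surname, Email Address
	IF @Matches <> 1
		SELECT @Matches = COUNT(*), @MatchID = MIN(ContactID)
		FROM dbo.Contact
		WHERE Firstname = @Firstname AND Surname = @Surname AND EmailAddress = @EmailAddress;

	IF @Matches = 1
		-- Exactly one match: assume it is the same person and update it
		UPDATE dbo.Contact
		SET Firstname = @Firstname, Surname = @Surname, EmailAddress = @EmailAddress
		WHERE ContactID = @MatchID;
	ELSE
		-- No match (or multiple matches): insert a new record
		INSERT INTO dbo.Contact (ReferenceNumber, Firstname, Surname, DateOfBirth, EmailAddress)
		VALUES (@ReferenceNumber, @Firstname, @Surname, @DateOfBirth, @EmailAddress);
END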
      
This is working well and it's keeping our database clean.

However... this is very slow, especially given that there are triggers on the tables that perform further data updates (I can't remove these just yet; that is under discussion with the software company).

For 100 or maybe 5,000 records it's okay and works.

However, for the first data load they passed us over 600,000 records, and this has taken nearly a week to run through. (I personally believe the de-dupe rules are the biggest hindrance.)

I have come to the conclusion that adding de-dupe rules to any mass import radically complicates the process (and the code) and has a dramatic impact on performance.

Therefore I am asking for help and advice on how to import such data into SQL Server (2005/2008) databases while maintaining multiple de-dupe rules.

Surely there has to be a better way. I know that processing one record at a time is not optimal, but I don't know how else I can de-dupe (against itself, and against existing data) with multiple rules.

I appreciate any advice given in this thread. I hope my explanation is clear and detailed enough.

Edit: I tried to set this to a higher point value, considering the complexity; however EE has imposed a limit due to the SQL tags.
ASKER CERTIFIED SOLUTION
JoeNuvo
Narusegawa

ASKER

Similar to one of my ideas, but I hadn't yet tried it. I wanted to get others' advice and opinions first :-)

Would you recommend the dummy table be a real one... or a temporary table?

Also, when updating data back into the real table, would you do this via a cursor, or a bulk update based on a join? (I'm guessing the latter)
Using a cursor, you will face huge delays because of the triggers again.
Updating all at once, or adding a filter condition so it runs in batches, should be better.
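
For example, something along these lines (a sketch only; the staging table and column names are examples, so adjust to your schema):

-- Sketch only: example staging/real table and column names.
-- One set-based statement means the triggers fire once per statement,
-- instead of once per row as they would inside a cursor loop.
UPDATE c
SET	c.Firstname    = s.Firstname,
	c.Surname      = s.Surname,
	c.EmailAddress = s.EmailAddress
FROM dbo.Contact c
INNER JOIN dbo.ImportStagingContact s
	ON s.MatchedContactID = c.ContactID
WHERE s.IsExists = 1;
-- or add an extra filter here (e.g. ranges of ContactID) to run it in smaller batches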
@JoeNuvo:

I like your methods; they seem very good and I will investigate them further.
When you talk about a 2nd dummy table, and inserting into it using a distinct list, I can only see one problem with this.

The record contains 50-60 columns, but only a combination of 3 of these (for example) determines a duplicate. There is a 'creation date' field though, and the newest record would be classed as the master.

Otherwise, everything you've said so far makes sense.

@lozzamoore:

Could you define XML shredding with a CLR a little more for me? Do you process each 'record' (with child nodes), or do you pass the entire 500,000-record XML set to the CLR? Sorry, I've never heard of shredding XML in the CLR before. It sounds very interesting though.
Sorry, it's probably my own colloquial term for extracting XML data into a relational table format.

The CLR uses the attached pseudocode structure (C#):

So the CLR does process each record, but this runs very fast.



// Requires: using System.Collections; using System.Data.SqlTypes; using System.Xml;
private static void UnpackXMLtoArray(SqlXml myXML, ArrayList rowsArray)
{
    // Load the incoming SqlXml into an in-memory DOM
    XmlReader myXMLReader = myXML.CreateReader();
    XmlDocument XMLDoc = new XmlDocument();
    XMLDoc.Load(myXMLReader); // DOM is loaded in memory now

    // Register the document's namespace so the XPath below can resolve the "p" prefix
    XmlNameTable nsTab = XMLDoc.NameTable;
    XmlNamespaceManager nsMgr = new XmlNamespaceManager(nsTab);
    nsMgr.AddNamespace("p", "http://www.web.com/yournamehere.xsd");

    // Navigate through the records and capture the data
    foreach (XmlNode item1 in XMLDoc.SelectNodes("/p:Data/p:item1", nsMgr))
    {
        // ... extract the required values from item1 (and its child nodes)
        //     and add them to rowsArray as a row
    }
}

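For reference, the same kind of shredding can also be expressed in pure T-SQL with the nodes()/value() methods (a sketch only, assuming a hypothetical document shape under the namespace used above):

-- Sketch only: hypothetical element names under the namespace used above.
DECLARE @x XML;
SET @x = N'<Data xmlns="http://www.web.com/yournamehere.xsd">
             <item1><Firstname>Jo</Firstname><Surname>Bloggs</Surname></item1>
           </Data>';

;WITH XMLNAMESPACES (DEFAULT 'http://www.web.com/yournamehere.xsd')
SELECT	i.value('(Firstname)[1]', 'VARCHAR(50)') AS Firstname,
	i.value('(Surname)[1]',   'VARCHAR(50)') AS Surname
FROM @x.nodes('/Data/item1') AS t(i);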

Thanks for the assistance, both of you. I feel I have enough knowledge to create an efficient import routine now, or at least enough to start the basics of one and design it.

I may have more questions regarding specifics later on, but those would be separate questions.

Thanks again!
@JoeNuvo

I am learning a lot here, and this way of working is clearly much better than my old cursor-based approaches.

The CTE code you provided is excellent; I have found it great. I have done a simple one on a Course table for now (fewer child tables involved than the person data has):

;WITH CTE AS (
SELECT
	IMPORT_GUID,
	ROW_NUMBER() OVER (PARTITION BY CODE ORDER BY TITLE DESC) RN
FROM ImportStagingCourse_1
)
SELECT *
FROM CTE
LEFT OUTER JOIN ImportStagingCourse_1 IM1 ON IM1.IMPORT_GUID = CTE.IMPORT_GUID
WHERE RN <> 1
ORDER BY TITLE DESC



This lists all of the duplicate records for me (I'm using SELECTs so I can see the process).

Is it possible to also get the IMPORT_GUID of the RN = 1 record?

i.e.
    IMPORT_GUID of Record to Keep |  IMPORT_GUID of Record to Delete

As columns?

Just thinking, I need the GUID of both, so that I can update foreign key relationships for child nodes.
try

;WITH CTE AS (
SELECT
	IMPORT_GUID,
	ROW_NUMBER() OVER (PARTITION BY CODE ORDER BY TITLE DESC) RN,
	DENSE_RANK() OVER (ORDER BY CODE) RK
FROM ImportStagingCourse_1
)
SELECT RN1.IMPORT_GUID RowToKeep, RN2.IMPORT_GUID RowToDelete
FROM CTE RN1
INNER JOIN CTE RN2 ON RN1.RK = RN2.RK
-- join with ImportStagingCourse_1 if you want to see the other fields
WHERE RN1.RN = 1
-- AND RN2.RN <> 1 -- uncomment this line if you only want to see the rows to delete

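
From those keep/delete pairs you can then re-point any child rows and remove the duplicates, something like this (a sketch; the child table and foreign key column names are examples only):

-- Sketch only: child table / FK column names are examples.
;WITH CTE AS (
SELECT
	IMPORT_GUID,
	ROW_NUMBER() OVER (PARTITION BY CODE ORDER BY TITLE DESC) RN,
	DENSE_RANK() OVER (ORDER BY CODE) RK
FROM ImportStagingCourse_1
), Pairs AS (
SELECT RN1.IMPORT_GUID RowToKeep, RN2.IMPORT_GUID RowToDelete
FROM CTE RN1
INNER JOIN CTE RN2 ON RN1.RK = RN2.RK
WHERE RN1.RN = 1 AND RN2.RN <> 1
)
UPDATE ch
SET ch.COURSE_IMPORT_GUID = p.RowToKeep	-- re-point children of each duplicate to the row being kept
FROM ImportStagingCourseChild_1 ch
INNER JOIN Pairs p ON p.RowToDelete = ch.COURSE_IMPORT_GUID;

-- then remove the duplicate parents themselves
;WITH CTE AS (
SELECT
	ROW_NUMBER() OVER (PARTITION BY CODE ORDER BY TITLE DESC) RN
FROM ImportStagingCourse_1
)
DELETE FROM CTE WHERE RN <> 1;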


This works great, and it's very quick too.

I have another question that you might be able to assist with; it relates to updating my temp tables with IDs from the master tables.

In your original answer this would be step 3, for when there are multiple records that match the condition:

 3) run an update on the dummy table by performing a join against the real table, marking IsExists = 1
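
As I understand it, that step looks something like this (my sketch, with example table/column names), but a plain join like this just picks an arbitrary master row when more than one matches:

-- My understanding of step 3 (sketch; example table/column names):
UPDATE s
SET	s.IsExists  = 1,
	s.ContactID = c.ContactID	-- pull the master table's key back into the staging row
FROM dbo.ImportStagingContact s
INNER JOIN dbo.Contact c
	ON c.Firstname = s.Firstname
	AND c.Surname = s.Surname
	AND c.DateOfBirth = s.DateOfBirth;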

Opportunity for more points :-)

https://www.experts-exchange.com/questions/27034389/Update-with-join.html