Solved

Mass Imports + De-dupe + Efficiency

Posted on 2011-03-22
12
371 Views
Last Modified: 2012-08-14
This is a question more about practices/methods, than code itself.

I have used SQL for a long time now and can perform a variety of tasks. However, one task has been bothering me, and that is: "How best to import data?"

I've read up on BCP and know that bulk inserts are best for performance, and I have tried SSIS and DTS to import data too. However, I am still failing to find a harmonious balance of my requirements.

My current process is something like this:
      1) A 3rd party calls my WebService (nightly) and passes me records in XML (Person Data)
            - This data contains a contact, and may also contain multiple address records, as child nodes in the XML.
            
      2) I clean up some of the records in VB.net first (i.e. fix invalid dates)
      
      3) I then offload each record node to a Stored Procedure in the database.
            Person1 -> ImportContactRecord
            Person1 Address1 -> ImportAddressRecord
            Person1 Address2 -> ImportAddressRecord
            
      Within each Stored Procedure are a number of de-duplication rules. For example
            #1 Reference Number
            #2 Firstname, Surname, Date of Birth
            #3 Firstname, Surname, Email Address
            
      If I find one single exact match in the existing table, I assume the record is the same and do an UPDATE.
      
      If after trying each of the 3 rules above, I do not find a match (or I find multiple matches), then I perform an INSERT.
      
This is working well and it's keeping our database clean.

However... this is very slow. Especially given that there are triggers on the tables that perform data updates too (I can't remove these just yet; that is under discussion with the software company).

For 100 or maybe 5,000 records it's okay and works.

However, for the first data load they passed us over 600,000 records, and this has taken nearly a week to run through. (I personally believe the de-dupe rules are the biggest hindrance.)

I have come to the conclusion that adding de-dupe rules to any mass import radically complicates the process (and code) and has a dramatic impact on performance.

Therefore I am asking for help and advice on how to import such data into SQL (2005 / 2008) databases, maintaining multiple de-dupe rules.

Surely there has to be a better way. I know that processing one record at a time is not optimal, but I don't know how else I can de-dupe (against itself, and against existing data) with multiple rules.

I appreciate any advice given in this thread. I hope my explanation is clear and detailed enough.

Edit: I tried to set this to a higher point value, considering the complexity. However, EE has imposed a limit due to the SQL tags.
Question by: Narusegawa
12 Comments
 
Accepted Solution by: JoeNuvo (earned 375 total points)

For your process, I would perform something like this:

1. Create a dummy (staging) table with one extra field: IsExists bit default 0.
    Remark: the table should have enough indexes to perform well for task no. 3.
2. Insert all the data into the dummy table.
3. Run an update on the dummy table by performing a join against the real table, and mark IsExists = 1 for the rows that match.
4. Dump out the data which does not exist, and perform a bulk insert into the real table.
5. Update the data which already exists back into the real table.
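
A rough T-SQL sketch of those five steps, just to show the set-based shape (the table, column and index names are illustrative rather than the real schema, and it glosses over the "exactly one match" rule and the e-mail rule):

-- 1. Dummy/staging table with the extra IsExists flag (illustrative columns)
CREATE TABLE dbo.ImportStagingPerson (
    RefNumber  varchar(20),
    FirstName  varchar(50),
    Surname    varchar(50),
    DOB        datetime,
    Email      varchar(100),
    IsExists   bit NOT NULL DEFAULT 0
);
CREATE INDEX IX_ImportStagingPerson_Match
    ON dbo.ImportStagingPerson (RefNumber, FirstName, Surname, DOB);

-- 2. Bulk insert all incoming rows into the staging table (BULK INSERT / SSIS / SqlBulkCopy)

-- 3. Mark rows that already exist in the real table, one set-based UPDATE per de-dupe rule
UPDATE s SET IsExists = 1
FROM dbo.ImportStagingPerson s
INNER JOIN dbo.Person p ON p.RefNumber = s.RefNumber;

UPDATE s SET IsExists = 1
FROM dbo.ImportStagingPerson s
INNER JOIN dbo.Person p ON p.FirstName = s.FirstName
                       AND p.Surname   = s.Surname
                       AND p.DOB       = s.DOB
WHERE s.IsExists = 0;

-- 4. Rows that matched nothing are new: bulk insert them into the real table
INSERT INTO dbo.Person (RefNumber, FirstName, Surname, DOB, Email)
SELECT RefNumber, FirstName, Surname, DOB, Email
FROM dbo.ImportStagingPerson
WHERE IsExists = 0;

-- 5. Rows that matched are existing records: update them back in one set-based join
UPDATE p SET p.Email = s.Email
FROM dbo.Person p
INNER JOIN dbo.ImportStagingPerson s ON s.RefNumber = p.RefNumber
WHERE s.IsExists = 1;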

Author Comment by: Narusegawa

Similar to one of my ideas, but I hadn't yet tried it. I wanted to get others' advice and opinions first :-)

Would you recommend the dummy table be a real one... or a temporary table?

Also, when updating data back into the real table, would you do this via a cursor, or a bulk update based on a join? (I'm guessing the latter)

Assisted Solution by: JoeNuvo (earned 375 total points)

And since you mention "de-dupe against itself": you will also need a process to delete the duplicates within the dummy table itself.

One simple method is:
  - create a 2nd dummy table
  - insert into the 2nd dummy table: select distinct <field lists> from the 1st dummy table
and then you just use the 2nd dummy table to find what exists on the real table, etc.
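
For example, something along these lines (SELECT ... INTO creates the second table in one step; the column names are illustrative):

-- Keep only one row per distinct combination of fields from the first dummy table
SELECT DISTINCT RefNumber, FirstName, Surname, DOB, Email
INTO dbo.ImportStagingPerson_Distinct
FROM dbo.ImportStagingPerson;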

Assisted Solution by: lozzamoore (earned 125 total points)

Hi there.

I worked with a similar system recently, where we were loading large blocks of XML.

The approach we took was to bulk load the XML as-is straight from the source system using (in our case) a stored procedure taking an XML input. If it's easier, you could use the SSIS XML Source data pump, or a custom exe.

The stored proc then used a CLR function to "shred" the XML into a staging table. (We did try using the SQL 2005 built-in XML methods first, which functionally worked fine, but a CLR function turned out to be significantly faster.)
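
For comparison, a minimal sketch of what the built-in XML route looks like (the element, table and column names here are purely illustrative, not our actual schema):

CREATE PROCEDURE dbo.ImportPersonXml
    @xml xml
AS
BEGIN
    -- Shred each <Person> element of the incoming XML into the staging table
    INSERT INTO dbo.ImportStagingPerson (RefNumber, FirstName, Surname, DOB, Email)
    SELECT
        p.value('(RefNumber/text())[1]', 'varchar(20)'),
        p.value('(FirstName/text())[1]', 'varchar(50)'),
        p.value('(Surname/text())[1]',   'varchar(50)'),
        p.value('(DOB/text())[1]',       'datetime'),
        p.value('(Email/text())[1]',     'varchar(100)')
    FROM @xml.nodes('/People/Person') AS Records(p);
END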

All your de-dup processing and insert/update logic can now be done in an efficient set based way from the staging table to the final table.

I hope this helps.

Expert Comment by: JoeNuvo

Using a cursor, you will face huge delays because of the triggers again.
Updating everything at once, or using some filter condition to process it as batches, should be better.
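
For example, a batched version of the step-5 update might look something like this (it assumes the dummy table has been given an IDENTITY column, RowID, purely for batching; the batch size and column names are illustrative):

-- Walk the staging table in RowID ranges so each UPDATE (and its triggers) touches a limited batch
DECLARE @from int, @batchsize int;
SET @from = 0;
SET @batchsize = 5000;

WHILE EXISTS (SELECT 1 FROM dbo.ImportStagingPerson
              WHERE IsExists = 1 AND RowID > @from)
BEGIN
    UPDATE p SET p.Email = s.Email
    FROM dbo.Person p
    INNER JOIN dbo.ImportStagingPerson s ON s.RefNumber = p.RefNumber
    WHERE s.IsExists = 1
      AND s.RowID >  @from
      AND s.RowID <= @from + @batchsize;

    SET @from = @from + @batchsize;
END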

Author Comment by: Narusegawa

@JoeNuvo:

I like your methods; they seem very good and I will investigate them further.
When you talk about a 2nd dummy table, and inserting into it using a distinct list, I can only see one problem with this.

The record contains 50-60 columns, but only a combination of 3 of these (for example) determines a duplicate. There is a 'creation date' field though, and the newest record would be classed as the master.

Otherwise, everything you've said so far makes sense.

@lozzamoore:

Could you define XML shredding with a CLR a little more for me? Do you process each 'record' (with child nodes), or do you pass the entire 500,000-record XML set to the CLR? Sorry, I've never heard of shredding XML in CLR before. It sounds very interesting though.
 
Expert Comment by: lozzamoore

Sorry, it's probably my own colloquial term for extracting xml data into a relational table format.

The CLR function uses the attached pseudocode structure (C#):

So the CLR does process each record, but this runs very fast.

L


using System.Collections;
using System.Data.SqlTypes;
using System.Xml;

private static void UnpackXMLtoArray(SqlXml myXML, ArrayList rowsArray)
{
            // Load the incoming XML into a DOM so it can be walked with XPath
            XmlReader myXMLReader = myXML.CreateReader();
            XmlDocument XMLDoc = new XmlDocument();
            XMLDoc.Load(myXMLReader); // DOM is loaded in memory now

            // Register the document's namespace so SelectNodes can resolve the prefixed paths
            XmlNamespaceManager nsMgr = new XmlNamespaceManager(XMLDoc.NameTable);
            nsMgr.AddNamespace("r", "http://www.web.com/yournamehere.xsd");

            // Navigate through each record element and capture its data into rowsArray
            foreach (XmlNode item1 in XMLDoc.SelectNodes("/r:Data/r:item1", nsMgr))
            {
                 // .... read the child nodes and add a row to rowsArray
            }

}



Assisted Solution by: JoeNuvo (earned 375 total points)

If the "duplicate" condition met, the 'create date' will be uses to determine which one will be kept, is it?

in that case, we still able to uses SQL Query to delete the "old" duplicate from the dummy table.

Example code to delete the duplicate value, consider by FirstName, SurName and DOB
;WITH CTE AS (
SELECT
	[UniqueField],
	ROW_NUMBER() OVER (PARTITION BY FirstName, Surname, DOB ORDER BY createdate DESC) RN
FROM DummyTable
)
DELETE DummyTable
FROM CTE
WHERE DummyTable.[UniqueField] = CTE.[UniqueField]
AND RN <> 1


Condition: the dummy table must have some kind of UNIQUE field, for example an IDENTITY field.
If it doesn't have one, you need to create an extra identity field.
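
For example (the column name is illustrative):

-- Add a surrogate unique key to the dummy table if it doesn't already have one
ALTER TABLE DummyTable ADD RowID int IDENTITY(1,1) NOT NULL;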

Author Comment by: Narusegawa

Thanks for the assistance, both of you. I feel I have enough knowledge to create an efficient import routine now. Or at least, start the basics of one and design it.

I may have more questions regarding specifics later on, but those would be separate questions.

Thanks again!

Author Comment by: Narusegawa

@JoeNuvo

I am learning a lot here, and this way of working is clearly much better than my old cursor-based approaches.

The CTE code you provided is excellent. I have done a simple one on a Course table for now (fewer child tables are involved than the person data has):

;WITH CTE AS (
SELECT
	IMPORT_GUID,
	ROW_NUMBER() OVER (PARTITION BY CODE ORDER BY TITLE DESC) RN
FROM ImportStagingCourse_1
)
SELECT *
FROM CTE
LEFT OUTER JOIN ImportStagingCourse_1 IM1 ON IM1.IMPORT_GUID = CTE.IMPORT_GUID
WHERE RN <> 1
ORDER BY TITLE DESC



This lists all of the duplicate records for me (I'm using SELECTs so I can see the process).

Is it possible to also get the IMPORT_GUID of the RN = 1 record?

i.e.
    IMPORT_GUID of Record to Keep  |  IMPORT_GUID of Record to Delete

as columns?

Just thinking: I need the GUID of both, so that I can update the foreign key relationships for the child nodes.

Expert Comment by: JoeNuvo

Try:

;WITH CTE AS (
SELECT
	IMPORT_GUID,
	ROW_NUMBER() OVER (PARTITION BY CODE ORDER BY TITLE DESC) RN,
	DENSE_RANK() OVER (ORDER BY CODE) RK
FROM ImportStagingCourse_1
)
SELECT RN1.IMPORT_GUID RowToKeep, RN2.IMPORT_GUID RowToDelete
FROM CTE RN1
INNER JOIN CTE RN2 ON RN1.RK = RN2.RK
-- join with ImportStagingCourse_1 if you want to see other fields
WHERE RN1.RN = 1
-- AND RN2.RN <> 1 -- uncomment this line if you want to see only the rows to delete

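If it helps, those keep/delete pairs can also drive the foreign-key fix-up for the child records before you delete the duplicates, something along these lines (the child staging table and its column names are illustrative):

;WITH CTE AS (
SELECT
	IMPORT_GUID,
	ROW_NUMBER() OVER (PARTITION BY CODE ORDER BY TITLE DESC) RN,
	DENSE_RANK() OVER (ORDER BY CODE) RK
FROM ImportStagingCourse_1
), Pairs AS (
SELECT RN1.IMPORT_GUID KeepGuid, RN2.IMPORT_GUID DeleteGuid
FROM CTE RN1
INNER JOIN CTE RN2 ON RN1.RK = RN2.RK
WHERE RN1.RN = 1
AND RN2.RN <> 1
)
-- Repoint child rows from each duplicate parent to the parent being kept
UPDATE c
SET c.PARENT_IMPORT_GUID = p.KeepGuid
FROM ImportStagingCourseChild c
INNER JOIN Pairs p ON p.DeleteGuid = c.PARENT_IMPORT_GUID;
-- After this, the duplicate parents can be removed with the earlier ROW_NUMBER() delete.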



Author Comment by: Narusegawa

This works great, and very quickly too.

I have another question that you might be able to assist with; it is related to updating my temp tables with IDs from the master tables now.

In your original answer this would be step 3, when there are multiple records that match the condition:

 3. Run an update on the dummy table by performing a join against the real table, and mark IsExists = 1 for the rows that match.

Opportunity for more points :-)

http://www.experts-exchange.com/Microsoft/Development/MS-SQL-Server/SQL-Server-2005/Q_27034389.html