Solved

Mass Imports + De-dupe + Efficiency

Posted on 2011-03-22
Medium Priority
389 Views
Last Modified: 2012-08-14
This is a question more about practices/methods, than code itself.

I have used SQL for a long time now and can perform a variety of tasks. However, one task has been bothering me, and that is... "How best to import data?"

I've read up on BCP and know that bulk inserts are best for performance, and I have also tried SSIS and DTS to import data. However, I am still failing to find a harmonious balance between my requirements.

My current process is something like this:
      1) 3rd Party calls my WebService (nightly) and passes me records in XML (Person Data)
            - This data can contain a contact, and may also contain multiple address records, as child nodes in the XML.
            
      2) I clean up some of the records in VB.net first (i.e. fix invalid dates)
      
      3) I then offload each record node to a Stored Procedure in the database.
            Person1 -> ImportContactRecord
            Person1 Address1 -> ImportAddressRecord
            Person1 Address2 -> ImportAddressRecord
            
      Within each Stored Procedure are a number of de-duplication rules. For example
            #1 Reference Number
            #2 Firstname, Surname, Date of Birth
            #3 Firstname, Surname, Email Address
            
      If I find one single exact match in the existing table, I assume the record is the same and do an UPDATE.
      
      If after trying each of the 3 rules above, I do not find a match (or I find multiple matches), then I perform an INSERT.
      
This is working well and it's keeping our database clean.
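
To illustrate, here is a cut-down sketch of what one of these procedures looks like (table and column names are simplified here, not my real schema):

-- Cut-down sketch of ImportContactRecord; table/column names are illustrative only.
CREATE PROCEDURE ImportContactRecord
    @RefNumber varchar(20),
    @Firstname varchar(50),
    @Surname   varchar(50),
    @DOB       datetime,
    @Email     varchar(100)
AS
BEGIN
    DECLARE @MatchId int;

    -- Rule #1: Reference Number
    IF (SELECT COUNT(*) FROM Contact WHERE RefNumber = @RefNumber) = 1
        SELECT @MatchId = ContactId FROM Contact WHERE RefNumber = @RefNumber;

    -- Rule #2: Firstname, Surname, Date of Birth
    IF @MatchId IS NULL
       AND (SELECT COUNT(*) FROM Contact
            WHERE Firstname = @Firstname AND Surname = @Surname AND DOB = @DOB) = 1
        SELECT @MatchId = ContactId FROM Contact
        WHERE Firstname = @Firstname AND Surname = @Surname AND DOB = @DOB;

    -- Rule #3: Firstname, Surname, Email Address
    IF @MatchId IS NULL
       AND (SELECT COUNT(*) FROM Contact
            WHERE Firstname = @Firstname AND Surname = @Surname AND Email = @Email) = 1
        SELECT @MatchId = ContactId FROM Contact
        WHERE Firstname = @Firstname AND Surname = @Surname AND Email = @Email;

    -- One single exact match -> UPDATE, otherwise INSERT
    IF @MatchId IS NOT NULL
        UPDATE Contact
        SET Firstname = @Firstname, Surname = @Surname, DOB = @DOB, Email = @Email
        WHERE ContactId = @MatchId;
    ELSE
        INSERT INTO Contact (RefNumber, Firstname, Surname, DOB, Email)
        VALUES (@RefNumber, @Firstname, @Surname, @DOB, @Email);
END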

However... this is very slow, especially given that there are triggers on the tables that perform data updates too (I can't remove these just yet... that is under discussion with the software company).

For 100 or maybe 5,000 records it's okay and works.

However, for the first data load they passed us over 600,000 records, and this has taken nearly a week to run through. (I personally believe the de-dupe rules are the biggest hindrance.)

I have come to the conclusion that adding de-dupe rules to any mass import radically complicates the process (and code) and has a dramatic impact on performance.

Therefore I am asking for help and advice on how to import such data into SQL (2005 / 2008) databases, maintaining multiple de-dupe rules.

Surely there has to be a better way. I know that processing one record at a time is not optimal, but I don't know how else I can de-dupe (against itself, and against existing data) with multiple rules.

I appreciate any advice given in this thread. I hope my explanation is clear and detailed enough.

Edit: I tried to set this to a higher point value, considering the complexity. However, EE has imposed a limit due to the SQL tags.
Question by:Narusegawa
12 Comments
 
LVL 11

Accepted Solution

by:
JoeNuvo earned 1500 total points
ID: 35187938
For your process, I would do something like this:

1. Create a dummy (staging) table with one extra field, "IsExists bit default 0".
    Remark: the table should have enough indexes to perform well for step 3.
2. Insert all the data into the dummy table.
3. Run an update on the dummy table by joining it against the real table, marking IsExists = 1.
4. Dump out the data which does not exist yet and perform a bulk insert into the real table.
5. Update the data which already exists back into the real table.
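
A rough sketch of steps 3-5 in T-SQL (table and column names, including ExistingContactId, are only placeholders):

-- Step 3: flag staging rows that already exist in the real table
--         (rule #1, Reference Number, shown here; repeat for the other rules)
UPDATE s
SET    s.IsExists = 1,
       s.ExistingContactId = c.ContactId   -- remember which real row matched
FROM   StagingContact s
INNER JOIN Contact c ON c.RefNumber = s.RefNumber
WHERE  s.IsExists = 0;

-- Step 4: bulk insert the rows that do not exist yet
INSERT INTO Contact (RefNumber, Firstname, Surname, DOB, Email)
SELECT RefNumber, Firstname, Surname, DOB, Email
FROM   StagingContact
WHERE  IsExists = 0;

-- Step 5: one set-based update for the rows that already exist
UPDATE c
SET    c.Firstname = s.Firstname,
       c.Surname   = s.Surname,
       c.DOB       = s.DOB,
       c.Email     = s.Email
FROM   Contact c
INNER JOIN StagingContact s ON s.ExistingContactId = c.ContactId
WHERE  s.IsExists = 1;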
 

Author Comment

by:Narusegawa
ID: 35187964
Similar to one of my ideas, but I hadn't yet tried it. I wanted to get others' advice and opinions first :-)

Would you recommend the dummy table be a real one... or a temporary table?

Also, when updating data back into the real table, would you do this via a cursor, or a bulk update based on a join? (I'm guessing the latter)
 
LVL 11

Assisted Solution

by:JoeNuvo
JoeNuvo earned 1500 total points
ID: 35187968
And, since you mention "de-dupe against itself":
you will also need a process to delete the duplicates within the dummy table.

One simple method is:
  - create a 2nd dummy table
  - insert into the 2nd dummy table: select distinct <field list> from the 1st dummy table
Then you just use the 2nd dummy table to find what exists in the real table, etc.
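
Something like this (field names are only an example):

-- Collapse exact duplicates from the first dummy table into the second one
INSERT INTO DummyTable2 (RefNumber, Firstname, Surname, DOB, Email)
SELECT DISTINCT RefNumber, Firstname, Surname, DOB, Email
FROM DummyTable1;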
 
LVL 7

Assisted Solution

by:lozzamoore
lozzamoore earned 500 total points
ID: 35187970
Hi there.

I worked with a similar system recently, where we were loading large blocks of XML.

The approach we took was to bulk load the XML as-is straight from the source system using (in our case) a stored procedure taking an XML input. If it's easier, you could use the SSIS XML Source data pump, or a custom exe.

The stored proc then used a CLR function to "shred" the XML into a staging table. (We did try using the SQL 2005 built in XML methods first, which functionally worked fine, but a CLR function turned out to be significantly faster.)
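
For comparison, the built-in XML approach we tried first looked roughly like this (element and column names are invented for the example):

-- Shred the incoming XML parameter into a staging table using nodes()/value()
CREATE PROCEDURE LoadPersonXml
    @xml xml
AS
BEGIN
    INSERT INTO StagingContact (RefNumber, Firstname, Surname, DOB, Email)
    SELECT  p.value('(RefNumber/text())[1]', 'varchar(20)'),
            p.value('(Firstname/text())[1]', 'varchar(50)'),
            p.value('(Surname/text())[1]',   'varchar(50)'),
            p.value('(DOB/text())[1]',       'datetime'),
            p.value('(Email/text())[1]',     'varchar(100)')
    FROM    @xml.nodes('/Data/Person') AS x(p);
END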

All your de-dup processing and insert/update logic can now be done in an efficient set based way from the staging table to the final table.

I hope this helps.
 
LVL 11

Expert Comment

by:JoeNuvo
ID: 35187980
Using a cursor, you will face a huge delay because of the triggers again.
Updating everything at once, or adding some filter condition to run it in batches, should be better.
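
For example, you could walk the staging table in blocks by its identity key (batch size and names here are just placeholders):

-- Update in blocks of 10,000 rows so the triggers fire on smaller sets
DECLARE @from int, @max int;
SELECT @from = 1, @max = MAX(RowId) FROM StagingContact;

WHILE @from <= @max
BEGIN
    UPDATE c
    SET    c.Firstname = s.Firstname,
           c.Surname   = s.Surname
    FROM   Contact c
    INNER JOIN StagingContact s ON s.ExistingContactId = c.ContactId
    WHERE  s.IsExists = 1
      AND  s.RowId BETWEEN @from AND @from + 9999;

    SET @from = @from + 10000;
END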
 

Author Comment

by:Narusegawa
ID: 35188213
@JoeNuvo:

I like your methods, they seem very good and I will investigate these further.
When you talk about a 2nd Dummy table, and inserting to it using a distinct list, I can only see one problem with this.

The record contains 50-60 columns, but only a combination of 3 of these (for example) determines a duplicate. There is a 'creation date' field though, and the newest record would be classed as the master.

Otherwise, everything you've said so far makes sense.

@lozzamoore:

Could you define XML shredding with a CLR a little more for me? Do you process each 'record' (with child nodes), or do you pass an entire 500,000-record XML set to the CLR? Sorry, I've never heard of shredding XML in the CLR before. It sounds very interesting though.
 
LVL 7

Expert Comment

by:lozzamoore
ID: 35188554
Sorry, it's probably my own colloquial term for extracting xml data into a relational table format.

The CLR uses the attached pseudocode structure (c#):

So the CLR does process each record, but this runs very fast.

L


private static void UnpackXMLtoArray(SqlXml myXML, ArrayList rowsArray)
{
            // Loop through all the Value elements, loading the rowsArray
            XmlReader myXMLReader = myXML.CreateReader();
            XmlDocument XMLDoc = new XmlDocument();
            XMLDoc.Load(myXMLReader); // DOM is loaded in memory now

            XmlNameTable nsTab = XMLDoc.NameTable;
            XmlNamespaceManager nsMgr = new XmlNamespaceManager(nsTab);
            // register the same prefix ("p") that the XPath below uses
            nsMgr.AddNamespace("p", "http://www.web.com/yournamehere.xsd");

            // navigate through and capture data
            foreach (XmlNode item1 in XMLDoc.SelectNodes("/p:Data/p:item1", nsMgr))
            {
                // .... read the fields of item1 and add a row to rowsArray (omitted in the original)
            }
}

 
LVL 11

Assisted Solution

by:JoeNuvo
JoeNuvo earned 1500 total points
ID: 35195357
If the "duplicate" condition met, the 'create date' will be uses to determine which one will be kept, is it?

in that case, we still able to uses SQL Query to delete the "old" duplicate from the dummy table.

Example code to delete the duplicate value, consider by FirstName, SurName and DOB
;WITH CTE AS (
SELECT
	[UniqueField],
	ROW_NUMBER() OVER (PARTITION BY FirstName, Surname, DOB ORDER BY createdate DESC) RN
FROM DummyTable
)
DELETE DummyTable
FROM CTE
WHERE DummyTable.[UniqueField] = CTE.[UniqueField]
AND RN <> 1


Condition: the dummy table must have some kind of UNIQUE field, for example an IDENTITY field.
If it doesn't have one, you need to add an extra identity column.
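
For example (the column name is just an illustration):

-- Add an identity column if the dummy table has no unique key yet
ALTER TABLE DummyTable ADD RowId int IDENTITY(1,1) NOT NULL;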
 

Author Comment

by:Narusegawa
ID: 35205999
Thanks for the assistance, both of you. I feel I have enough knowledge to create an efficient import routine now. Or at least, start the basics of one and design it.

I may have more questions regarding specifics later on, but those would be separate questions.

Thanks again!
 

Author Comment

by:Narusegawa
ID: 35239696
@JoeNuvo

I am learning a lot here, and this way of working is clearly much better than my old cursor based approaches.

The CTE code you provided is excellent; I have found it very useful. I have done a simple one on a Course table for now (fewer child tables involved than the person data has):

;WITH CTE AS (
SELECT
	IMPORT_GUID,
	ROW_NUMBER() OVER (PARTITION BY CODE ORDER BY TITLE DESC) RN
FROM ImportStagingCourse_1
)
SELECT *
FROM CTE
LEFT OUTER JOIN ImportStagingCourse_1 IM1 ON IM1.IMPORT_GUID = CTE.IMPORT_GUID
WHERE RN <> 1
ORDER BY TITLE DESC



This lists all of the duplicate records for me (I'm using SELECTs so I can see the process).

Is it possible to also get the IMPORT_GUID of the RN = 1 record?

i.e.
    IMPORT_GUID of Record to Keep |  IMPORT_GUID of Record to Delete

As columns?

Just thinking, I need the GUID of both, so that I can update foreign key relationships for child nodes.
 
LVL 11

Expert Comment

by:JoeNuvo
ID: 35245546
try

;WITH CTE AS (
SELECT
	IMPORT_GUID,
	ROW_NUMBER() OVER (PARTITION BY CODE ORDER BY TITLE DESC) RN,
	DENSE_RANK() OVER (ORDER BY CODE) RK
FROM ImportStagingCourse_1
)
SELECT RN1.IMPORT_GUID RowToKeep, RN2.IMPORT_GUID RowToDelete
FROM CTE RN1
INNER JOIN CTE RN2 ON RN1.RK = RN2.RK
-- join with ImportStagingCourse_1 if you want to see the other fields
WHERE RN1.RN = 1
-- AND RN2.RN <> 1 -- uncomment this line if you want to see only the rows to delete

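Once you have the keep/delete pairs, you can also use them to re-point child rows to the record you are keeping before you delete the duplicates. A rough sketch, assuming a child staging table with a COURSE_GUID foreign key (the child table and column names are only a guess at your schema):

-- Re-point child rows from each duplicate parent to the row being kept
;WITH CTE AS (
	SELECT
		IMPORT_GUID,
		ROW_NUMBER() OVER (PARTITION BY CODE ORDER BY TITLE DESC) RN,
		DENSE_RANK() OVER (ORDER BY CODE) RK
	FROM ImportStagingCourse_1
),
Pairs AS (
	SELECT RN1.IMPORT_GUID RowToKeep, RN2.IMPORT_GUID RowToDelete
	FROM CTE RN1
	INNER JOIN CTE RN2 ON RN1.RK = RN2.RK
	WHERE RN1.RN = 1 AND RN2.RN <> 1
)
UPDATE c
SET    c.COURSE_GUID = Pairs.RowToKeep
FROM   ImportStagingCourseChild c
INNER JOIN Pairs ON Pairs.RowToDelete = c.COURSE_GUID;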


 

Author Comment

by:Narusegawa
ID: 35747250
This works great, and very quick too.

I have another question that you might be able to assist with; it is related to updating my tmp tables with IDs from the master tables now.

In your original answer this would be step 3, if there are multiple records that match the condition:

    3. Run an update on the dummy table by joining it against the real table, marking IsExists = 1

Opportunity for more points :-)

http://www.experts-exchange.com/Microsoft/Development/MS-SQL-Server/SQL-Server-2005/Q_27034389.html
