Narusegawa

asked on

Mass Imports + De-dupe + Efficiency

This is a question about practices and methods rather than code itself.

I have used SQL for a long time now and can perform a variety of tasks. However, one task has been bothering me: how best to import data?

I've read up on BCP and know that bulk inserts are best for performance, and I have also tried SSIS and DTS for importing data. However, I am still failing to find a balance that meets all of my requirements.

My current process is something like this:
      1) A 3rd party calls my web service (nightly) and passes me records in XML (person data).
            - This data can contain a contact, and may also contain multiple address records, as child nodes in the XML.
            
      2) I clean up some of the records in VB.net first (e.g. fix invalid dates).
      
      3) I then offload each record node to a Stored Procedure in the database.
            Person1 -> ImportContactRecord
            Person1 Address1 -> ImportAddressRecord
            Person1 Address2 -> ImportAddressRecord
            
      Within each Stored Procedure are a number of de-duplication rules. For example
            #1 Reference Number
            #2 Firstname, Surname, Date of Birth
            #3 Firstname, Surname, Email Address
            
      If I find exactly one match in the existing table, I assume the record is the same and do an UPDATE.
      
      If, after trying each of the 3 rules above, I do not find a match (or I find multiple matches), then I perform an INSERT.
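
To give an idea of the shape of these procedures, the matching logic inside ImportContactRecord is roughly as follows (a simplified sketch only; the table and column names here are illustrative, not the real schema):

-- Simplified sketch of the rule-by-rule matching inside ImportContactRecord.
-- Table and column names are illustrative only.
CREATE PROCEDURE dbo.ImportContactRecord
	@ReferenceNumber VARCHAR(20),
	@Firstname       VARCHAR(50),
	@Surname         VARCHAR(50),
	@DateOfBirth     DATETIME,
	@EmailAddress    VARCHAR(100)
AS
BEGIN
	DECLARE @Matches INT, @MatchID INT;

	-- Rule #1: Reference Number
	SELECT @Matches = COUNT(*), @MatchID = MIN(ContactID)
	FROM dbo.Contact
	WHERE ReferenceNumber = @ReferenceNumber;

	-- Rule #2: Firstname, Surname, Date of Birth
	IF @Matches <> 1
		SELECT @Matches = COUNT(*), @MatchID = MIN(ContactID)
		FROM dbo.Contact
		WHERE Firstname = @Firstname AND Surname = @Surname AND DateOfBirth = @DateOfBirth;

	-- Rule #3: Firstname, Surname, Email Address
	IF @Matches <> 1
		SELECT @Matches = COUNT(*), @MatchID = MIN(ContactID)
		FROM dbo.Contact
		WHERE Firstname = @Firstname AND Surname = @Surname AND EmailAddress = @EmailAddress;

	IF @Matches = 1
		-- Exactly one match: assume it is the same person and update it
		UPDATE dbo.Contact
		SET Firstname = @Firstname, Surname = @Surname, EmailAddress = @EmailAddress
		WHERE ContactID = @MatchID;
	ELSE
		-- No match (or multiple matches): insert a new record
		INSERT INTO dbo.Contact (ReferenceNumber, Firstname, Surname, DateOfBirth, EmailAddress)
		VALUES (@ReferenceNumber, @Firstname, @Surname, @DateOfBirth, @EmailAddress);
END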
      
This is working well and it's keeping our database clean.

However... this is very slow, especially given that there are triggers on the tables that perform further data updates (I can't remove these just yet; that is under discussion with the software company).

For 100 or maybe 5,000 records it's okay and works.

However, for the first data load they passed us over 600,000 records, and this has taken nearly a week to run through. (I personally believe the de-dupe rules are the biggest hindrance.)

I have come to the conclusion that adding de-dupe rules to any mass import radically complicates the process (and the code) and has a dramatic impact on performance.

Therefore I am asking for help and advice on how to import such data into SQL Server (2005/2008) databases while maintaining multiple de-dupe rules.

Surely there has to be a better way. I know that processing one record at a time is not optimal, but I don't know how else I can de-dupe (against itself, and against existing data) with multiple rules.

I appreciate any advice given in this thread. I hope my explanation is clear and detailed enough.

Edit: I tried to set this to a higher point value, considering the complexity; however EE has imposed a limit due to the SQL tags.
ASKER CERTIFIED SOLUTION
JoeNuvo
Narusegawa

ASKER

Similar to one of my ideas, but I hadn't yet tried it. I wanted to get others' advice and opinions first :-)

Would you recommend the dummy table be a real one... or a temporary table?

Also, when updating data back into the real table, would you do this via a cursor, or a bulk update based on a join? (I'm guessing the latter)
Using a cursor, you will face huge delays because of the triggers again.
Updating all at once, or adding a filter condition so it runs in batches, should be better.
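
For example, something along these lines (a sketch only; the staging table and column names are examples, so adjust to your schema):

-- Sketch only: example staging/real table and column names.
-- One set-based statement means the triggers fire once per statement,
-- instead of once per row as they would inside a cursor loop.
UPDATE c
SET	c.Firstname    = s.Firstname,
	c.Surname      = s.Surname,
	c.EmailAddress = s.EmailAddress
FROM dbo.Contact c
INNER JOIN dbo.ImportStagingContact s
	ON s.MatchedContactID = c.ContactID
WHERE s.IsExists = 1;
-- or add an extra filter here (e.g. ranges of ContactID) to run it in smaller batches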
@JoeNuvo:

I like your methods; they seem very good and I will investigate them further.
When you talk about a 2nd dummy table, and inserting into it using a distinct list, I can only see one problem with this.

The record contains 50-60 columns, but only a combination of 3 of these (for example) determines a duplicate. There is a 'creation date' field though, and the newest record would be classed as the master.

Otherwise, everything you've said so far makes sense.

@lozzamoore:

Could you define XML shredding with a CLR a little more for me? Do you process each 'record' (with child nodes), or do you pass the entire 500,000-record XML set to the CLR? Sorry, I've never heard of shredding XML in the CLR before. It sounds very interesting though.
Sorry, it's probably my own colloquial term for extracting XML data into a relational table format.

The CLR uses the attached pseudocode structure (C#):

So the CLR does process each record, but this runs very fast.



// Requires: using System.Collections; using System.Data.SqlTypes; using System.Xml;
private static void UnpackXMLtoArray(SqlXml myXML, ArrayList rowsArray)
{
    // Load the incoming SqlXml into an in-memory DOM
    XmlReader myXMLReader = myXML.CreateReader();
    XmlDocument XMLDoc = new XmlDocument();
    XMLDoc.Load(myXMLReader); // DOM is loaded in memory now

    // Register the document's namespace so the XPath below can resolve the "p" prefix
    XmlNameTable nsTab = XMLDoc.NameTable;
    XmlNamespaceManager nsMgr = new XmlNamespaceManager(nsTab);
    nsMgr.AddNamespace("p", "http://www.web.com/yournamehere.xsd");

    // Navigate through the records and capture the data
    foreach (XmlNode item1 in XMLDoc.SelectNodes("/p:Data/p:item1", nsMgr))
    {
        // ... extract the required values from item1 (and its child nodes)
        //     and add them to rowsArray as a row
    }
}

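For reference, the same kind of shredding can also be expressed in pure T-SQL with the nodes()/value() methods (a sketch only, assuming a hypothetical document shape under the namespace used above):

-- Sketch only: hypothetical element names under the namespace used above.
DECLARE @x XML;
SET @x = N'<Data xmlns="http://www.web.com/yournamehere.xsd">
             <item1><Firstname>Jo</Firstname><Surname>Bloggs</Surname></item1>
           </Data>';

;WITH XMLNAMESPACES (DEFAULT 'http://www.web.com/yournamehere.xsd')
SELECT	i.value('(Firstname)[1]', 'VARCHAR(50)') AS Firstname,
	i.value('(Surname)[1]',   'VARCHAR(50)') AS Surname
FROM @x.nodes('/Data/item1') AS t(i);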

Thanks for the assistance, both of you. I feel I have enough knowledge to create an efficient import routine now, or at least enough to start the basics of one and design it.

I may have more questions regarding specifics later on, but those would be separate questions.

Thanks again!
@JoeNuvo

I am learning a lot here, and this way of working is clearly much better than my old cursor-based approaches.

The CTE code you provided is excellent; I have found it great. I have done a simple one on a Course table for now (fewer child tables involved than the person data has):

;WITH CTE AS (
SELECT
	IMPORT_GUID,
	ROW_NUMBER() OVER (PARTITION BY CODE ORDER BY TITLE DESC) RN
FROM ImportStagingCourse_1
)
SELECT *
FROM CTE
LEFT OUTER JOIN ImportStagingCourse_1 IM1 ON IM1.IMPORT_GUID = CTE.IMPORT_GUID
WHERE RN <> 1
ORDER BY TITLE DESC



This lists all of the duplicate records for me (I'm using SELECTs so I can see the process).

Is it possible to also get the IMPORT_GUID of the RN = 1 record?

i.e.
    IMPORT_GUID of Record to Keep |  IMPORT_GUID of Record to Delete

As columns?

Just thinking, I need the GUID of both, so that I can update foreign key relationships for child nodes.
try

;WITH CTE AS (
SELECT
	IMPORT_GUID,
	ROW_NUMBER() OVER (PARTITION BY CODE ORDER BY TITLE DESC) RN,
	DENSE_RANK() OVER (ORDER BY CODE) RK
FROM ImportStagingCourse_1
)
SELECT RN1.IMPORT_GUID RowToKeep, RN2.IMPORT_GUID RowToDelete
FROM CTE RN1
INNER JOIN CTE RN2 ON RN1.RK = RN2.RK
-- join with ImportStagingCourse_1 if you want to see the other fields
WHERE RN1.RN = 1
-- AND RN2.RN <> 1 -- uncomment this line if you only want to see the rows to delete

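
From those keep/delete pairs you can then re-point any child rows and remove the duplicates, something like this (a sketch; the child table and foreign key column names are examples only):

-- Sketch only: child table / FK column names are examples.
;WITH CTE AS (
SELECT
	IMPORT_GUID,
	ROW_NUMBER() OVER (PARTITION BY CODE ORDER BY TITLE DESC) RN,
	DENSE_RANK() OVER (ORDER BY CODE) RK
FROM ImportStagingCourse_1
), Pairs AS (
SELECT RN1.IMPORT_GUID RowToKeep, RN2.IMPORT_GUID RowToDelete
FROM CTE RN1
INNER JOIN CTE RN2 ON RN1.RK = RN2.RK
WHERE RN1.RN = 1 AND RN2.RN <> 1
)
UPDATE ch
SET ch.COURSE_IMPORT_GUID = p.RowToKeep	-- re-point children of each duplicate to the row being kept
FROM ImportStagingCourseChild_1 ch
INNER JOIN Pairs p ON p.RowToDelete = ch.COURSE_IMPORT_GUID;

-- then remove the duplicate parents themselves
;WITH CTE AS (
SELECT
	ROW_NUMBER() OVER (PARTITION BY CODE ORDER BY TITLE DESC) RN
FROM ImportStagingCourse_1
)
DELETE FROM CTE WHERE RN <> 1;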


This works great, and it's very quick too.

I have another question that you might be able to assist with; it relates to updating my temp tables with IDs from the master tables.

In your original answer this would be step 3, for when there are multiple records that match the condition:

 3) run an update on the dummy table by performing a join against the real table, marking IsExists = 1
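
As I understand it, that step looks something like this (my sketch, with example table/column names), but a plain join like this just picks an arbitrary master row when more than one matches:

-- My understanding of step 3 (sketch; example table/column names):
UPDATE s
SET	s.IsExists  = 1,
	s.ContactID = c.ContactID	-- pull the master table's key back into the staging row
FROM dbo.ImportStagingContact s
INNER JOIN dbo.Contact c
	ON c.Firstname = s.Firstname
	AND c.Surname = s.Surname
	AND c.DateOfBirth = s.DateOfBirth;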

Opportunity for more points :-)

https://www.experts-exchange.com/questions/27034389/Update-with-join.html