?
Solved

How do I dedupe multiple database fields

Posted on 2010-01-12
15
Medium Priority
?
736 Views
Last Modified: 2012-05-08
I have a source table with the schema:
ContactName varchar
Tel1 varchar
Tel2 varchar
Tel3 varchar
DateToContact datetime

This contains around 5million records.
When the DateToCall is within the next 14 days I append the data into a table with the same schema. At the same time I also remove data where the DateToCall has passed. So the table only ever contains the data that needs to be called in the next 14 days.

When the data is appended I need to check whether the data in any of the 3 telephone fields is in any of the 3 fields that is already in the table. So that no telephone number appears more than once across all 3 fields.

I am currently using a while loop to do this, looking for duplicates before each insert, but its very inefficient. Its taking around (16 hours to complete each night).

I thought of using a table to hold a list of the numbers in the destination table, populated with the data in all 3 tel fields but this goes out of date as soon as record is appended, so I am back to procesing the data one row at a time to ensure there are no duplicates.

Has anyone got any better ideas?

The code I am currently using is as follows:

SET NOCOUNT ON

DECLARE @record_id INT
DECLARE @ContactName varchar(200)
DECLARE @tel1 varchar(200)
DECLARE @tel2 varchar(200)
DECLARE @tel3 varchar(200)
DECLARE @DateToContact datetime
DECLARE @record_count1 INT
DECLARE @record_count2 INT
DECLARE @record_count3 INT

IF OBJECT_ID('TempDB..#mydata','U') IS NOT NULL BEGIN
      DROP TABLE #mydata
END

CREATE TABLE #mydata (
      id INT IDENTITY(1,1),
      ContactName varchar(200) NULL,
      tel1 varchar(200) NULL,
      tel2 varchar(200) NULL,
      tel3 varchar(200) NULL,
      DateToContact DATETIME NULL,
      CONSTRAINT [PK_mydata] PRIMARY KEY CLUSTERED ([id] ASC)      
)

INSERT #mydata
      (ContactName,
      tel1,
      tel2,
      tel3,
      DateToContact)
      
SELECT
      ContactName,
      tel1,
      tel2,
      tel3,
      DateToContact
FROM
      dbo.SourceTable
WHERE
      DATEDIFF(dd,GETDATE(),DateToContact) < 15 AND
      (tel1 IS NOT NULL OR tel2 IS NOT NULL OR tel3 IS NOT null)

DECLARE @counter INT
DECLARE @idcount int
DECLARE @name varchar(40)
DECLARE @sql varchar(max)

SET @counter = 1
SELECT @idcount = MAX(id)FROM #mydata

WHILE @counter <= @idcount BEGIN

      SELECT
            @ContactName = ContactName,
            @tel1 = tel1,
            @tel2 = tel2,
            @tel3 = tel3,
            @DateToContact = DateToContact
      FROM
            #mydata
      WHERE
            id = @counter
      
      SET @record_count1 = 0
      SET @record_count2 = 0
      SET @record_count3 = 0
      
      SELECT
            @record_count1 = COUNT(source_record_id)
      FROM
            DestinationTable
      WHERE
            (tel1 IN (@tel1, @tel2, @tel3))
            
      SELECT
            @record_count2 = COUNT(source_record_id)
      FROM
            DestinationTable
      WHERE
            (tel2 IN (@tel1, @tel2, @tel3))
            
      SELECT
            @record_count3 = COUNT(source_record_id)
      FROM
            DestinationTable
      WHERE
            (tel3 IN (@tel1, @tel2, @tel3))      
            
      IF @record_count1 + @record_count2 + @record_count3 = 0 BEGIN      
            PRINT 'Inserting no ' + @tel1
            INSERT
                  DestinationTable
                  (ContactName,
                  tel1,
                  tel2,
                  tel3,
                  DateToContact)
            VALUES
                  (@ContactName,
                  @tel1,
                  @tel2,
                  @tel3,
                  @DateToContact)
      END
      SET @counter = @counter + 1
END

DROP TABLE #mydata

SET NOCOUNT OFF

0
Comment
Question by:philfordham
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 7
  • 5
  • 3
15 Comments
 
LVL 60

Expert Comment

by:Kevin Cross
ID: 26298415
philfordham,

Are you definitely using SQL Server 2005?

I have a thought that you could do the following:

+Remove records that don't meet the criteria of 14 days stated in question.
+Select current records that meet the criteria (can use a table variable or WITH - common table expression - to avoid temp table) and UNPIVOT so you get ContactName, DateToCall and Tel columns with three rows for each contact representing each number.
+UNPIVOT the existing data.
+Select the new data that does not exist in the unpivoted version of existing data and then PIVOT back for insert into table.

This would work with SQL Server 2005, so please confirm that and which pieces if any you need an example of.

Regards,

--isa
0
 
LVL 42

Expert Comment

by:pcelba
ID: 26298797
If you have #mydata  prepared then one command should be enough:
INSERT INTO DestinationTable (ContactName, tel1, tel2, tel3, DateToContact)
     SELECT ContactName, tel1, tel2, tel3, DateToContact
       FROM #mydata 
      WHERE (tel1 IS NULL OR
             tel1 NOT IN (SELECT tel1 FROM DestinationTable WHERE tel1 IS NOT NULL
                          UNION ALL SELECT tel2 FROM DestinationTable WHERE tel2 IS NOT NULL
                          UNION ALL SELECT tel3 FROM DestinationTable WHERE tel3 IS NOT NULL))
        AND (tel2 IS NULL OR
             tel2 NOT IN (SELECT tel1 FROM DestinationTable WHERE tel1 IS NOT NULL
                          UNION ALL SELECT tel2 FROM DestinationTable WHERE tel2 IS NOT NULL
                          UNION ALL SELECT tel3 FROM DestinationTable WHERE tel3 IS NOT NULL))
        AND (tel3 IS NULL OR
             tel3 NOT IN (SELECT tel1 FROM DestinationTable WHERE tel1 IS NOT NULL
                          UNION ALL SELECT tel2 FROM DestinationTable WHERE tel2 IS NOT NULL
                          UNION ALL SELECT tel3 FROM DestinationTable WHERE tel3 IS NOT NULL))

Open in new window

0
 
LVL 60

Expert Comment

by:Kevin Cross
ID: 26298918
Here is what I was talking about, but you can have a look at Pavel's suggestion as it probably works with what you have already.

-- step 1: delete data in destination that is outdated
--select *
delete
from dbo.contacts_destination
where DateToContact < DATEDIFF(day, 0, getdate())
;

-- step 2: insert new data
with d
as
(
   -- 2a: get list of existing phone numbers;
   -- step 1 took care of removing those that are not valid.
   select Tel
   from dbo.contacts_destination
   unpivot (
      Tel
      for TelColumn
      in (tel1, tel2, tel3)
   ) upvt
)
, s
as
(
   -- 2b: get list of new phone numbers that meet criteria,
   -- but also don't exist in the list of existing numbers.
   select ContactName, Tel, TelColumn, DateToContact
   from (
      select *
      from dbo.contacts_source
      where DateToContact >= DATEDIFF(day, 0, getdate())
      and DateToContact < DATEDIFF(day, 0, getdate())+14
   ) tbl
   unpivot (
      Tel
      for TelColumn
      in (tel1, tel2, tel3)
   ) upvt
   where not exists (select 1 from d where d.Tel = upvt.Tel)
)
-- 2c: final selection, pivoting back to 5 column format for insert
insert into dbo.contacts_destination
select ContactName, tel1, tel2, tel3, DateToContact
from s        
pivot (
   max(Tel)
   for TelColumn
   in (tel1, tel2, tel3)
) pvt
;

You can replace contacts_xxx with your appropriate table names.

Hope that helps.

--isa
0
Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
LVL 42

Expert Comment

by:pcelba
ID: 26298940
Of course, you have to remove duplicities from #mydata as the first step...
0
 
LVL 60

Expert Comment

by:Kevin Cross
ID: 26299005
Exactly and in case I forgot to show that above, that was the point of unpivoting so that you had one column on which to group distinct by telephone number or use row_number() over() to select the ContactName and/or DateToContact you want if a duplicate is found.
0
 
LVL 42

Expert Comment

by:pcelba
ID: 26299023
Hmm goood Kevin, I supposed UNPIVOT is some theory or idea but you posted complete solution...
0
 
LVL 60

Expert Comment

by:Kevin Cross
ID: 26299089
Yeah, I had time on my hands and couldn't help but test out my theory and this was working code -- up to group by which I removed so I could ensure everything was working right. :)
0
 

Author Comment

by:philfordham
ID: 26301176
Thanks for the comments guys. I will test out your suggestions this morning. I have a feeling that the problem with Pavel's solution will be removing the duplicates from #mydata before the insert. A simple select distinct doesn't work because there are duplicates like this:

Contact,Tel1,Tel2,Tel3
Jim,123,456,789
Jack,789,456,123

I only want to insert one of the above records. I am only interested in making sure there are no duplicate telephone numbers.

I am definitely using SQl server 2005 so I can try out the UNPIVOT theory as well.
0
 
LVL 42

Expert Comment

by:pcelba
ID: 26305059
Above duplicity is let say clear but what should be removed in the following case:

Contact,Tel1,Tel2,Tel3
Jim,123,888,999
Jack,444,555,123

Is one duplicity enough to remove one of them or not?

Also another question rised: How do you count the DateToContact in duplicate entries? Does it have some impact to removal decision?
0
 
LVL 60

Expert Comment

by:Kevin Cross
ID: 26305573
Here is updated example with dedupe method I mentioned using row_number().

-- step 2: insert new data
with d
as
(
  -- 2a: get list of existing phone numbers;
  -- step 1 took care of removing those that are not valid.
  select Tel
  from dbo.contacts_destination
  unpivot (
     Tel
     for TelColumn
     in (tel1, tel2, tel3)
  ) upvt
)
, s
as
(
  -- 2b: get list of new phone numbers that meet criteria,
  -- but also don't exist in the list of existing numbers.
  select ContactName, Tel, TelColumn, DateToContact
  from (
     select *
      from dbo.contacts_source
     where DateToContact >= DATEDIFF(day, 0, getdate())
     and DateToContact < DATEDIFF(day, 0, getdate())+14
  ) tbl
  unpivot (
     Tel
     for TelColumn
     in (tel1, tel2, tel3)
  ) upvt
  where not exists (select 1 from d where d.Tel = upvt.Tel)
)
, s_dedupas(   -- 2c: dedupe the new data.   select ContactName, Tel, TelColumn, DateToContact   from (      select *, ROW_NUMBER() OVER(PARTITION BY Tel ORDER BY DateToContact) rn      from s   ) derived   where rn = 1)
-- 2d: final selection, pivoting back to 5 column format for insert
insert into dbo.contacts_destination
select ContactName, tel1, tel2, tel3, DateToContact
from s_dedup
pivot (
  max(Tel)
  for TelColumn
  in (tel1, tel2, tel3)
) pvt
;

For a better explanation of the dedupe technique, see this article:

Analytical SQL : Where do you rank? - http:A_1555.html

The Tel can be treated like the sales rep in the article and the top 1 customer is the one with the earliest DateToContact.  So back in your terms, you are ranking the telephone number found by its contact date.  If a later contact ends up with no phone numbers, then they won't be imported thus eliminating unneeded rows.  Otherwise, you should see the subsequent rows have NULLs where that telephone number already appeared in another contact.

Hope that makes sense.

--isa
0
 
LVL 60

Expert Comment

by:Kevin Cross
ID: 26305598
That is just my understanding.  As you can tell by Pavel's questions that there can be other interpretations.

My results would look like this:

Contact,Tel1,Tel2,Tel3
Jim,123,888,999
Jack,444,555,NULL
0
 

Author Comment

by:philfordham
ID: 26307127
My curent (very slow) solution would remove the entire row from the example:

Contact,Tel1,Tel2,Tel3
Jim,123,456,789
Jack,789,456,123

The first record inserted would be the one that makes it into the table. I hadn't thought about just replacing the duplicate numbers with NULL. I am not sure whether the customer would prefer to keep them I will have to ask.

I have tested both of your original suggestions but both seem to result in there being duplicates in the table.

Kevin's updated solution works perfectly as far as I can tell, I just wish I understood why. I will run some test against the full data set as the only issue I have found is that it seems to slow up considerably as the amount of data in the destination table increases. I'll be back to let you know what I find out. Thanks for all the input.
0
 
LVL 60

Accepted Solution

by:
Kevin Cross earned 2000 total points
ID: 26307456
Sorry I tried to explain each bit, but I understand it is a bit thick...

-- 2a: get list of existing phone numbers;
  -- step 1 took care of removing those that are not valid.
  select Tel
  from dbo.contacts_destination
  unpivot (
     Tel
     for TelColumn
     in (tel1, tel2, tel3)
  ) upvt

This is the piece that is slowing this down most likely combined with the:

 where not exists (select 1 from d where d.Tel = upvt.Tel)

What these portions of the code is trying to do is get the destination data to have all the telephone numbers in one column so that we can use a not exists statement on it; however, this is probably not taking advantage of indexing too well.

Maybe try without unpivoting the destination.

-- step 2: insert new data
with s
as
(
  -- 2b: get list of new phone numbers that meet criteria,
  -- but also don't exist in the list of existing numbers.
  select ContactName, Tel, TelColumn, DateToContact
  from (
     select *
     from dbo.contacts_source
     where DateToContact >= DATEDIFF(day, 0, getdate())
     and DateToContact < DATEDIFF(day, 0, getdate())+14
  ) tbl
  unpivot (
     Tel
     for TelColumn
     in (tel1, tel2, tel3)
  ) upvt
  where not exists (select 1 from dbo.contacts_destination where Tel1 = upvt.Tel) and not exists (select 1 from dbo.contacts_destination where Tel2 = upvt.Tel) and not exists (select 1 from dbo.contacts_destination where Tel3 = upvt.Tel)
)
, s_dedup
as
(
   -- 2c: dedupe the new data.
   select ContactName, Tel, TelColumn, DateToContact
   from (
      select *, ROW_NUMBER() OVER(PARTITION BY Tel ORDER BY DateToContact) rn
      from s
   ) derived
   where rn = 1
)
-- 2d: final selection, pivoting back to 5 column format for insert
insert into dbo.contacts_destination
select ContactName, tel1, tel2, tel3, DateToContact
from s_dedup
pivot (
  max(Tel)
  for TelColumn
  in (tel1, tel2, tel3)
) pvt
;

What this is trying to do is utilize the indexes on the Tel1-Tel3 columns which you can add if don't already exist and this should help.

And for explanation sake, I will reiterate what I am doing:
+Getting source data filtered to the next 14 days in derived table aliased as "tbl".
+Using UNPIVOT to get all three telephone numbers from "tbl" into one column to simplify the de-duplication process.  This may be unnecessary, but how I thought it would work best.
+With the data in one column, I then use NOT EXISTS to ensure that the number doesn't appear in destination table in any telx column.
+With the data still in one column, using ranking function I am getting top 1 occurrence of each telephone number (de-dupe).
+Final selection/insert pivots data back to your structure of 5 columns.


Regards,

--isa
0
 
LVL 42

Expert Comment

by:pcelba
ID: 26307533
What to say?  Kevin's solution is perfect.
0
 

Author Closing Comment

by:philfordham
ID: 31675972
Great solution, with a good explanation and some interesting ideas
0

Featured Post

Efficient way to get backups off site to Azure

This user guide provides instructions on how to deploy and configure both a StoneFly Scale Out NAS Enterprise Cloud Drive virtual machine and Veeam Cloud Connect in the Microsoft Azure Cloud.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

What if you have to shut down the entire Citrix infrastructure for hardware maintenance, software upgrades or "the unknown"? I developed this plan for "the unknown" and hope that it helps you as well. This article explains how to properly shut down …
In this article, we’ll look at how to deploy ProxySQL.
This videos aims to give the viewer a basic demonstration of how a user can query current session information by using the SYS_CONTEXT function
In this video, Percona Director of Solution Engineering Jon Tobin discusses the function and features of Percona Server for MongoDB. How Percona can help Percona can help you determine if Percona Server for MongoDB is the right solution for …
Suggested Courses

777 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question