Link to home
Start Free TrialLog in
Avatar of philfordham
philfordhamFlag for United Kingdom of Great Britain and Northern Ireland

asked on

How do I dedupe multiple database fields

I have a source table with the schema:
ContactName varchar
Tel1 varchar
Tel2 varchar
Tel3 varchar
DateToContact datetime

This contains around 5million records.
When the DateToCall is within the next 14 days I append the data into a table with the same schema. At the same time I also remove data where the DateToCall has passed. So the table only ever contains the data that needs to be called in the next 14 days.

When the data is appended I need to check whether the data in any of the 3 telephone fields is in any of the 3 fields that is already in the table. So that no telephone number appears more than once across all 3 fields.

I am currently using a while loop to do this, looking for duplicates before each insert, but its very inefficient. Its taking around (16 hours to complete each night).

I thought of using a table to hold a list of the numbers in the destination table, populated with the data in all 3 tel fields but this goes out of date as soon as record is appended, so I am back to procesing the data one row at a time to ensure there are no duplicates.

Has anyone got any better ideas?

The code I am currently using is as follows:

SET NOCOUNT ON

DECLARE @record_id INT
DECLARE @ContactName varchar(200)
DECLARE @tel1 varchar(200)
DECLARE @tel2 varchar(200)
DECLARE @tel3 varchar(200)
DECLARE @DateToContact datetime
DECLARE @record_count1 INT
DECLARE @record_count2 INT
DECLARE @record_count3 INT

IF OBJECT_ID('TempDB..#mydata','U') IS NOT NULL BEGIN
      DROP TABLE #mydata
END

CREATE TABLE #mydata (
      id INT IDENTITY(1,1),
      ContactName varchar(200) NULL,
      tel1 varchar(200) NULL,
      tel2 varchar(200) NULL,
      tel3 varchar(200) NULL,
      DateToContact DATETIME NULL,
      CONSTRAINT [PK_mydata] PRIMARY KEY CLUSTERED ([id] ASC)      
)

INSERT #mydata
      (ContactName,
      tel1,
      tel2,
      tel3,
      DateToContact)
      
SELECT
      ContactName,
      tel1,
      tel2,
      tel3,
      DateToContact
FROM
      dbo.SourceTable
WHERE
      DATEDIFF(dd,GETDATE(),DateToContact) < 15 AND
      (tel1 IS NOT NULL OR tel2 IS NOT NULL OR tel3 IS NOT null)

DECLARE @counter INT
DECLARE @idcount int
DECLARE @name varchar(40)
DECLARE @sql varchar(max)

SET @counter = 1
SELECT @idcount = MAX(id)FROM #mydata

WHILE @counter <= @idcount BEGIN

      SELECT
            @ContactName = ContactName,
            @tel1 = tel1,
            @tel2 = tel2,
            @tel3 = tel3,
            @DateToContact = DateToContact
      FROM
            #mydata
      WHERE
            id = @counter
      
      SET @record_count1 = 0
      SET @record_count2 = 0
      SET @record_count3 = 0
      
      SELECT
            @record_count1 = COUNT(source_record_id)
      FROM
            DestinationTable
      WHERE
            (tel1 IN (@tel1, @tel2, @tel3))
            
      SELECT
            @record_count2 = COUNT(source_record_id)
      FROM
            DestinationTable
      WHERE
            (tel2 IN (@tel1, @tel2, @tel3))
            
      SELECT
            @record_count3 = COUNT(source_record_id)
      FROM
            DestinationTable
      WHERE
            (tel3 IN (@tel1, @tel2, @tel3))      
            
      IF @record_count1 + @record_count2 + @record_count3 = 0 BEGIN      
            PRINT 'Inserting no ' + @tel1
            INSERT
                  DestinationTable
                  (ContactName,
                  tel1,
                  tel2,
                  tel3,
                  DateToContact)
            VALUES
                  (@ContactName,
                  @tel1,
                  @tel2,
                  @tel3,
                  @DateToContact)
      END
      SET @counter = @counter + 1
END

DROP TABLE #mydata

SET NOCOUNT OFF

Avatar of Kevin Cross
Kevin Cross
Flag of United States of America image

philfordham,

Are you definitely using SQL Server 2005?

I have a thought that you could do the following:

+Remove records that don't meet the criteria of 14 days stated in question.
+Select current records that meet the criteria (can use a table variable or WITH - common table expression - to avoid temp table) and UNPIVOT so you get ContactName, DateToCall and Tel columns with three rows for each contact representing each number.
+UNPIVOT the existing data.
+Select the new data that does not exist in the unpivoted version of existing data and then PIVOT back for insert into table.

This would work with SQL Server 2005, so please confirm that and which pieces if any you need an example of.

Regards,

--isa
If you have #mydata  prepared then one command should be enough:
INSERT INTO DestinationTable (ContactName, tel1, tel2, tel3, DateToContact)
     SELECT ContactName, tel1, tel2, tel3, DateToContact
       FROM #mydata 
      WHERE (tel1 IS NULL OR
             tel1 NOT IN (SELECT tel1 FROM DestinationTable WHERE tel1 IS NOT NULL
                          UNION ALL SELECT tel2 FROM DestinationTable WHERE tel2 IS NOT NULL
                          UNION ALL SELECT tel3 FROM DestinationTable WHERE tel3 IS NOT NULL))
        AND (tel2 IS NULL OR
             tel2 NOT IN (SELECT tel1 FROM DestinationTable WHERE tel1 IS NOT NULL
                          UNION ALL SELECT tel2 FROM DestinationTable WHERE tel2 IS NOT NULL
                          UNION ALL SELECT tel3 FROM DestinationTable WHERE tel3 IS NOT NULL))
        AND (tel3 IS NULL OR
             tel3 NOT IN (SELECT tel1 FROM DestinationTable WHERE tel1 IS NOT NULL
                          UNION ALL SELECT tel2 FROM DestinationTable WHERE tel2 IS NOT NULL
                          UNION ALL SELECT tel3 FROM DestinationTable WHERE tel3 IS NOT NULL))

Open in new window

Here is what I was talking about, but you can have a look at Pavel's suggestion as it probably works with what you have already.

-- step 1: delete data in destination that is outdated
--select *
delete
from dbo.contacts_destination
where DateToContact < DATEDIFF(day, 0, getdate())
;

-- step 2: insert new data
with d
as
(
   -- 2a: get list of existing phone numbers;
   -- step 1 took care of removing those that are not valid.
   select Tel
   from dbo.contacts_destination
   unpivot (
      Tel
      for TelColumn
      in (tel1, tel2, tel3)
   ) upvt
)
, s
as
(
   -- 2b: get list of new phone numbers that meet criteria,
   -- but also don't exist in the list of existing numbers.
   select ContactName, Tel, TelColumn, DateToContact
   from (
      select *
      from dbo.contacts_source
      where DateToContact >= DATEDIFF(day, 0, getdate())
      and DateToContact < DATEDIFF(day, 0, getdate())+14
   ) tbl
   unpivot (
      Tel
      for TelColumn
      in (tel1, tel2, tel3)
   ) upvt
   where not exists (select 1 from d where d.Tel = upvt.Tel)
)
-- 2c: final selection, pivoting back to 5 column format for insert
insert into dbo.contacts_destination
select ContactName, tel1, tel2, tel3, DateToContact
from s        
pivot (
   max(Tel)
   for TelColumn
   in (tel1, tel2, tel3)
) pvt
;

You can replace contacts_xxx with your appropriate table names.

Hope that helps.

--isa
Of course, you have to remove duplicities from #mydata as the first step...
Exactly and in case I forgot to show that above, that was the point of unpivoting so that you had one column on which to group distinct by telephone number or use row_number() over() to select the ContactName and/or DateToContact you want if a duplicate is found.
Hmm goood Kevin, I supposed UNPIVOT is some theory or idea but you posted complete solution...
Yeah, I had time on my hands and couldn't help but test out my theory and this was working code -- up to group by which I removed so I could ensure everything was working right. :)
Avatar of philfordham

ASKER

Thanks for the comments guys. I will test out your suggestions this morning. I have a feeling that the problem with Pavel's solution will be removing the duplicates from #mydata before the insert. A simple select distinct doesn't work because there are duplicates like this:

Contact,Tel1,Tel2,Tel3
Jim,123,456,789
Jack,789,456,123

I only want to insert one of the above records. I am only interested in making sure there are no duplicate telephone numbers.

I am definitely using SQl server 2005 so I can try out the UNPIVOT theory as well.
Above duplicity is let say clear but what should be removed in the following case:

Contact,Tel1,Tel2,Tel3
Jim,123,888,999
Jack,444,555,123

Is one duplicity enough to remove one of them or not?

Also another question rised: How do you count the DateToContact in duplicate entries? Does it have some impact to removal decision?
Here is updated example with dedupe method I mentioned using row_number().

-- step 2: insert new data
with d
as
(
  -- 2a: get list of existing phone numbers;
  -- step 1 took care of removing those that are not valid.
  select Tel
  from dbo.contacts_destination
  unpivot (
     Tel
     for TelColumn
     in (tel1, tel2, tel3)
  ) upvt
)
, s
as
(
  -- 2b: get list of new phone numbers that meet criteria,
  -- but also don't exist in the list of existing numbers.
  select ContactName, Tel, TelColumn, DateToContact
  from (
     select *
      from dbo.contacts_source
     where DateToContact >= DATEDIFF(day, 0, getdate())
     and DateToContact < DATEDIFF(day, 0, getdate())+14
  ) tbl
  unpivot (
     Tel
     for TelColumn
     in (tel1, tel2, tel3)
  ) upvt
  where not exists (select 1 from d where d.Tel = upvt.Tel)
)
, s_dedupas(   -- 2c: dedupe the new data.   select ContactName, Tel, TelColumn, DateToContact   from (      select *, ROW_NUMBER() OVER(PARTITION BY Tel ORDER BY DateToContact) rn      from s   ) derived   where rn = 1)
-- 2d: final selection, pivoting back to 5 column format for insert
insert into dbo.contacts_destination
select ContactName, tel1, tel2, tel3, DateToContact
from s_dedup
pivot (
  max(Tel)
  for TelColumn
  in (tel1, tel2, tel3)
) pvt
;

For a better explanation of the dedupe technique, see this article:

Analytical SQL : Where do you rank? - http:A_1555.html

The Tel can be treated like the sales rep in the article and the top 1 customer is the one with the earliest DateToContact.  So back in your terms, you are ranking the telephone number found by its contact date.  If a later contact ends up with no phone numbers, then they won't be imported thus eliminating unneeded rows.  Otherwise, you should see the subsequent rows have NULLs where that telephone number already appeared in another contact.

Hope that makes sense.

--isa
That is just my understanding.  As you can tell by Pavel's questions that there can be other interpretations.

My results would look like this:

Contact,Tel1,Tel2,Tel3
Jim,123,888,999
Jack,444,555,NULL
My curent (very slow) solution would remove the entire row from the example:

Contact,Tel1,Tel2,Tel3
Jim,123,456,789
Jack,789,456,123

The first record inserted would be the one that makes it into the table. I hadn't thought about just replacing the duplicate numbers with NULL. I am not sure whether the customer would prefer to keep them I will have to ask.

I have tested both of your original suggestions but both seem to result in there being duplicates in the table.

Kevin's updated solution works perfectly as far as I can tell, I just wish I understood why. I will run some test against the full data set as the only issue I have found is that it seems to slow up considerably as the amount of data in the destination table increases. I'll be back to let you know what I find out. Thanks for all the input.
ASKER CERTIFIED SOLUTION
Avatar of Kevin Cross
Kevin Cross
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
What to say?  Kevin's solution is perfect.
Great solution, with a good explanation and some interesting ideas