philfordham
asked on
How do I dedupe multiple database fields
I have a source table with the schema:
ContactName varchar
Tel1 varchar
Tel2 varchar
Tel3 varchar
DateToContact datetime
This contains around 5million records.
When the DateToCall is within the next 14 days I append the data into a table with the same schema. At the same time I also remove data where the DateToCall has passed. So the table only ever contains the data that needs to be called in the next 14 days.
When the data is appended I need to check whether the data in any of the 3 telephone fields is in any of the 3 fields that is already in the table. So that no telephone number appears more than once across all 3 fields.
I am currently using a while loop to do this, looking for duplicates before each insert, but its very inefficient. Its taking around (16 hours to complete each night).
I thought of using a table to hold a list of the numbers in the destination table, populated with the data in all 3 tel fields but this goes out of date as soon as record is appended, so I am back to procesing the data one row at a time to ensure there are no duplicates.
Has anyone got any better ideas?
The code I am currently using is as follows:
SET NOCOUNT ON
DECLARE @record_id INT
DECLARE @ContactName varchar(200)
DECLARE @tel1 varchar(200)
DECLARE @tel2 varchar(200)
DECLARE @tel3 varchar(200)
DECLARE @DateToContact datetime
DECLARE @record_count1 INT
DECLARE @record_count2 INT
DECLARE @record_count3 INT
IF OBJECT_ID('TempDB..#mydata ','U') IS NOT NULL BEGIN
DROP TABLE #mydata
END
CREATE TABLE #mydata (
id INT IDENTITY(1,1),
ContactName varchar(200) NULL,
tel1 varchar(200) NULL,
tel2 varchar(200) NULL,
tel3 varchar(200) NULL,
DateToContact DATETIME NULL,
CONSTRAINT [PK_mydata] PRIMARY KEY CLUSTERED ([id] ASC)
)
INSERT #mydata
(ContactName,
tel1,
tel2,
tel3,
DateToContact)
SELECT
ContactName,
tel1,
tel2,
tel3,
DateToContact
FROM
dbo.SourceTable
WHERE
DATEDIFF(dd,GETDATE(),Date ToContact) < 15 AND
(tel1 IS NOT NULL OR tel2 IS NOT NULL OR tel3 IS NOT null)
DECLARE @counter INT
DECLARE @idcount int
DECLARE @name varchar(40)
DECLARE @sql varchar(max)
SET @counter = 1
SELECT @idcount = MAX(id)FROM #mydata
WHILE @counter <= @idcount BEGIN
SELECT
@ContactName = ContactName,
@tel1 = tel1,
@tel2 = tel2,
@tel3 = tel3,
@DateToContact = DateToContact
FROM
#mydata
WHERE
id = @counter
SET @record_count1 = 0
SET @record_count2 = 0
SET @record_count3 = 0
SELECT
@record_count1 = COUNT(source_record_id)
FROM
DestinationTable
WHERE
(tel1 IN (@tel1, @tel2, @tel3))
SELECT
@record_count2 = COUNT(source_record_id)
FROM
DestinationTable
WHERE
(tel2 IN (@tel1, @tel2, @tel3))
SELECT
@record_count3 = COUNT(source_record_id)
FROM
DestinationTable
WHERE
(tel3 IN (@tel1, @tel2, @tel3))
IF @record_count1 + @record_count2 + @record_count3 = 0 BEGIN
PRINT 'Inserting no ' + @tel1
INSERT
DestinationTable
(ContactName,
tel1,
tel2,
tel3,
DateToContact)
VALUES
(@ContactName,
@tel1,
@tel2,
@tel3,
@DateToContact)
END
SET @counter = @counter + 1
END
DROP TABLE #mydata
SET NOCOUNT OFF
ContactName varchar
Tel1 varchar
Tel2 varchar
Tel3 varchar
DateToContact datetime
This contains around 5million records.
When the DateToCall is within the next 14 days I append the data into a table with the same schema. At the same time I also remove data where the DateToCall has passed. So the table only ever contains the data that needs to be called in the next 14 days.
When the data is appended I need to check whether the data in any of the 3 telephone fields is in any of the 3 fields that is already in the table. So that no telephone number appears more than once across all 3 fields.
I am currently using a while loop to do this, looking for duplicates before each insert, but its very inefficient. Its taking around (16 hours to complete each night).
I thought of using a table to hold a list of the numbers in the destination table, populated with the data in all 3 tel fields but this goes out of date as soon as record is appended, so I am back to procesing the data one row at a time to ensure there are no duplicates.
Has anyone got any better ideas?
The code I am currently using is as follows:
SET NOCOUNT ON
DECLARE @record_id INT
DECLARE @ContactName varchar(200)
DECLARE @tel1 varchar(200)
DECLARE @tel2 varchar(200)
DECLARE @tel3 varchar(200)
DECLARE @DateToContact datetime
DECLARE @record_count1 INT
DECLARE @record_count2 INT
DECLARE @record_count3 INT
IF OBJECT_ID('TempDB..#mydata
DROP TABLE #mydata
END
CREATE TABLE #mydata (
id INT IDENTITY(1,1),
ContactName varchar(200) NULL,
tel1 varchar(200) NULL,
tel2 varchar(200) NULL,
tel3 varchar(200) NULL,
DateToContact DATETIME NULL,
CONSTRAINT [PK_mydata] PRIMARY KEY CLUSTERED ([id] ASC)
)
INSERT #mydata
(ContactName,
tel1,
tel2,
tel3,
DateToContact)
SELECT
ContactName,
tel1,
tel2,
tel3,
DateToContact
FROM
dbo.SourceTable
WHERE
DATEDIFF(dd,GETDATE(),Date
(tel1 IS NOT NULL OR tel2 IS NOT NULL OR tel3 IS NOT null)
DECLARE @counter INT
DECLARE @idcount int
DECLARE @name varchar(40)
DECLARE @sql varchar(max)
SET @counter = 1
SELECT @idcount = MAX(id)FROM #mydata
WHILE @counter <= @idcount BEGIN
SELECT
@ContactName = ContactName,
@tel1 = tel1,
@tel2 = tel2,
@tel3 = tel3,
@DateToContact = DateToContact
FROM
#mydata
WHERE
id = @counter
SET @record_count1 = 0
SET @record_count2 = 0
SET @record_count3 = 0
SELECT
@record_count1 = COUNT(source_record_id)
FROM
DestinationTable
WHERE
(tel1 IN (@tel1, @tel2, @tel3))
SELECT
@record_count2 = COUNT(source_record_id)
FROM
DestinationTable
WHERE
(tel2 IN (@tel1, @tel2, @tel3))
SELECT
@record_count3 = COUNT(source_record_id)
FROM
DestinationTable
WHERE
(tel3 IN (@tel1, @tel2, @tel3))
IF @record_count1 + @record_count2 + @record_count3 = 0 BEGIN
PRINT 'Inserting no ' + @tel1
INSERT
DestinationTable
(ContactName,
tel1,
tel2,
tel3,
DateToContact)
VALUES
(@ContactName,
@tel1,
@tel2,
@tel3,
@DateToContact)
END
SET @counter = @counter + 1
END
DROP TABLE #mydata
SET NOCOUNT OFF
If you have #mydata prepared then one command should be enough:
INSERT INTO DestinationTable (ContactName, tel1, tel2, tel3, DateToContact)
SELECT ContactName, tel1, tel2, tel3, DateToContact
FROM #mydata
WHERE (tel1 IS NULL OR
tel1 NOT IN (SELECT tel1 FROM DestinationTable WHERE tel1 IS NOT NULL
UNION ALL SELECT tel2 FROM DestinationTable WHERE tel2 IS NOT NULL
UNION ALL SELECT tel3 FROM DestinationTable WHERE tel3 IS NOT NULL))
AND (tel2 IS NULL OR
tel2 NOT IN (SELECT tel1 FROM DestinationTable WHERE tel1 IS NOT NULL
UNION ALL SELECT tel2 FROM DestinationTable WHERE tel2 IS NOT NULL
UNION ALL SELECT tel3 FROM DestinationTable WHERE tel3 IS NOT NULL))
AND (tel3 IS NULL OR
tel3 NOT IN (SELECT tel1 FROM DestinationTable WHERE tel1 IS NOT NULL
UNION ALL SELECT tel2 FROM DestinationTable WHERE tel2 IS NOT NULL
UNION ALL SELECT tel3 FROM DestinationTable WHERE tel3 IS NOT NULL))
Here is what I was talking about, but you can have a look at Pavel's suggestion as it probably works with what you have already.
-- step 1: delete data in destination that is outdated
--select *
delete
from dbo.contacts_destination
where DateToContact < DATEDIFF(day, 0, getdate())
;
-- step 2: insert new data
with d
as
(
-- 2a: get list of existing phone numbers;
-- step 1 took care of removing those that are not valid.
select Tel
from dbo.contacts_destination
unpivot (
Tel
for TelColumn
in (tel1, tel2, tel3)
) upvt
)
, s
as
(
-- 2b: get list of new phone numbers that meet criteria,
-- but also don't exist in the list of existing numbers.
select ContactName, Tel, TelColumn, DateToContact
from (
select *
from dbo.contacts_source
where DateToContact >= DATEDIFF(day, 0, getdate())
and DateToContact < DATEDIFF(day, 0, getdate())+14
) tbl
unpivot (
Tel
for TelColumn
in (tel1, tel2, tel3)
) upvt
where not exists (select 1 from d where d.Tel = upvt.Tel)
)
-- 2c: final selection, pivoting back to 5 column format for insert
insert into dbo.contacts_destination
select ContactName, tel1, tel2, tel3, DateToContact
from s
pivot (
max(Tel)
for TelColumn
in (tel1, tel2, tel3)
) pvt
;
You can replace contacts_xxx with your appropriate table names.
Hope that helps.
--isa
-- step 1: delete data in destination that is outdated
--select *
delete
from dbo.contacts_destination
where DateToContact < DATEDIFF(day, 0, getdate())
;
-- step 2: insert new data
with d
as
(
-- 2a: get list of existing phone numbers;
-- step 1 took care of removing those that are not valid.
select Tel
from dbo.contacts_destination
unpivot (
Tel
for TelColumn
in (tel1, tel2, tel3)
) upvt
)
, s
as
(
-- 2b: get list of new phone numbers that meet criteria,
-- but also don't exist in the list of existing numbers.
select ContactName, Tel, TelColumn, DateToContact
from (
select *
from dbo.contacts_source
where DateToContact >= DATEDIFF(day, 0, getdate())
and DateToContact < DATEDIFF(day, 0, getdate())+14
) tbl
unpivot (
Tel
for TelColumn
in (tel1, tel2, tel3)
) upvt
where not exists (select 1 from d where d.Tel = upvt.Tel)
)
-- 2c: final selection, pivoting back to 5 column format for insert
insert into dbo.contacts_destination
select ContactName, tel1, tel2, tel3, DateToContact
from s
pivot (
max(Tel)
for TelColumn
in (tel1, tel2, tel3)
) pvt
;
You can replace contacts_xxx with your appropriate table names.
Hope that helps.
--isa
Of course, you have to remove duplicities from #mydata as the first step...
Exactly and in case I forgot to show that above, that was the point of unpivoting so that you had one column on which to group distinct by telephone number or use row_number() over() to select the ContactName and/or DateToContact you want if a duplicate is found.
Hmm goood Kevin, I supposed UNPIVOT is some theory or idea but you posted complete solution...
Yeah, I had time on my hands and couldn't help but test out my theory and this was working code -- up to group by which I removed so I could ensure everything was working right. :)
ASKER
Thanks for the comments guys. I will test out your suggestions this morning. I have a feeling that the problem with Pavel's solution will be removing the duplicates from #mydata before the insert. A simple select distinct doesn't work because there are duplicates like this:
Contact,Tel1,Tel2,Tel3
Jim,123,456,789
Jack,789,456,123
I only want to insert one of the above records. I am only interested in making sure there are no duplicate telephone numbers.
I am definitely using SQl server 2005 so I can try out the UNPIVOT theory as well.
Contact,Tel1,Tel2,Tel3
Jim,123,456,789
Jack,789,456,123
I only want to insert one of the above records. I am only interested in making sure there are no duplicate telephone numbers.
I am definitely using SQl server 2005 so I can try out the UNPIVOT theory as well.
Above duplicity is let say clear but what should be removed in the following case:
Contact,Tel1,Tel2,Tel3
Jim,123,888,999
Jack,444,555,123
Is one duplicity enough to remove one of them or not?
Also another question rised: How do you count the DateToContact in duplicate entries? Does it have some impact to removal decision?
Contact,Tel1,Tel2,Tel3
Jim,123,888,999
Jack,444,555,123
Is one duplicity enough to remove one of them or not?
Also another question rised: How do you count the DateToContact in duplicate entries? Does it have some impact to removal decision?
Here is updated example with dedupe method I mentioned using row_number().
-- step 2: insert new data
with d
as
(
-- 2a: get list of existing phone numbers;
-- step 1 took care of removing those that are not valid.
select Tel
from dbo.contacts_destination
unpivot (
Tel
for TelColumn
in (tel1, tel2, tel3)
) upvt
)
, s
as
(
-- 2b: get list of new phone numbers that meet criteria,
-- but also don't exist in the list of existing numbers.
select ContactName, Tel, TelColumn, DateToContact
from (
select *
from dbo.contacts_source
where DateToContact >= DATEDIFF(day, 0, getdate())
and DateToContact < DATEDIFF(day, 0, getdate())+14
) tbl
unpivot (
Tel
for TelColumn
in (tel1, tel2, tel3)
) upvt
where not exists (select 1 from d where d.Tel = upvt.Tel)
)
, s_dedupas( -- 2c: dedupe the new data. select ContactName, Tel, TelColumn, DateToContact from ( select *, ROW_NUMBER() OVER(PARTITION BY Tel ORDER BY DateToContact) rn from s ) derived where rn = 1)
-- 2d: final selection, pivoting back to 5 column format for insert
insert into dbo.contacts_destination
select ContactName, tel1, tel2, tel3, DateToContact
from s_dedup
pivot (
max(Tel)
for TelColumn
in (tel1, tel2, tel3)
) pvt
;
For a better explanation of the dedupe technique, see this article:
Analytical SQL : Where do you rank? - http:A_1555.html
The Tel can be treated like the sales rep in the article and the top 1 customer is the one with the earliest DateToContact. So back in your terms, you are ranking the telephone number found by its contact date. If a later contact ends up with no phone numbers, then they won't be imported thus eliminating unneeded rows. Otherwise, you should see the subsequent rows have NULLs where that telephone number already appeared in another contact.
Hope that makes sense.
--isa
-- step 2: insert new data
with d
as
(
-- 2a: get list of existing phone numbers;
-- step 1 took care of removing those that are not valid.
select Tel
from dbo.contacts_destination
unpivot (
Tel
for TelColumn
in (tel1, tel2, tel3)
) upvt
)
, s
as
(
-- 2b: get list of new phone numbers that meet criteria,
-- but also don't exist in the list of existing numbers.
select ContactName, Tel, TelColumn, DateToContact
from (
select *
from dbo.contacts_source
where DateToContact >= DATEDIFF(day, 0, getdate())
and DateToContact < DATEDIFF(day, 0, getdate())+14
) tbl
unpivot (
Tel
for TelColumn
in (tel1, tel2, tel3)
) upvt
where not exists (select 1 from d where d.Tel = upvt.Tel)
)
, s_dedupas( -- 2c: dedupe the new data. select ContactName, Tel, TelColumn, DateToContact from ( select *, ROW_NUMBER() OVER(PARTITION BY Tel ORDER BY DateToContact) rn from s ) derived where rn = 1)
-- 2d: final selection, pivoting back to 5 column format for insert
insert into dbo.contacts_destination
select ContactName, tel1, tel2, tel3, DateToContact
from s_dedup
pivot (
max(Tel)
for TelColumn
in (tel1, tel2, tel3)
) pvt
;
For a better explanation of the dedupe technique, see this article:
Analytical SQL : Where do you rank? - http:A_1555.html
The Tel can be treated like the sales rep in the article and the top 1 customer is the one with the earliest DateToContact. So back in your terms, you are ranking the telephone number found by its contact date. If a later contact ends up with no phone numbers, then they won't be imported thus eliminating unneeded rows. Otherwise, you should see the subsequent rows have NULLs where that telephone number already appeared in another contact.
Hope that makes sense.
--isa
That is just my understanding. As you can tell by Pavel's questions that there can be other interpretations.
My results would look like this:
Contact,Tel1,Tel2,Tel3
Jim,123,888,999
Jack,444,555,NULL
My results would look like this:
Contact,Tel1,Tel2,Tel3
Jim,123,888,999
Jack,444,555,NULL
ASKER
My curent (very slow) solution would remove the entire row from the example:
Contact,Tel1,Tel2,Tel3
Jim,123,456,789
Jack,789,456,123
The first record inserted would be the one that makes it into the table. I hadn't thought about just replacing the duplicate numbers with NULL. I am not sure whether the customer would prefer to keep them I will have to ask.
I have tested both of your original suggestions but both seem to result in there being duplicates in the table.
Kevin's updated solution works perfectly as far as I can tell, I just wish I understood why. I will run some test against the full data set as the only issue I have found is that it seems to slow up considerably as the amount of data in the destination table increases. I'll be back to let you know what I find out. Thanks for all the input.
Contact,Tel1,Tel2,Tel3
Jim,123,456,789
Jack,789,456,123
The first record inserted would be the one that makes it into the table. I hadn't thought about just replacing the duplicate numbers with NULL. I am not sure whether the customer would prefer to keep them I will have to ask.
I have tested both of your original suggestions but both seem to result in there being duplicates in the table.
Kevin's updated solution works perfectly as far as I can tell, I just wish I understood why. I will run some test against the full data set as the only issue I have found is that it seems to slow up considerably as the amount of data in the destination table increases. I'll be back to let you know what I find out. Thanks for all the input.
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
What to say? Kevin's solution is perfect.
ASKER
Great solution, with a good explanation and some interesting ideas
Are you definitely using SQL Server 2005?
I have a thought that you could do the following:
+Remove records that don't meet the criteria of 14 days stated in question.
+Select current records that meet the criteria (can use a table variable or WITH - common table expression - to avoid temp table) and UNPIVOT so you get ContactName, DateToCall and Tel columns with three rows for each contact representing each number.
+UNPIVOT the existing data.
+Select the new data that does not exist in the unpivoted version of existing data and then PIVOT back for insert into table.
This would work with SQL Server 2005, so please confirm that and which pieces if any you need an example of.
Regards,
--isa