DavidNPD
asked on
Remove Duplicate rows in Sql
I need a little help with a sql script
I have a huge table of addresses with numerous duplicates, I need a script that will look at the table and if it see's more than one row with identical data it will delete any of the extra rows and only leave one row of that data.
My table name is "Person_Address", I want the script to look at the "master_name_key","house_n umber" and "Street" field. If there is more than one row with the identical three peices of data (or identical nulls) in all three fields then delete all the extras leaving me just one.
I have a huge table of addresses with numerous duplicates, I need a script that will look at the table and if it see's more than one row with identical data it will delete any of the extra rows and only leave one row of that data.
My table name is "Person_Address", I want the script to look at the "master_name_key","house_n
I can see two approaches.
(1) Create a statement that finds and deletes one duplicate. Loop thru it repeatedly until no more duplicates exist. For example:
SET ROWCOUNT 1
WHILE 1 = 1
BEGIN
DELETE Person_Address
FROM Person_Address
INNER JOIN
(SELECT TOP 1 master_key_name, house_number, street
FROM Person_Address
GROUP BY master_key_name, house_number, street
HAVING COUNT(*) > 1)
AS dup ON person_address.master_key_ name = dup.master_key_name AND
person_address.house_numbe r = dup.house_number AND
person_address.street = dup.street
IF @@ROWCOUNT = 0
BREAK
END --WHILE
(2) Create a cursor to find the duplicates. Fetch thru the cursor deleting the duplicates. For example:
DECLARE csrDupAddr CURSOR FOR
SELECT master_key_name, house_number, street, COUNT(*)
FROM Person_Address
GROUP BY master_key_name, house_number, street
HAVING COUNT(*) > 1
FOR READ ONLY
DECLARE @master_key_name VARCHAR(??) --match to orig col
DECLARE @house_number VARCHAR(??) --match to orig col
DECLARE @street VARCHAR(??) --match to orig col
(1) Create a statement that finds and deletes one duplicate. Loop thru it repeatedly until no more duplicates exist. For example:
SET ROWCOUNT 1
WHILE 1 = 1
BEGIN
DELETE Person_Address
FROM Person_Address
INNER JOIN
(SELECT TOP 1 master_key_name, house_number, street
FROM Person_Address
GROUP BY master_key_name, house_number, street
HAVING COUNT(*) > 1)
AS dup ON person_address.master_key_
person_address.house_numbe
person_address.street = dup.street
IF @@ROWCOUNT = 0
BREAK
END --WHILE
(2) Create a cursor to find the duplicates. Fetch thru the cursor deleting the duplicates. For example:
DECLARE csrDupAddr CURSOR FOR
SELECT master_key_name, house_number, street, COUNT(*)
FROM Person_Address
GROUP BY master_key_name, house_number, street
HAVING COUNT(*) > 1
FOR READ ONLY
DECLARE @master_key_name VARCHAR(??) --match to orig col
DECLARE @house_number VARCHAR(??) --match to orig col
DECLARE @street VARCHAR(??) --match to orig col
Oops, here is the full version of (2):
(2)
DECLARE csrDupAddr CURSOR FOR
SELECT master_key_name, house_number, street, COUNT(*)
FROM Person_Address
GROUP BY master_key_name, house_number, street
HAVING COUNT(*) > 1
FOR READ ONLY
DECLARE @master_key_name VARCHAR(5) --match to orig col
DECLARE @house_number VARCHAR(5) --match to orig col
DECLARE @street VARCHAR(5) --match to orig col
DECLARE @num_dups INT
FETCH NEXT FROM csrDupAddr INTO @master_key_name, @house_number, @street, @num_dups
WHILE @@FETCH_STATUS <> -1
BEGIN
SET @num_dups = @num_dups - 1 --delete all dups except 1
SET ROWCOUNT @num_dups
DELETE FROM Person_Address
WHERE master_key_name = @master_key_name
AND house_number = @house_number
AND street = @street
FETCH NEXT FROM csrDupAddr INTO @master_key_name, @house_number, @street, @num_dups
END --WHILE
If this is a very large table and/or there are a lot of duplicates, the second method will probably perform better. If you don't have an INDEX on those columns, for a large table it would probably be worth creating one just while running the DELETE process.
(2)
DECLARE csrDupAddr CURSOR FOR
SELECT master_key_name, house_number, street, COUNT(*)
FROM Person_Address
GROUP BY master_key_name, house_number, street
HAVING COUNT(*) > 1
FOR READ ONLY
DECLARE @master_key_name VARCHAR(5) --match to orig col
DECLARE @house_number VARCHAR(5) --match to orig col
DECLARE @street VARCHAR(5) --match to orig col
DECLARE @num_dups INT
FETCH NEXT FROM csrDupAddr INTO @master_key_name, @house_number, @street, @num_dups
WHILE @@FETCH_STATUS <> -1
BEGIN
SET @num_dups = @num_dups - 1 --delete all dups except 1
SET ROWCOUNT @num_dups
DELETE FROM Person_Address
WHERE master_key_name = @master_key_name
AND house_number = @house_number
AND street = @street
FETCH NEXT FROM csrDupAddr INTO @master_key_name, @house_number, @street, @num_dups
END --WHILE
If this is a very large table and/or there are a lot of duplicates, the second method will probably perform better. If you don't have an INDEX on those columns, for a large table it would probably be worth creating one just while running the DELETE process.
ASKER
Scott, Thanks for the info. I'm not really good, with SQL just learning how to do some basic scripts. Can I run this as a script through Query Analyzer?
Yes, you should be able to run either from Query Analyzer. Of course be sure to run it on a test system and back up the table first just in case.
ASKER
Great I will run it tonight and let you know
As I said, be sure to run a test first on a test server. And before the test, back up the test table.
DEFINITELY back up the production table before attempting to delete any rows.
DEFINITELY back up the production table before attempting to delete any rows.
ASKER
Scott I am having a little trouble with both of these. I created a new table named person)address2 to test with. I changed the appropritate table name in your scripts. When I run the first one I get this error:
Server: Msg 208, Level 16, State 1, Line 4
Invalid object name 'person_address2'.
Server: Msg 208, Level 16, State 1, Line 4
Invalid object name 'person_address2'.
I then took the table name out between the delete and the from statement and ran it again and got:
Server: Msg 156, Level 15, State 1, Line 6
Incorrect syntax near the keyword 'INNER'.
Server: Msg 156, Level 15, State 1, Line 11
Incorrect syntax near the keyword 'AS'.
I then tried the second one, when I ran that I got this error:
Server: Msg 208, Level 16, State 1, Line 2
Invalid object name 'person_address2'.
which is strange because the table name is not in the second line. Any idea what these errors could be?
Server: Msg 208, Level 16, State 1, Line 4
Invalid object name 'person_address2'.
Server: Msg 208, Level 16, State 1, Line 4
Invalid object name 'person_address2'.
I then took the table name out between the delete and the from statement and ran it again and got:
Server: Msg 156, Level 15, State 1, Line 6
Incorrect syntax near the keyword 'INNER'.
Server: Msg 156, Level 15, State 1, Line 11
Incorrect syntax near the keyword 'AS'.
I then tried the second one, when I ran that I got this error:
Server: Msg 208, Level 16, State 1, Line 2
Invalid object name 'person_address2'.
which is strange because the table name is not in the second line. Any idea what these errors could be?
It looks as if it can't find the new test table you created. Make sure that you are in the same db in which the table was created and that the owner of the table is 'dbo' (or that you specify the table as owner.table in the SQL commands).
ASKER
The owner is most likely the issue. All my tables are dba. I will change the table name in your scripts to dba.person_address2, that should solve it right? I will check tonight.
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
well,
there is always one more way to do the same thing.
select master_key_name, house_number, street into temp_copy from Person_Address
group by master_key_name, house_number, street
execute sp_rename 'Person_Address', 'old_Person_Address', 'OBJECT'
execute sp_rename 'temp_copy', 'Person_Address', 'OBJECT'
there is always one more way to do the same thing.
select master_key_name, house_number, street into temp_copy from Person_Address
group by master_key_name, house_number, street
execute sp_rename 'Person_Address', 'old_Person_Address', 'OBJECT'
execute sp_rename 'temp_copy', 'Person_Address', 'OBJECT'
ASKER
Miron, I am a little confused what will your script do?
rather then removing ( sic ) duplicates one by one, it will create set of unique records using some temporary name, then rename old set of records ( table ) sonsiting of some unique records and some duplicates, to some other name, and finally, rename the set of records free of duplicates to the name appropriate.
It's good, is that you are not loosing original set of records, but replacing it with set of unique ( what you were looking for ) records and at the same time keep completely old records un-touched, sort of a backup.
It's good, is that you are not loosing original set of records, but replacing it with set of unique ( what you were looking for ) records and at the same time keep completely old records un-touched, sort of a backup.
ASKER
This is the weirdest thing, If I run your script against a test table with hundreds of rows pasted in it works fine. If I try to run it against my actual table though it comes up with no errors but doesnt effect any rows. I have imported the data from my good table to my test table and ran the script again. At that point I run the query and it only effects 5 rows. This is out of 105k. I can still see hundreds of rows that should have been done. If I run the script again it then shows "0 Rows Effected" I then copy and paste a row to make sure they are identical and run the script again, still it shows "0 Rows Effected". Any idea what I am doing wrong?
Below is the code I am using now with the correct field names
SET ROWCOUNT 1
WHILE 1 = 1
BEGIN
DELETE dba.person_address3
FROM dba.person_address3
INNER JOIN
(SELECT TOP 1 master_name_key, house_no, street
FROM dba.person_address3
GROUP BY master_name_key, house_no, street
HAVING COUNT(*) > 1)
AS dup ON dba.person_address3.master _name_key = dup.master_name_key AND
dba.person_address3.house_ no = dup.house_no AND
dba.person_address3.street = dup.street
IF @@ROWCOUNT = 0
BREAK
END --WHILE
Below is the code I am using now with the correct field names
SET ROWCOUNT 1
WHILE 1 = 1
BEGIN
DELETE dba.person_address3
FROM dba.person_address3
INNER JOIN
(SELECT TOP 1 master_name_key, house_no, street
FROM dba.person_address3
GROUP BY master_name_key, house_no, street
HAVING COUNT(*) > 1)
AS dup ON dba.person_address3.master
dba.person_address3.house_
dba.person_address3.street
IF @@ROWCOUNT = 0
BREAK
END --WHILE
ASKER
OK Miron, that seemed to work pretty good, I just need to rewrite it a little bit to take in the Town State and Zips, If I am creating a new table I will lose that. I will let you know.
include Town, State, and Zip columns, expected some troubles with NULL values.
-- create unique key for each row
alter table Person_Address add record_id int identity( 1, 1 )
-- make NULL to be equael to NULL
SET ANSY_NULLS ON
select
master_key_name
, house_number
, street
, Town
, State
, Zip
into temp_copy
from
Person_Address where record_id in
(
select max( record_id )
from Person_Address
-- --optional where clause will fend off NULL values
-- where
-- master_key_name is not NULL
-- OR house_number is not NULL
-- OR street is not NULL
group by master_key_name, house_number, street
)
alter table Person_Address drop column record_id
execute sp_rename 'Person_Address', 'old_Person_Address', 'OBJECT'
execute sp_rename 'temp_copy', 'Person_Address', 'OBJECT'
-- create unique key for each row
alter table Person_Address add record_id int identity( 1, 1 )
-- make NULL to be equael to NULL
SET ANSY_NULLS ON
select
master_key_name
, house_number
, street
, Town
, State
, Zip
into temp_copy
from
Person_Address where record_id in
(
select max( record_id )
from Person_Address
-- --optional where clause will fend off NULL values
-- where
-- master_key_name is not NULL
-- OR house_number is not NULL
-- OR street is not NULL
group by master_key_name, house_number, street
)
alter table Person_Address drop column record_id
execute sp_rename 'Person_Address', 'old_Person_Address', 'OBJECT'
execute sp_rename 'temp_copy', 'Person_Address', 'OBJECT'
ASKER
Miron, this seems to be working very well, and I like the idea of creating a new table, thanks.
Peter thanks for your help also.
Peter thanks for your help also.
Logon Scirpts not mapping drives in XP Date: 03/17/2002 06:29AM PST
https://www.experts-exchange.com/jsp/qShow.jsp?ta=winxp&qid=20277927
Outlook on a SLOW network connection Date: 03/28/2002 08:12PM PST
https://www.experts-exchange.com/jsp/qShow.jsp?ta=msoutlook&qid=20282814
Crystal Reports RDC making ADO connection in VB Date: 04/17/2002 07:38PM PST
https://www.experts-exchange.com/jsp/qShow.jsp?ta=crystal&qid=20290747
Formatting dates in Crystal Dictionaries Date: 04/28/2002 03:04PM PST
https://www.experts-exchange.com/jsp/qShow.jsp?ta=crystal&qid=20294676
Exporting crystal to a PDF file from a VB Application Date: 04/29/2002 01:02PM PST
https://www.experts-exchange.com/jsp/qShow.jsp?ta=crystal&qid=20295092
Thanks,
Anthony