Link to home
Start Free TrialLog in
Avatar of DavidNPD
DavidNPD

asked on

Remove Duplicate rows in Sql

I need a little help with a sql script

I have a huge table of addresses with numerous duplicates, I need a script that will look at the table and if it see's more than one row with identical data it will delete any of the extra rows and only leave one row of that data.
My table name is "Person_Address", I want the script to look at the "master_name_key","house_number" and "Street" field. If there is more than one row with the identical three peices of data (or identical nulls) in all three fields then delete all the extras leaving me just one.
Avatar of Anthony Perkins
Anthony Perkins
Flag of United States of America image

Please maintain these open questions:
Logon Scirpts not mapping drives in XP Date: 03/17/2002 06:29AM PST
https://www.experts-exchange.com/jsp/qShow.jsp?ta=winxp&qid=20277927
Outlook on a SLOW network connection Date: 03/28/2002 08:12PM PST
https://www.experts-exchange.com/jsp/qShow.jsp?ta=msoutlook&qid=20282814
Crystal Reports RDC making ADO connection in VB Date: 04/17/2002 07:38PM PST  
https://www.experts-exchange.com/jsp/qShow.jsp?ta=crystal&qid=20290747
Formatting dates in Crystal Dictionaries Date: 04/28/2002 03:04PM PST
https://www.experts-exchange.com/jsp/qShow.jsp?ta=crystal&qid=20294676
Exporting crystal to a PDF file from a VB Application Date: 04/29/2002 01:02PM PST
https://www.experts-exchange.com/jsp/qShow.jsp?ta=crystal&qid=20295092

Thanks,
Anthony

 
I can see two approaches.  

(1) Create a statement that finds and deletes one duplicate.  Loop thru it repeatedly until no more duplicates exist.  For example:

SET ROWCOUNT 1
WHILE 1 = 1
BEGIN
    DELETE Person_Address
    FROM Person_Address
    INNER JOIN
    (SELECT TOP 1 master_key_name, house_number, street
    FROM Person_Address
    GROUP BY master_key_name, house_number, street
    HAVING COUNT(*) > 1)
    AS dup ON person_address.master_key_name = dup.master_key_name AND
              person_address.house_number = dup.house_number AND
              person_address.street = dup.street
    IF @@ROWCOUNT = 0
        BREAK
END --WHILE

(2) Create a cursor to find the duplicates.  Fetch thru the cursor deleting the duplicates.  For example:
DECLARE csrDupAddr CURSOR FOR
SELECT master_key_name, house_number, street, COUNT(*)
FROM Person_Address
GROUP BY master_key_name, house_number, street
HAVING COUNT(*) > 1
FOR READ ONLY
DECLARE @master_key_name VARCHAR(??) --match to orig col
DECLARE @house_number VARCHAR(??) --match to orig col
DECLARE @street VARCHAR(??) --match to orig col


Oops, here is the full version of (2):

(2)
DECLARE csrDupAddr CURSOR FOR
SELECT master_key_name, house_number, street, COUNT(*)
FROM Person_Address
GROUP BY master_key_name, house_number, street
HAVING COUNT(*) > 1
FOR READ ONLY
DECLARE @master_key_name VARCHAR(5) --match to orig col
DECLARE @house_number VARCHAR(5) --match to orig col
DECLARE @street VARCHAR(5) --match to orig col
DECLARE @num_dups INT
FETCH NEXT FROM csrDupAddr INTO @master_key_name, @house_number, @street, @num_dups
WHILE @@FETCH_STATUS <> -1
BEGIN
    SET @num_dups = @num_dups - 1  --delete all dups except 1
    SET ROWCOUNT @num_dups
    DELETE FROM Person_Address
    WHERE master_key_name = @master_key_name
    AND house_number = @house_number
    AND street = @street
    FETCH NEXT FROM csrDupAddr INTO @master_key_name, @house_number, @street, @num_dups
END --WHILE

If this is a very large table and/or there are a lot of duplicates, the second method will probably perform better.  If you don't have an INDEX on those columns, for a large table it would probably be worth creating one just while running the DELETE process.
Avatar of DavidNPD
DavidNPD

ASKER

Scott, Thanks for the info. I'm not really good, with SQL just learning how to do some basic scripts. Can I run this as a script through Query Analyzer?
Yes, you should be able to run either from Query Analyzer.  Of course be sure to run it on a test system and back up the table first just in case.
Great I will run it tonight and let you know
As I said, be sure to run a test first on a test server.  And before the test, back up the test table.

DEFINITELY back up the production table before attempting to delete any rows.
Scott I am having a little trouble with both of these. I created a new table named person)address2 to test with. I changed the appropritate table name in your scripts. When I run the first one I get this error:

Server: Msg 208, Level 16, State 1, Line 4
Invalid object name 'person_address2'.
Server: Msg 208, Level 16, State 1, Line 4
Invalid object name 'person_address2'.

I then took the table name out between the delete and the from statement and ran it again and got:

Server: Msg 156, Level 15, State 1, Line 6
Incorrect syntax near the keyword 'INNER'.
Server: Msg 156, Level 15, State 1, Line 11
Incorrect syntax near the keyword 'AS'.

I then tried the second one, when I ran that I got this error:


Server: Msg 208, Level 16, State 1, Line 2
Invalid object name 'person_address2'.

which is strange because the table name is not in the second line. Any idea what these errors could be?

It looks as if it can't find the new test table you created.  Make sure that you are in the same db in which the table was created and that the owner of the table is 'dbo' (or that you specify the table as owner.table in the SQL commands).
The owner is most likely the issue. All my tables are dba. I will change the table name in your scripts to dba.person_address2, that should solve it right? I will check tonight.
ASKER CERTIFIED SOLUTION
Avatar of miron
miron
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
well,

there is always one more way to do the same thing.

select master_key_name, house_number, street into temp_copy from Person_Address
group by master_key_name, house_number, street

execute sp_rename 'Person_Address', 'old_Person_Address', 'OBJECT'

execute sp_rename 'temp_copy', 'Person_Address', 'OBJECT'
Miron, I am a little confused what will your script do?
rather then removing ( sic ) duplicates one by one, it will create set of unique records using some temporary name, then rename old set of records ( table ) sonsiting of some unique records and some duplicates, to some other name, and finally, rename the set of records free of duplicates to the name appropriate.

It's good, is that you are not loosing original set of records, but replacing it with set of unique ( what you were looking for ) records and at the same time keep completely old records un-touched, sort of a backup.
This is the weirdest thing, If I run your script against a test table with hundreds of rows pasted in it works fine. If I try to run it against my actual table though it comes up with no errors but doesnt effect any rows. I have imported the data from my good table to my test table and ran the script again. At that point I run the query and it only effects 5 rows. This is out of 105k. I can still see hundreds of rows that should have been done. If I run the script again it then shows "0 Rows Effected" I then copy and paste a row to make sure they are identical and run the script again, still it shows "0 Rows Effected". Any idea what I am doing wrong?

Below is the code I am using now with the correct field names


SET ROWCOUNT 1
WHILE 1 = 1
BEGIN
   DELETE dba.person_address3
   FROM dba.person_address3
   INNER JOIN
   (SELECT TOP 1 master_name_key, house_no, street
   FROM dba.person_address3
   GROUP BY master_name_key, house_no, street
   HAVING COUNT(*) > 1)
   AS dup ON dba.person_address3.master_name_key = dup.master_name_key AND
             dba.person_address3.house_no = dup.house_no AND
             dba.person_address3.street = dup.street
   IF @@ROWCOUNT = 0
       BREAK
END --WHILE
OK Miron, that seemed to work pretty good, I just need to rewrite it a little bit to take in the Town State and Zips, If I am creating a new table I will lose that. I will let you know.
include Town, State, and Zip columns, expected some troubles with NULL values.

-- create unique key for each row
alter table Person_Address add record_id int identity( 1, 1 )
-- make NULL to be equael to NULL
SET ANSY_NULLS ON

select
   master_key_name
  , house_number
  , street
  , Town
  , State
  , Zip
into temp_copy
from
Person_Address where record_id in
(
select max( record_id )
from Person_Address
-- --optional where clause will fend off NULL values
-- where
-- master_key_name is not NULL
-- OR house_number is not NULL
-- OR street is not  NULL
group by master_key_name, house_number, street
)

alter table Person_Address drop column record_id
execute  sp_rename 'Person_Address', 'old_Person_Address', 'OBJECT'
execute sp_rename 'temp_copy', 'Person_Address', 'OBJECT'


Miron, this seems to be working very well, and I like the idea of creating a new table, thanks.

Peter thanks for your help also.