Solved

Remove Duplicate rows in Sql

Posted on 2002-06-26
18
604 Views
Last Modified: 2008-02-01
I need a little help with a sql script

I have a huge table of addresses with numerous duplicates, I need a script that will look at the table and if it see's more than one row with identical data it will delete any of the extra rows and only leave one row of that data.
My table name is "Person_Address", I want the script to look at the "master_name_key","house_number" and "Street" field. If there is more than one row with the identical three peices of data (or identical nulls) in all three fields then delete all the extras leaving me just one.
0
Comment
Question by:DavidNPD
  • 8
  • 5
  • 4
  • +1
18 Comments
 
LVL 75

Expert Comment

by:Anthony Perkins
ID: 7112677
Please maintain these open questions:
Logon Scirpts not mapping drives in XP Date: 03/17/2002 06:29AM PST
http://www.experts-exchange.com/jsp/qShow.jsp?ta=winxp&qid=20277927
Outlook on a SLOW network connection Date: 03/28/2002 08:12PM PST
http://www.experts-exchange.com/jsp/qShow.jsp?ta=msoutlook&qid=20282814
Crystal Reports RDC making ADO connection in VB Date: 04/17/2002 07:38PM PST  
http://www.experts-exchange.com/jsp/qShow.jsp?ta=crystal&qid=20290747
Formatting dates in Crystal Dictionaries Date: 04/28/2002 03:04PM PST
http://www.experts-exchange.com/jsp/qShow.jsp?ta=crystal&qid=20294676
Exporting crystal to a PDF file from a VB Application Date: 04/29/2002 01:02PM PST
http://www.experts-exchange.com/jsp/qShow.jsp?ta=crystal&qid=20295092

Thanks,
Anthony

 
0
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 7113591
I can see two approaches.  

(1) Create a statement that finds and deletes one duplicate.  Loop thru it repeatedly until no more duplicates exist.  For example:

SET ROWCOUNT 1
WHILE 1 = 1
BEGIN
    DELETE Person_Address
    FROM Person_Address
    INNER JOIN
    (SELECT TOP 1 master_key_name, house_number, street
    FROM Person_Address
    GROUP BY master_key_name, house_number, street
    HAVING COUNT(*) > 1)
    AS dup ON person_address.master_key_name = dup.master_key_name AND
              person_address.house_number = dup.house_number AND
              person_address.street = dup.street
    IF @@ROWCOUNT = 0
        BREAK
END --WHILE

(2) Create a cursor to find the duplicates.  Fetch thru the cursor deleting the duplicates.  For example:
DECLARE csrDupAddr CURSOR FOR
SELECT master_key_name, house_number, street, COUNT(*)
FROM Person_Address
GROUP BY master_key_name, house_number, street
HAVING COUNT(*) > 1
FOR READ ONLY
DECLARE @master_key_name VARCHAR(??) --match to orig col
DECLARE @house_number VARCHAR(??) --match to orig col
DECLARE @street VARCHAR(??) --match to orig col


0
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 7113597
Oops, here is the full version of (2):

(2)
DECLARE csrDupAddr CURSOR FOR
SELECT master_key_name, house_number, street, COUNT(*)
FROM Person_Address
GROUP BY master_key_name, house_number, street
HAVING COUNT(*) > 1
FOR READ ONLY
DECLARE @master_key_name VARCHAR(5) --match to orig col
DECLARE @house_number VARCHAR(5) --match to orig col
DECLARE @street VARCHAR(5) --match to orig col
DECLARE @num_dups INT
FETCH NEXT FROM csrDupAddr INTO @master_key_name, @house_number, @street, @num_dups
WHILE @@FETCH_STATUS <> -1
BEGIN
    SET @num_dups = @num_dups - 1  --delete all dups except 1
    SET ROWCOUNT @num_dups
    DELETE FROM Person_Address
    WHERE master_key_name = @master_key_name
    AND house_number = @house_number
    AND street = @street
    FETCH NEXT FROM csrDupAddr INTO @master_key_name, @house_number, @street, @num_dups
END --WHILE

If this is a very large table and/or there are a lot of duplicates, the second method will probably perform better.  If you don't have an INDEX on those columns, for a large table it would probably be worth creating one just while running the DELETE process.
0
 

Author Comment

by:DavidNPD
ID: 7114676
Scott, Thanks for the info. I'm not really good, with SQL just learning how to do some basic scripts. Can I run this as a script through Query Analyzer?
0
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 7114719
Yes, you should be able to run either from Query Analyzer.  Of course be sure to run it on a test system and back up the table first just in case.
0
 

Author Comment

by:DavidNPD
ID: 7114738
Great I will run it tonight and let you know
0
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 7114852
As I said, be sure to run a test first on a test server.  And before the test, back up the test table.

DEFINITELY back up the production table before attempting to delete any rows.
0
 

Author Comment

by:DavidNPD
ID: 7115523
Scott I am having a little trouble with both of these. I created a new table named person)address2 to test with. I changed the appropritate table name in your scripts. When I run the first one I get this error:

Server: Msg 208, Level 16, State 1, Line 4
Invalid object name 'person_address2'.
Server: Msg 208, Level 16, State 1, Line 4
Invalid object name 'person_address2'.

I then took the table name out between the delete and the from statement and ran it again and got:

Server: Msg 156, Level 15, State 1, Line 6
Incorrect syntax near the keyword 'INNER'.
Server: Msg 156, Level 15, State 1, Line 11
Incorrect syntax near the keyword 'AS'.

I then tried the second one, when I ran that I got this error:


Server: Msg 208, Level 16, State 1, Line 2
Invalid object name 'person_address2'.

which is strange because the table name is not in the second line. Any idea what these errors could be?

0
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 7117022
It looks as if it can't find the new test table you created.  Make sure that you are in the same db in which the table was created and that the owner of the table is 'dbo' (or that you specify the table as owner.table in the SQL commands).
0
Zoho SalesIQ

Hassle-free live chat software re-imagined for business growth. 2 users, always free.

 

Author Comment

by:DavidNPD
ID: 7117306
The owner is most likely the issue. All my tables are dba. I will change the table name in your scripts to dba.person_address2, that should solve it right? I will check tonight.
0
 
LVL 9

Accepted Solution

by:
miron earned 50 total points
ID: 7119572
well,

there is always one more way to do the same thing.

select master_key_name, house_number, street into temp_copy from Person_Address
group by master_key_name, house_number, street

execute sp_rename 'Person_Address', 'old_Person_Address', 'OBJECT'

execute sp_rename 'temp_copy', 'Person_Address', 'OBJECT'
0
 
LVL 9

Expert Comment

by:miron
ID: 7119672
well,

there is always one more way to do the same thing.

select master_key_name, house_number, street into temp_copy from Person_Address
group by master_key_name, house_number, street

execute sp_rename 'Person_Address', 'old_Person_Address', 'OBJECT'

execute sp_rename 'temp_copy', 'Person_Address', 'OBJECT'
0
 

Author Comment

by:DavidNPD
ID: 7122784
Miron, I am a little confused what will your script do?
0
 
LVL 9

Expert Comment

by:miron
ID: 7122910
rather then removing ( sic ) duplicates one by one, it will create set of unique records using some temporary name, then rename old set of records ( table ) sonsiting of some unique records and some duplicates, to some other name, and finally, rename the set of records free of duplicates to the name appropriate.

It's good, is that you are not loosing original set of records, but replacing it with set of unique ( what you were looking for ) records and at the same time keep completely old records un-touched, sort of a backup.
0
 

Author Comment

by:DavidNPD
ID: 7122937
This is the weirdest thing, If I run your script against a test table with hundreds of rows pasted in it works fine. If I try to run it against my actual table though it comes up with no errors but doesnt effect any rows. I have imported the data from my good table to my test table and ran the script again. At that point I run the query and it only effects 5 rows. This is out of 105k. I can still see hundreds of rows that should have been done. If I run the script again it then shows "0 Rows Effected" I then copy and paste a row to make sure they are identical and run the script again, still it shows "0 Rows Effected". Any idea what I am doing wrong?

Below is the code I am using now with the correct field names


SET ROWCOUNT 1
WHILE 1 = 1
BEGIN
   DELETE dba.person_address3
   FROM dba.person_address3
   INNER JOIN
   (SELECT TOP 1 master_name_key, house_no, street
   FROM dba.person_address3
   GROUP BY master_name_key, house_no, street
   HAVING COUNT(*) > 1)
   AS dup ON dba.person_address3.master_name_key = dup.master_name_key AND
             dba.person_address3.house_no = dup.house_no AND
             dba.person_address3.street = dup.street
   IF @@ROWCOUNT = 0
       BREAK
END --WHILE
0
 

Author Comment

by:DavidNPD
ID: 7123000
OK Miron, that seemed to work pretty good, I just need to rewrite it a little bit to take in the Town State and Zips, If I am creating a new table I will lose that. I will let you know.
0
 
LVL 9

Expert Comment

by:miron
ID: 7124043
include Town, State, and Zip columns, expected some troubles with NULL values.

-- create unique key for each row
alter table Person_Address add record_id int identity( 1, 1 )
-- make NULL to be equael to NULL
SET ANSY_NULLS ON

select
   master_key_name
  , house_number
  , street
  , Town
  , State
  , Zip
into temp_copy
from
Person_Address where record_id in
(
select max( record_id )
from Person_Address
-- --optional where clause will fend off NULL values
-- where
-- master_key_name is not NULL
-- OR house_number is not NULL
-- OR street is not  NULL
group by master_key_name, house_number, street
)

alter table Person_Address drop column record_id
execute  sp_rename 'Person_Address', 'old_Person_Address', 'OBJECT'
execute sp_rename 'temp_copy', 'Person_Address', 'OBJECT'


0
 

Author Comment

by:DavidNPD
ID: 7127159
Miron, this seems to be working very well, and I like the idea of creating a new table, thanks.

Peter thanks for your help also.
0

Featured Post

Is Your Active Directory as Secure as You Think?

More than 75% of all records are compromised because of the loss or theft of a privileged credential. Experts have been exploring Active Directory infrastructure to identify key threats and establish best practices for keeping data safe. Attend this month’s webinar to learn more.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Load balancing is the method of dividing the total amount of work performed by one computer between two or more computers. Its aim is to get more work done in the same amount of time, ensuring that all the users get served faster.
International Data Corporation (IDC) prognosticates that before the current the year gets over disbursing on IT framework products to be sent in cloud environs will be $37.1B.
Via a live example, show how to extract information from SQL Server on Database, Connection and Server properties
This videos aims to give the viewer a basic demonstration of how a user can query current session information by using the SYS_CONTEXT function

932 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

13 Experts available now in Live!

Get 1:1 Help Now