Solved

Deleting duplicate rows.

Posted on 2006-11-27
8
870 Views
Last Modified: 2008-10-09
There is one table that could have duplicated rows.  The only field on each row that would differ between duplicates would be a field that contains a date.

I would like to delete the duplicates but make sure that the dupe that gets deleted is the earlier entry.

How could I achieve this?

Wing
Question by:WingYip
8 Comments
 
LVL 11

Expert Comment

by:donaldmaloney
Wing,

One way (there are others). I'm assuming you don't have unique indexes.

1. Create a temporary table.
2. Open your table with dups.
3. Move to the last record and copy it to the temp table.
4. Loop back through the table until BOF:
   - get the previous record
   - compare it with the record in temp
   - if it differs, copy it to temp; otherwise get the previous record
etc.
Don
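(Editor's aside: the walk Don describes can be sketched procedurally. This is a hypothetical illustration, not his actual code; the key/date field names are invented, and it assumes the date column orders the entries.)

```python
# Scan entries newest-first and copy the first row seen for each key
# into a "temp" store -- the procedural equivalent of the walk above.
rows = [
    ("Jack", "2006-01-01"),
    ("Jack", "2006-03-01"),   # later entry for the same key: keep this one
    ("Joe",  "2006-02-01"),
]
temp = {}
for key, entered in sorted(rows, key=lambda r: r[1], reverse=True):
    if key not in temp:       # first hit per key is the latest date
        temp[key] = (key, entered)
deduped = sorted(temp.values())
print(deduped)  # [('Jack', '2006-03-01'), ('Joe', '2006-02-01')]
```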
 
LVL 1

Author Comment

by:WingYip
Hoping to get a SQL answer rather than code.  I don't have a problem with code, but the person who needs an answer to this question can just about cope with SQL.

Is this possible in SQL (Access SQL, that is)?

Wing
 
LVL 11

Expert Comment

by:donaldmaloney
wing,
Is there a unique ID that identifies the order of entry?
>>>but make sure that the dupe that gets deleted is the earlier entry.<<
don
 
LVL 11

Expert Comment

by:donaldmaloney
Also,
are any of the fields  memo fields?
Don
 
LVL 11

Expert Comment

by:donaldmaloney
OK, re-reading the first post: are all dates unique for any duplicated data?
Don
 
LVL 1

Author Comment

by:WingYip
Yes dates would be unique.  Key fields may not be.

Wing
 
LVL 3

Accepted Solution

by:taycuong76 (earned 100 total points)
Here are a few pointers that can help you find the duplicate values in a table and delete those pesky pieces of information.

Let's say you have a table called Employee, laid out as follows:

create table employee
(
  EmpId number,
  EmpName varchar2(10),
  EmpSSN varchar2(11)
);

insert into employee values(1, 'Jack', '555-55-5555');
insert into employee values (2, 'Joe', '555-56-5555');
insert into employee values (3, 'Fred', '555-57-5555');
insert into employee values (4, 'Mike', '555-58-5555');
insert into employee values (5, 'Cathy', '555-59-5555');
insert into employee values (6, 'Lisa', '555-70-5555');
insert into employee values (7, 'Jack', '555-55-5555');
insert into employee values (8, 'Mike', '555-58-5555');
insert into employee values (9, 'Cathy', '555-59-5555');
insert into employee values (10, 'Lisa', '555-70-5555');
insert into employee values (11, 'Lisa', '555-70-5555');
 

You can see from the data inserted above that EmpId is unique, but EmpName and EmpSSN have duplicate values. So now you ask yourself: how in the world can I find these duplicates and remove them?

Look no further as I will show you below one way of solving this problem.

First off, you will need to get an idea of which records are duplicates. Running the query below will help you figure that out:


SQL> select count(empssn), empssn from employee
  2    group by empssn
  3      having count(empssn) > 1;

COUNT(EMPSSN) EMPSSN
------------- -----------
            2 555-55-5555
            2 555-58-5555
            2 555-59-5555
            3 555-70-5555
From the above output you can now tell which employees have duplicate records. The next step is to remove the duplicates while still retaining one record for each duplicated employee.
Again, here is a list of all the records in the table before we remove them.

SQL> select * from employee;

     EMPID EMPNAME    EMPSSN
---------- ---------- -----------
         1 Jack       555-55-5555
         2 Joe        555-56-5555
         3 Fred       555-57-5555
         4 Mike       555-58-5555
         5 Cathy      555-59-5555
         6 Lisa       555-70-5555
         7 Jack       555-55-5555
         8 Mike       555-58-5555
         9 Cathy      555-59-5555
        10 Lisa       555-70-5555
        11 Lisa       555-70-5555
Now, let's get rid of those duplicate records by running this query:
delete from employee
  where (empid, empssn)
    not in
    ( select min(empid), empssn
         from employee group by empssn);
After the above query has been executed, let's select all the records from the table to see if it actually worked:
SQL> select * from employee;

     EMPID EMPNAME    EMPSSN
---------- ---------- -----------
         1 Jack       555-55-5555
         2 Joe        555-56-5555
         3 Fred       555-57-5555
         4 Mike       555-58-5555
         5 Cathy      555-59-5555
         6 Lisa       555-70-5555

6 rows selected.
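As an aside, the same session can be replayed outside Oracle. Here is a hedged sketch using Python's built-in sqlite3 module; it uses the single-column NOT IN form rather than Oracle's row-value comparison, which is equivalent here because EmpId is unique:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (EmpId INTEGER, EmpName TEXT, EmpSSN TEXT)")
cur.executemany("INSERT INTO employee VALUES (?, ?, ?)", [
    (1, 'Jack', '555-55-5555'), (2, 'Joe', '555-56-5555'),
    (3, 'Fred', '555-57-5555'), (4, 'Mike', '555-58-5555'),
    (5, 'Cathy', '555-59-5555'), (6, 'Lisa', '555-70-5555'),
    (7, 'Jack', '555-55-5555'), (8, 'Mike', '555-58-5555'),
    (9, 'Cathy', '555-59-5555'), (10, 'Lisa', '555-70-5555'),
    (11, 'Lisa', '555-70-5555'),
])

# Step 1: find the SSNs that occur more than once
dupes = cur.execute(
    "SELECT COUNT(EmpSSN), EmpSSN FROM employee "
    "GROUP BY EmpSSN HAVING COUNT(EmpSSN) > 1").fetchall()

# Step 2: keep the lowest EmpId per SSN and delete the rest
cur.execute(
    "DELETE FROM employee WHERE EmpId NOT IN "
    "(SELECT MIN(EmpId) FROM employee GROUP BY EmpSSN)")

remaining = [r[0] for r in cur.execute("SELECT EmpId FROM employee ORDER BY EmpId")]
print(len(dupes), remaining)  # 4 duplicated SSNs; EmpIds 1-6 remain
```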
BAM! Problem solved, no more duplicates; now you can finally go to bed and get some sleep. WAIT! Before you do, let me throw a little wrench into the situation: let's say you have duplicate records once again, but this time the EmpIds and EmpSSNs are the same.
For example, the data now looks like this:

insert into employee values(1, 'Jack', '555-55-5555');
insert into employee values (2, 'Joe', '555-56-5555');
insert into employee values (3, 'Fred', '555-57-5555');
insert into employee values (4, 'Mike', '555-58-5555');
insert into employee values (5, 'Cathy', '555-59-5555');
insert into employee values (6, 'Lisa', '555-70-5555');
insert into employee values (1, 'Jack', '555-55-5555');
insert into employee values (4, 'Mike', '555-58-5555');
insert into employee values (5, 'Cathy', '555-59-5555');
insert into employee values (6 ,'Lisa', '555-70-5555');
insert into employee values (6, 'Lisa', '555-70-5555');
Now, how can you figure out which data is duplicated when there is no longer a unique identifier in the table? The IDs are the same and the SSNs are the same. This is just great!
No problem, relax: there is actually a distinct value in the table, even though you can't see it. This little helper is called the rowid.

Remember, there is nothing to uniquely identify the data, as you can see from the insert statements above, so you will have to rely on the rowid to get out of this mess. To delete the duplicate values in the employee table, you will need to write a query like this:

 
delete from employee
   where (rowid, empssn)
    not in
     (select min(rowid), empssn from employee group by empssn);
After the delete query above has been executed, your data will appear as:
SQL> select * from employee
  2  /

     EMPID EMPNAME    EMPSSN
---------- ---------- -----------
         1 Jack       555-55-5555
         2 Joe        555-56-5555
         3 Fred       555-57-5555
         4 Mike       555-58-5555
         5 Cathy      555-59-5555
         6 Lisa       555-70-5555

6 rows selected.
As you can see, all the duplicate values have been removed and the table finally contains unique records. So before you call it quits, and before someone else slips in and enters more duplicate data, make sure you add a primary key on the EmpId or EmpSSN column so this won't happen again!
Now finally, you will be able to get that long awaited sleep!

IMPORTANT NOTE: Remember, the rowid will always uniquely identify a row in a table, but please do not use it as a primary key. I only showed you this little trick to help you get out of a bind when the database is poorly designed and allows duplicate data to be entered.
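For what it's worth, SQLite tables also carry a hidden rowid, so the rowid trick can be reproduced with Python's sqlite3 module. A hedged stand-in for the Oracle session above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (EmpId INTEGER, EmpName TEXT, EmpSSN TEXT)")
cur.executemany("INSERT INTO employee VALUES (?, ?, ?)", [
    (1, 'Jack', '555-55-5555'), (2, 'Joe', '555-56-5555'),
    (3, 'Fred', '555-57-5555'), (4, 'Mike', '555-58-5555'),
    (5, 'Cathy', '555-59-5555'), (6, 'Lisa', '555-70-5555'),
    (1, 'Jack', '555-55-5555'), (4, 'Mike', '555-58-5555'),
    (5, 'Cathy', '555-59-5555'), (6, 'Lisa', '555-70-5555'),
    (6, 'Lisa', '555-70-5555'),
])

# The rows are exact duplicates, so only the hidden rowid tells them
# apart: keep the lowest rowid per SSN and delete the rest.
cur.execute(
    "DELETE FROM employee WHERE rowid NOT IN "
    "(SELECT MIN(rowid) FROM employee GROUP BY EmpSSN)")

print(cur.execute("SELECT COUNT(*) FROM employee").fetchone()[0])  # 6
```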
===============
The only way you can automatically remove duplicates is if you have a unique key (one that preferably is also chronological). In this case you don't, so it is not going to be easy. You could add an AutoNumber field; that would be unique and could be used to delete the duplicates.

The trouble is that if you selected the records to delete by the duplicated fields alone, all records with that address would be deleted, including the one you want to keep.

If you have a unique key, you can prevent the record you want to keep from being removed.

OK, now for an example. First create a query that will select the records with duplicates (which actually selects the duplicates you want to keep):

SELECT [Street Address], Max(tblAddress.ID) AS ID
FROM tblAddress
GROUP BY [Street Address]
HAVING (((Count(*))>1));

This selects all duplicated Street Address records. I added the AutoNumber field and named it ID. The query selects the MAX of the duplicate IDs, so the last one is kept. (If you want to keep the oldest record, change it to MIN.)

Now save the query as "qryFindDuplicates" and create the following query that will delete the other duplicates:

DELETE tblAddress.ID
FROM tblAddress INNER JOIN qryFindDuplicates ON tblAddress.[Street Address] = qryFindDuplicates.[Street Address]
WHERE (((tblAddress.ID)<>[qryFindDuplicates].[ID]));


This will delete the duplicates but leave the ones that were selected for keeping by the previous query.
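The two Access queries don't port verbatim to every engine (SQLite, for one, has no joined DELETE), but the same keep-the-MAX(ID) logic collapses into a single statement. A hedged sketch with Python's sqlite3 module; the [Street Address] field is renamed StreetAddress for portability and the sample addresses are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# ID plays the role of the Access AutoNumber field
cur.execute("CREATE TABLE tblAddress "
            "(ID INTEGER PRIMARY KEY AUTOINCREMENT, StreetAddress TEXT)")
cur.executemany("INSERT INTO tblAddress (StreetAddress) VALUES (?)",
                [("1 Elm St",), ("2 Oak Ave",), ("1 Elm St",), ("1 Elm St",)])

# Keep MAX(ID) (the newest row) per address; switch to MIN to keep the oldest
cur.execute(
    "DELETE FROM tblAddress WHERE ID NOT IN "
    "(SELECT MAX(ID) FROM tblAddress GROUP BY StreetAddress)")

print(cur.execute("SELECT ID, StreetAddress FROM tblAddress ORDER BY ID").fetchall())
# [(2, '2 Oak Ave'), (4, '1 Elm St')]
```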
=============
I have often found that poor database design creates the need for inventive SQL solutions. In fact, the whole concept of "normalizing" databases sprang from eliminating certain "anomalies" when querying the database. When faced with a sub-optimal database design, the ideal solution from a technical standpoint is to redesign the database and correct its flaws. However, redesigning the database often is not an option due to time or budget constraints, pre-existing software, or simple lack of authority to change the design. We must often play the hand we are dealt. In this article, I'll show you how to leverage advanced join techniques to get the most from design-challenged databases.

Suppose the person who designed your database fell asleep in database theory class when the professor was discussing entity integrity. As a result, the schema lacks primary keys and is rife with duplicate data. Since it's difficult to write a quality piece of software against a database with duplicate data, you decide to create primary keys on the tables. Your only problem is that you must first identify and eliminate the duplicate data.

There are several ways of finding and eliminating duplicate data; some are better suited to specific situations than others. The technique that I'll show you in this article is a relatively straightforward technique that should work in a wide variety of situations. To demonstrate the technique, I'm going to use the following table:


CREATE TABLE Payment(
      CustomerNumber   int           NOT NULL,
      PostedDatetime   datetime      NOT NULL,
      PaymentAmt       money         NOT NULL,
      PaymentMemo      varchar(200)  NOT NULL,
      CCRefNum         char(10)      NOT NULL
)


The first step to a duplicate-free table is coming up with a way of uniquely identifying each row. There's no way you can delete a particular row without being able to distinguish it from the others. The easiest way to accomplish this is to add an identity column to this table and make it the primary key.


ALTER TABLE Payment ADD
      PaymentID int NOT NULL IDENTITY(1, 1),
      CONSTRAINT PK_Payment PRIMARY KEY(PaymentID)


Now that you have the ability to specify a particular row to delete, you must decide how to define "duplicate". Which columns should uniquely identify a row? In real-world scenarios, the answer will be based upon how the table has been used and the actual data contained in it; however, for this example, a combination of CustomerNumber, PostedDatetime, and PaymentAmt will serve as the logical key. It makes sense that the same customer would not post two payments of exactly the same amount at exactly the same time.

Having decided what a duplicate is, you need some way of identifying them. To identify the duplicate rows, you need to construct a query which returns each unique CustomerNumber / PostedDatetime / PaymentAmt combination where there is more than one row with the same key values.


SELECT
      CustomerNumber,
      PostedDatetime,
      PaymentAmt
FROM
      Payment
GROUP BY
      CustomerNumber,
      PostedDatetime,
      PaymentAmt
HAVING
      COUNT(*) > 1


This query groups by the key values to return one row per unique set of values. The HAVING clause then filters out all groups where there is only one row for that set of values. This leaves a rowset containing only those logical key values where there are duplicates.

The next problem to solve is how to eliminate all but one row per set of key values. The PaymentID column added earlier provides an answer. If we could somehow get exactly one PaymentID for each set of duplicates, we could delete all rows where the CustomerNumber / PostedDatetime / PaymentAmt is returned in the above query, but the PaymentID does not match. This would delete all but one row per set of duplicates. How can we accomplish this?

When grouping data, aggregate functions can be used to return data about the group as a whole. While some of the aggregate functions such as SUM() and AVG() return values which don't exist on any particular row in the group, the extrema functions MIN() and MAX() return values which exist in an actual row in the group. Hence, the extrema functions can be used to return an arbitrary value from some set of values.

Using this knowledge about MIN() and MAX(), we can modify the above query to return the CustomerNumber / PostedDatetime / PaymentAmt for each set of duplicates along with exactly one unique PaymentID.


SELECT
      MAX(PaymentID) AS PaymentID,
      CustomerNumber,
      PostedDatetime,
      PaymentAmt
FROM
      Payment
GROUP BY
      CustomerNumber,
      PostedDatetime,
      PaymentAmt
HAVING
      COUNT(*) > 1


This query returns all the data we need to perform the delete operation, but how do we tell the database engine to actually delete the records? We can't simply change the SELECT to DELETE because DELETE statements don't support grouping and aggregates. However, the DELETE statement does support joining to other tables. We can use a technique called a derived table to treat the above SELECT statement as if it were a table. We can then write a DELETE statement which joins to this derived table and performs the DELETE.


DELETE FROM
      p1
FROM
      Payment p1
INNER JOIN
      (
            SELECT
                  MAX(PaymentID) AS PaymentID,
                  CustomerNumber,
                  PostedDatetime,
                  PaymentAmt
            FROM
                  Payment
            GROUP BY
                  CustomerNumber,
                  PostedDatetime,
                  PaymentAmt
            HAVING
                  COUNT(*) > 1
      ) p2
      ON(
            p1.CustomerNumber = p2.CustomerNumber
            AND
            p1.PostedDatetime = p2.PostedDatetime
            AND
            p1.PaymentAmt = p2.PaymentAmt
            AND
            p1.PaymentID <> p2.PaymentID
      )


The most interesting part of this query is the join criteria. We join on CustomerNumber, PostedDatetime, and PaymentAmt to delete rows that match logically to the set of duplicates we've found. If we stopped there, we'd delete the duplicates, but we wouldn't have any copies of the duplicated rows remaining. Our requirements stated that we delete all the duplicates but leave one copy of each duplicated row. To do this, we add the fourth criterion, requiring that the PaymentID not match the one PaymentID returned for the set of duplicates. Thus, one row per set of duplicates is retained while the others are deleted.
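For engines without the DELETE...FROM...JOIN form (the statement above is SQL Server syntax), the same result can be had by keeping one PaymentID per logical key with a plain NOT IN. A hedged sqlite3 sketch, with the memo and card-reference columns dropped and sample payments invented for brevity:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE Payment (
    PaymentID      INTEGER PRIMARY KEY AUTOINCREMENT,  -- stands in for IDENTITY
    CustomerNumber INTEGER NOT NULL,
    PostedDatetime TEXT    NOT NULL,
    PaymentAmt     REAL    NOT NULL)""")
cur.executemany(
    "INSERT INTO Payment (CustomerNumber, PostedDatetime, PaymentAmt) "
    "VALUES (?, ?, ?)",
    [(1, '2006-11-01 09:00', 50.0),
     (1, '2006-11-01 09:00', 50.0),   # duplicate logical key
     (2, '2006-11-02 10:30', 75.0)])

# Keep MAX(PaymentID) per logical key. Non-duplicated rows are their own
# MAX, so no HAVING COUNT(*) > 1 filter is needed in this form.
cur.execute(
    "DELETE FROM Payment WHERE PaymentID NOT IN "
    "(SELECT MAX(PaymentID) FROM Payment "
    " GROUP BY CustomerNumber, PostedDatetime, PaymentAmt)")

print([r[0] for r in cur.execute("SELECT PaymentID FROM Payment ORDER BY PaymentID")])
# [2, 3]
```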

Hopefully, you'll never need to use these techniques to delete duplicate data. However, if you do, you won't have to struggle to find a technique that precisely and efficiently performs the task. Even if you don't use this technique to delete duplicates, many of the elements of the solution - derived tables, joined deletes, grouping and aggregation - can be applied to other situations that require inventive query techniques.


 
LVL 1

Author Comment

by:WingYip
Thanks for the answer

Sorry about delay

Wing
