Solved

Deleting duplicate rows.

Posted on 2006-11-27
8
891 Views
Last Modified: 2008-10-09
There is one table that could have duplicated rows.  The only field on each row that would differ between duplicates would be field that contains a date.

I would like to delete the duplicates but make sure that the dupe that gets deleted is the earlier entry.

How could i achieve this?

Wing
0
Comment
Question by:WingYip
  • 4
  • 3
8 Comments
 
LVL 11

Expert Comment

by:donaldmaloney
ID: 18022696
Wing,

One way there are others.
 Im assuming you dont have unique indexes.

Create a temporary table.
OPen your table with dups.
move to the last record.
copy record to temp.
loop table till BOF
get prev
check table recordwith Temp  
if <> move to temp else get prev
etc.
Don
0
 
LVL 1

Author Comment

by:WingYip
ID: 18024053
Hoping to get a sql answer rather than code.  I dont have a problem with code but the person who needs an answer to this question can just about cope with sql.

Is this possible in sql (Access Sql that is)?

Wing
0
 
LVL 11

Expert Comment

by:donaldmaloney
ID: 18025006
wing,
Is there a unique ID that identifies the order of entry?
>>>but make sure that the dupe that gets deleted is the earlier entry.<<
don
0
Back Up Your Microsoft Windows Server®

Back up all your Microsoft Windows Server – on-premises, in remote locations, in private and hybrid clouds. Your entire Windows Server will be backed up in one easy step with patented, block-level disk imaging. We achieve RTOs (recovery time objectives) as low as 15 seconds.

 
LVL 11

Expert Comment

by:donaldmaloney
ID: 18025019
Also,
are any of the fields  memo fields?
Don
0
 
LVL 11

Expert Comment

by:donaldmaloney
ID: 18025027
OK re reading 1st post.
are all dates unique for any duplicate data?
Don
0
 
LVL 1

Author Comment

by:WingYip
ID: 18026247
Yes dates would be unique.  Key fields may not be.

Wing
0
 
LVL 3

Accepted Solution

by:
taycuong76 earned 100 total points
ID: 18026524
Here are a few pointers that can help you find the duplicate values in a table and delete those pesky pieces of information.

Lets say you have a table called Employees and its layout is as follows:

create table employee
(
  EmpId number,
  EmpName varchar2(10),
  EmpSSN varchar2(11)
);

insert into employee values(1, 'Jack', '555-55-5555');
insert into employee values (2, 'Joe', '555-56-5555');
insert into employee values (3, 'Fred', '555-57-5555');
insert into employee values (4, 'Mike', '555-58-5555');
insert into employee values (5, 'Cathy', '555-59-5555');
insert into employee values (6, 'Lisa', '555-70-5555');
insert into employee values (7, 'Jack', '555-55-5555');
insert into employee values (8, 'Mike', '555-58-5555');
insert into employee values (9, 'Cathy', '555-59-5555');
insert into employee values (10, 'Lisa', '555-70-5555');
insert into employee values (11, 'Lisa', '555-70-5555');
 

You can see from the data inserted above that the EmpId is unique but the EmpName and EmpSSN have duplicate values, so now you ask yourself, how in the world can I find these duplicates and remove them?

Look no further as I will show you below one way of solving this problem.

First off you will need to get an idea of what records are duplicate and by running the query below it will help you figure that out:


SQL> select count(empssn), empssn from employee
  2    group by empssn
  3      having count(empssn) > 1;

COUNT(EMPSSN) EMPSSN
------------- -----------
            2 555-55-5555
            2 555-58-5555
            2 555-59-5555
            3 555-70-5555
From the above output you can now tell what employees have duplicate records and now you can move onto the next step and remove them but still retain a unique record for the duplicate employees.
Again here is a list all of the records in the table before we remove them.

SQL> select * from employee;

     EMPID EMPNAME    EMPSSN
---------- ---------- -----------
         1 Jack       555-55-5555
         2 Joe        555-56-5555
         3 Fred       555-57-5555
         4 Mike       555-58-5555
         5 Cathy      555-59-5555
         6 Lisa       555-70-5555
         7 Jack       555-55-5555
         8 Mike       555-58-5555
         9 Cathy      555-59-5555
        10 Lisa       555-70-5555
        11 Lisa       555-70-5555
Now, let's get rid of those duplicate records by running this query:
delete from employee
  where (empid, empssn)
    not in
    ( select min(empid), empssn
         from employee group by empssn);
After the above query has been executed lets select all the records from the table to see if indeed it actually worked:
SQL> select * from employee;

     EMPID EMPNAME    EMPSSN
---------- ---------- -----------
         1 Jack       555-55-5555
         2 Joe        555-56-5555
         3 Fred       555-57-5555
         4 Mike       555-58-5555
         5 Cathy      555-59-5555
         6 Lisa       555-70-5555

6 rows selected.
BAM! Problem solved, no more duplicates, now finally you can go to bed and get some sleep. WAIT! Just before you do, let me throw a little wrench into the situation, lets say that you have duplicate records once again but this time the empids and empSSNs are the same.
For example the data now looks like this:

insert into employee values(1, 'Jack', '555-55-5555');
insert into employee values (2, 'Joe', '555-56-5555');
insert into employee values (3, 'Fred', '555-57-5555');
insert into employee values (4, 'Mike', '555-58-5555');
insert into employee values (5, 'Cathy', '555-59-5555');
insert into employee values (6, 'Lisa', '555-70-5555');
insert into employee values (1, 'Jack', '555-55-5555');
insert into employee values (4, 'Mike', '555-58-5555');
insert into employee values (5, 'Cathy', '555-59-5555');
insert into employee values (6 ,'Lisa', '555-70-5555');
insert into employee values (6, 'Lisa', '555-70-5555');
Now, how can you figure out what data duplicate as now there is no unique identifier in the table. The Ids are the same and the SSN are the same. This is just great!
No problem, relax, there is actually a distinct value in the table but you can't see it. This little helper is called the rowid.

Remember there is nothing to uniquely identify the data as you can see by the above insert statements so you will now have to rely on the rowid to get us out of this mess. In order to delete the duplicate values in the employee table you will now need to write a query like this:

 
delete from employee
   where (rowid, empssn)
    not in
     (select min(rowid), empssn from employee group by empssn);
And hence, after the delete query above has been executed your data will appear as:
SQL> select * from employee
  2  /

     EMPID EMPNAME    EMPSSN
---------- ---------- -----------
         1 Jack       555-55-5555
         2 Joe        555-56-5555
         3 Fred       555-57-5555
         4 Mike       555-58-5555
         5 Cathy      555-59-5555
         6 Lisa       555-70-5555

6 rows selected.
As you can see all the duplicate values have been removed from the table and now you finally have unique records in the table, so before you call it quits or before someone else slips in and enters more duplicate data make sure you add a primary key to that EMPID or EMPSSN column so this wont happen again!
Now finally, you will be able to get that long awaited sleep!

IMPORTANT NOTE: Remember rowid will always uniquely identify a row in a table but please do not use this as a primary key. I only showed you this little trick to help you get out of a bind when the database is poorly designed by allowing duplicate data to be entered.
===============
The only way you can automatically remove duplicates is if you have a unique key (that perferably also is chronological). In this case you don't so it is not gonna be easy. You could add an autonumber field. That would be unique and could be used to deleted the duplicates.

The trouble is that if you would select the records to be deleted by the duplicate fields, that all records with that address would be deleted.

If you have a unique key you can prevent the record you want kept to be removed.

OK, now for an example. First create a query that will select the records with duplicates (which actually selects the duplicates you want to keep):

SELECT [Street Address], Max(tblAddress.ID) AS ID
FROM tblAddress
GROUP BY [Street Address]
HAVING (((Count(*))>1));

This selects all duplicate Street Address records. I added the autonumber field and named it ID. The query selects the MAX of the duplicate ID's. So the last one is kept. (If you want to keep the oldest record change it to MIN.

Now save the query as "qryFindDuplicates" and create the following query that will delete the other duplicates:

DELETE tblAddress.ID
FROM tblAddress INNER JOIN qryFindDuplicates ON tblAddress.[Street Address] = qryFindDuplicates.[Street Address]
WHERE (((tblAddress.ID)<>[qryFindDuplicates].[ID]));


This will delete the duplicates, but leave the ones that were selected for keeping (from the previous query)
=============
 have often found that poor database design creates the need for inventive SQL solutions. In fact, the whole concept of "normalizing" databases sprang from eliminating certain "anomalies" when querying the database. When faced with a sub-optimal database design, the ideal solution from a technical standpoint is to redesign the database and correct its flaws. However, redesigning the database often is not an option due to time or budget constraints, pre-existing software, or simple lack of authority to change the design. We must often play the hand we are dealt. In this article, I'll show you how to leverage advanced join techniques to get the most from design-challenged databases.

Suppose the person that designed your database fell asleep in database theory class when the professor was discussing entity integrity. As a result, the schema lacks primary keys and is rife with duplicate data. Since its difficult to write a quality piece of software against a database with duplicate data, you decide to create primary keys on the tables. Your only problem is that you must first identify and eliminate the duplicate data.

There are several ways of finding and eliminating duplicate data; some are better suited to specific situations than others. The technique that I'll show you in this article is a relatively straightforward technique that should work in a wide variety of situations. To demonstrate the technique, I'm going to use the following table:


CREATE TABLE Payment(
      CustomerNumber      int            NOT NULL,
      PostedDatetime      datetime      NOT NULL,
      PaymentAmt      money            NOT NULL,
      PaymentMemo      varchar(200)      NOT NULL,
      CCRefNum      char(10)      NOT NULL
)


The first step to a duplicate-free table is coming up with a way of uniquely identifying each row. There's no way you can delete a particular row without being able to distinguish it from the others. The easiest way to accomplish this is to add an identity column to this table and make it the primary key.


ALTER TABLE Payment ADD
PaymentID int NOT NULL IDENTITY(1, 1),
      CONSTRAINT PK_Payment PRIMARY KEY(PaymentID)


Now that you have the ability to specify a particular row to delete, you must decide how to define "duplicate". Which columns should uniquely identify a row? In real-world scenarios, the answer will be based upon how the table has been used and the actual data contained in it; however, for this example, a combination of CustomerNumber, PostedDatetime, and PaymentAmt will serve as the logical key. It makes sense that the same customer would not post two payments of exactly the same amount at exactly the same time.

Having decided what a duplicate is, you need some way of identifying them. To identify the duplicate rows, you need to construct a query which returns each unique CustomerNumber / PostedDatetime / PaymentAmt combination where there is more than one row with the same key values.


SELECT
      CustomerNumber,
      PostedDatetime,
      PaymentAmt
FROM
      Payment
GROUP BY
      CustomerNumber,
      PostedDatetime,
      PaymentAmt
HAVING
      COUNT(*) > 1


This query groups by the key values to return one row per unique set of values. The HAVING clause then filters out all groups where there is only one row for that set of values. This leaves a rowset containing only those logical key values where there are duplicates.

The next problem to solve is how to eliminate all but one row per set of key values. The PaymentID column added earlier provides an answer. If we could somehow get exactly one PaymentID for each set of duplicates, we could delete all rows where the CustomerNumber / PostedDatetime / PaymentAmt is returned in the above query, but the PaymentID does not match. This would delete all but one row per set of duplicates. How can we accomplish this?

When grouping data, aggregate functions can be used to return data about the group as a whole. While some of the aggregate functions such as SUM() and AVG() return values which don't exist on any particular row in the group, the extrema functions MIN() and MAX() return values which exist in an actual row in the group. Hence, the extrema functions can be used to return an arbitrary value from some set of values.

Using this knowledge about MIN() and MAX(), we can modify the above query to return the CustomerNumber / PostedDatetime / PaymentAmt for each set of duplicates along with exactly one unique PaymentID.


SELECT
      MAX(PaymentID) AS PaymentID,
      CustomerNumber,
      PostedDatetime,
      PaymentAmt
FROM
      Payment
GROUP BY
      CustomerNumber,
      PostedDatetime,
      PaymentAmt
HAVING
      COUNT(*) > 1


This query returns all the data we need to perform the delete operation, but how do we tell the database engine to actually delete the records? We can't simply change the SELECT to DELETE because DELETE statements don't support grouping and aggregates. However, the DELETE statement does support joining to other tables. We can use a technique called a derived table to treat the above SELECT statement as if it were a table. We can then write a DELETE statement which joins to this derived table and performs the DELETE.


DELETE FROM
      p1
FROM
      Payment p1
INNER JOIN
      (
            SELECT
                  MAX(PaymentID) AS PaymentID,
                  CustomerNumber,
                  PostedDatetime,
                  PaymentAmt
            FROM
                  Payment
            GROUP BY
                  CustomerNumber,
                  PostedDatetime,
                  PaymentAmt
            HAVING
                  COUNT(*) > 1
      ) p2
      ON(
            p1.CustomerNumber = p2.CustomerNumber
            AND
            p1.PostedDatetime = p2.PostedDatetime
            AND
            p1.PaymentAmt = p2.PaymentAmt
            AND
            p1.PaymentID <> p2.PaymentID
      )


The most interesting part of this query is the join criteria. We join on CustomerNumber, PostedDatetime, and PaymentAmt to delete rows that match logically to the set of duplicates that we've found. If we left the criteria at this point, we'd end up deleting the duplicates, but we wouldn't have any copies of the duplicate rows remaining. Our requirements stated that we delete all the duplicates, but leave one copy of each duplicated row. To do this, we add the fourth criterion stating the PaymentID not match one of the PaymentID values in the set of duplicates. Thus, one row per set of duplicates is retained while deleting the others.

Hopefully, you'll never need to use these techniques to delete duplicate data. However, if you do, you won't have to struggle to find a technique that precisely and efficiently performs the task. Even if you don't use this technique to delete duplicates, many of the elements of the solution - derived tables, joined deletes, grouping and aggregation - can be applied to other situations that require inventive query techniques.


0
 
LVL 1

Author Comment

by:WingYip
ID: 18347833
Thanks for the answer

Sorry about delay

Wing
0

Featured Post

Free Tool: IP Lookup

Get more info about an IP address or domain name, such as organization, abuse contacts and geolocation.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

I see at least one EE question a week that pertains to using temporary tables in MS Access.  But surprisingly, I was unable to find a single article devoted solely to this topic. I don’t intend to describe all of the uses of temporary tables in t…
Describes a method of obtaining an object variable to an already running instance of Microsoft Access so that it can be controlled via automation.
Show developers how to use a criteria form to limit the data that appears on an Access report. It is a common requirement that users can specify the criteria for a report at runtime. The easiest way to accomplish this is using a criteria form that a…
Using Microsoft Access, learn some simple rules for how to construct tables in a relational database. Split up all multi-value fields into single values: Split up fields that belong to other things into separate tables: Make sure that all record…

829 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question