Solved

remove duplicates based on specified field

Posted on 2006-10-24
24
332 Views
Last Modified: 2012-08-14
Question that need to be solved: We need a function who can remove duplicates from lists based on a specified field.

To be clear..
* Let say we have a database with 100000 rows, companys.
* Some of those companys has been imported twice or more.
* Filter out all except those duplicates based on a specific field (f.e. "CorpID" a textfield who only takes integers as input mask).

We want to be able to delete (and count) every duplicates except one.
* It would be excellent if there is possible to control what duplicate to keep (f.e. keep the duplicate with last date in XX field).


This question is based on http://www.experts-exchange.com/Databases/MS_Access/Q_22034007.html who was solved with excellence.
0
Comment
Question by:dingir
  • 10
  • 9
  • 3
  • +1
24 Comments
 
LVL 65

Expert Comment

by:rockiroads
ID: 17793915
U could try something like this, based on a query like this

SELECT CorpID, Count(CorpID)
FROM mytable
GROUP BY CorpID
HAVING COUNT(CorpID) > 1

e.g.

public function ReturnCount(byval sField as String) as Long

    Dim sSql as String
    Dim rs as DAO.Recordset

    sSql = "SELECT " & sField & ", Count(" & sField & ") as Cnt " & _
              "FROM mytable " & _
              "GROUP BY " & sField & " " & _
              "HAVING COUNT(" & sField & ") > 1"
    set rs = currentdb.openrecordset(rs)
    if rs.EOF = True then
        ReturnCount = 0
    Else
        ReturnCount = rs!Cnt
    End if
    rs.close
    set rs=nothing
end function


Now add this into a module somewhere and call it
e.g.
Dim lCnt as Long

lCnt = ReturnCount("CorpID")

0
 
LVL 77

Expert Comment

by:peter57r
ID: 17793919
Hi dingir,

The most common route to deleting duplicates is to append the whole table to a new empty copy of the table containing the correct unique indexes.
This rejects all the duplicates.
If you sort the append query on date Descending it should then retain the latest record.

Pete
0
 
LVL 65

Expert Comment

by:rockiroads
ID: 17793945
Next step then I guess is the delete

If the count > 0 then u can perform another query and delete
e.g.

public function DelDups(byval sCriteria as String)

    Dim rs as DAO.Recordset
    Dim sSql as String

    sSql = "SELECT * FROM mytable WHERE " & sCriteria
    set rs=currentdb.openrecordset(sSql)
    if rs.eof = false then rs.movelast
    do while rs.bof = False
        rs.Delete
        rs.moveprevious
    loop
    rs.close
    set rs=nothing
end function


Now u can call this by doing this

DelDups ("CorpDate <> #01/0/1/2001#")

Simply call passing in criteria


Note, backup your table/db first!!!
0
 
LVL 9

Expert Comment

by:mpmccarthy
ID: 17794007
This will create a query of all duplicate values.  Lets call it dupQry.  As usual don't forget to back up table before running this.

SELECT OldList.*
WHERE (((OldList.CorpID) In (SELECT [CorpID] FROM [OldList] As Tmp GROUP BY [CorpID] HAVING Count(*)>1 )))
ORDER BY OldList.CorpID;

Now run query based on previous query results:

DELETE dupQry.*, dupQry.DateFieldName
FROM dupQry
WHERE (((dupQry.DateFieldName)<>DMax("DateFieldName","dupQry","[CorpID]=" & [CorpID])));
0
 
LVL 1

Author Comment

by:dingir
ID: 17794052
Hi! thank's all.

mpmccarthy.
Knowing sql inside out must feels like being god? :)
Your solution is really simple and (nearly) working! Like the previous question I need a little adjustment.

The Select query works fine when typed "SELECT * FROM OldList" instead of "SELECT OldList.*."
I've got a type mismatch error running the delete query. May it be the data type of CorpID? (Who's PeOrgnummer in real). Do you have any advice? The errors looks it looping for every matching duplicated row - i needed to close the database through ctrl + alt + del =p.
0
 
LVL 9

Expert Comment

by:mpmccarthy
ID: 17794278
If CorpID is a text field then change to:

DELETE dupQry.*, dupQry.DateFieldName
FROM dupQry
WHERE (((dupQry.DateFieldName)<>DMax("DateFieldName","dupQry","[CorpID]='" & [CorpID] & "'")));


>Knowing sql inside out must feels like being god? :)
I wish I knew a lot more than I do.  I still make stupid mistakes.
 

0
 
LVL 1

Author Comment

by:dingir
ID: 17796310
Hi Mp!

Ah, same procedure as with DLookup of course.
What should I try if Microsoft Access hangs itself just after started the query?

The progressbar is filling up four dots and stopped there.
It doesn't have with time to run the query to do, I leaved the computer for lunch and it was still at four dots :). Same scenario on one of our servers, a LOT more CPU.
0
 
LVL 9

Expert Comment

by:mpmccarthy
ID: 17796607
Essentially it's going to take time doing it this way.

You are running a query (the DMax) against every record in 1000000 records.

Only other option is to use recordsets like:

Public Function DeleteDuplicates()
Dim db As DAO.Database
Dim rs As RecordSet
Dim MaxDate As Date

Set db = CurrentDb
Set rs = db.OpenRecordset("dupQry")

rs.MoveFirst

Do Until rs.EOF
  MaxDate=DMax("DateFieldName","dupQry","[CorpID]='" & Me.[CorpID] & "'")
  If rs!DateFieldName<>MaxDate Then
    rs.Delete
  End If
  rs.MoveNext
Loop
 
rs.close
set rs = Nothing
set db = Nothing

End Function
0
 
LVL 1

Author Comment

by:dingir
ID: 17799387
Hi! I thought so.. Then we are near rockiroads solution.
I will test that recordset tomorrow or thursday.

Is this only because I want to control what of those duplicates to keep?
And if I don't care about what duplicate being deleted? so long just one post will be keeped. - As a opposite, because in some cases that doesn't matter.
0
 
LVL 9

Expert Comment

by:mpmccarthy
ID: 17800402
It doesn't matter which record you are keeping you still have to run some query to identify it.
0
 
LVL 1

Author Comment

by:dingir
ID: 17810952
Well ehrm than i'm afraid that's not an answer on my question :-) :-/.
Two seconds to search/delete each post, 85000 posts. 85000 * 2 is about 47 hours - A night only have 8 hours :).


: Any other solutions?

rockiroad
I've also tried your easy typed function but that solution delete all duplicates and does not leave one left. If there are three identical posts (except counter, of course) only two of them should be removed (the duplicates). After those two are removed and just one is still left, there no duplicates any more.

0
 
LVL 77

Expert Comment

by:peter57r
ID: 17811130
If you use the technique I described it will take about 10 seconds.

Pete
0
What Should I Do With This Threat Intelligence?

Are you wondering if you actually need threat intelligence? The answer is yes. We explain the basics for creating useful threat intelligence.

 
LVL 1

Author Comment

by:dingir
ID: 17811182
Would be very helped with a code sample
0
 
LVL 77

Expert Comment

by:peter57r
ID: 17811249
There is no code involved.

Pete
0
 
LVL 9

Expert Comment

by:mpmccarthy
ID: 17813796
What Pete is talking about (no offense intended Pete ...) is a good suggestion.

Create a copy (structure only) of the table.  Then in design view of the new table change the Indexed property of the CorpID text field to Yes (No Duplicates).

Now create an append query:

INSERT INTO NewTableName
SELECT OldTableName.*
FROM OldTableName
ORDER BY OldTableName.DateXX DESC;

This will throw an error saying some records can't be appended but that's what you want so continue.

If this works the points should go to Pete, it was his suggestion.
0
 
LVL 1

Author Comment

by:dingir
ID: 17904769
Hi mpmccarthy! Of course. You putted it little easier for my head, though.

Both of you:
 This solutions means that I can't remove the duplicates from an existing table? Did I read the syntax correct?
 Do I need to create a new table for the result to be saved?
 If this should be used my situation, the duplicates needs to be removed from the table where they already exist. All those records has values from joined tables.


Please correct me if im wrong..
0
 
LVL 9

Expert Comment

by:mpmccarthy
ID: 17911999
BACKUP the table first!!!!
------------------------------

Try the following. First add a field to the table called temp with a true/false datatype. You can delete it later.


Public Function DeleteDuplicate()
Dim db As DAO.Database
Dim rs As RecordSet
Dim sCorpID As String

    Set db = CurrentDb
    Set rs = db.OpenRecordset("dupQry")

StartFile:

    rs.MoveFirst
    rs.FindFirst rs!temp = False
    If rs.NoMatch then
        GoTo EndFile
    End If

    sCorpID=rs!CorpID
    MaxDate=DMax("DateFieldName","dupQry","[CorpID]='" & sCorpID & "'")
    Do Until rs.EOF
        If rs!DateFieldName<>MaxDate Then
            rs.Delete
        Else
            rs!temp = True
        End If
        rs.MoveNext
    Loop

    GoTo StartFile
 
EndFile:

    rs.close
    set rs = Nothing
    set db = Nothing

End Function
0
 
LVL 9

Expert Comment

by:mpmccarthy
ID: 17912004
Sorry

Replace duplicate query with the original table name in recordset and Dmax statement
0
 
LVL 1

Author Comment

by:dingir
ID: 17913056
Hi mpmccarthy

Got error 3251 and on an other try 3070. Can't retype the error message exactly because our messages are in Swedish. the Err.Number is same, though. It's about that the function doesn't support that type of object. The error 3070 says that Jet can't identify a True as a valid fieldname or expression. Both errors on row:

    rs.FindFirst rs!Temp = False

I've attached a yes/no named Temp. also replaced dupQry with the SQL Syntax for the table including the related table.
I tried with a 0 instead of False and also tried with ADODB. Unfortunately i don't handle the ADODB corresponding of findfirst and .nomatch
0
 
LVL 9

Expert Comment

by:mpmccarthy
ID: 17914089
Try

rs.FindFirst rs!Temp = No  ...   and rs!temp = Yes further down the code

or

rs.FindFirst rs!Temp = "No"  ...   and rs!temp = "Yes" further down the code

Mary
0
 
LVL 1

Author Comment

by:dingir
ID: 17920595
Hi mpmccarthy!
Uhm no.. I don't think that error it's about the ye/no/true/false/0/-1 thing. I've tried those two anyway and, as I though, same error appear. As I understand of the error the methof FindFirst can't accept a expression. At least not with that format, but I'm sure that I'm the one who now less :-).
0
 
LVL 1

Author Comment

by:dingir
ID: 17920858
Hey mcmccarthy!

I solved that problem with the syntax    
  rs.FindFirst "[Temp] = " & 0 & ""
  rs.FindFirst "[Temp] = " & -1 & ""

Then I got something to work. This code doesn't tell me how it's selecting if both date are same, also a datefield is slower then a integer field. I replaced the datefield and checking the index-field instead. Keeping the one with highest number, was my though. I past my exact code below (with correct fieldnames).

I've debug.printed all peorgnummers going through the delete routine. That process tooked 5 seconds, after that a time glass appear for at least 20 minutes. I breaked the code and after that my table has 1 post left. Nice, not! :(. Lucky to have test environments :p.


Public Function DeleteDuplicate()
   
    Dim sCorpID As String
    Dim MaxDate As Integer
   
    Dim db As DAO.Database
    Dim rs As Recordset

    Set db = CurrentDb
    Set rs = db.OpenRecordset("SELECT * FROM Nystartade", dbOpenDynaset)
       
StartFile:

    rs.MoveFirst
    rs.FindFirst "[Temp] = " & 0 & ""


    If rs.NoMatch Then
        GoTo EndFile
    End If

    sCorpID = rs!peorgnummer
    MaxDate = DMax("Nr", "Nystartade", "[PeorgNummer]='" & sCorpID & "'")

    Do Until rs.EOF
        If rs!Nr <> MaxDate Then
            Debug.Print rs!peorgnummer
            rs.Delete
        Else
            rs.FindFirst "[Temp] = " & -1 & ""
        End If
        rs.MoveNext
    Loop

    GoTo StartFile

EndFile:

    rs.Close
    Set rs = Nothing
    Set db = Nothing

End Function


0
 
LVL 9

Accepted Solution

by:
mpmccarthy earned 500 total points
ID: 17923623
Change the Yes/No Field to a True/False field.

Is 'Nr' field a number datatype?

Public Function DeleteDuplicate()
   
    Dim sCorpID As String
    Dim MaxNo As Integer
    Dim dupRec As Boolean
    Dim db As DAO.Database
    Dim rs As Recordset

    Set db = CurrentDb
    Set rs = db.OpenRecordset("SELECT * FROM Nystartade ORDER BY [Nr];", dbOpenDynaset) ' may speed it up
       
StartFile:

    rs.MoveFirst
    rs.FindFirst "[Temp]=False"

    If rs.NoMatch Then
        GoTo EndFile
    End If

    sCorpID = rs!peorgnummer
    MaxNo = DMax("Nr", "Nystartade", "[PeorgNummer]='" & sCorpID & "'")

    If rs!Nr <> MaxNo Then
        Debug.Print rs!peorgnummer
        rs.Delete
    Else
        rs!Temp=True
    End If

    dupRec=True
    Do Until dupRec=False
        rs.FindNext "[PeorgNummer]='" & sCorpID & "'"
        If rs.NoMatch Then
            dupRec=False
        Else
            dupRec=True
            If rs!Nr <> MaxNo Then
                Debug.Print rs!peorgnummer
                rs.Delete
            Else
                rs!Temp=True
            End If          
        End If
    Loop

    GoTo StartFile

EndFile:

    rs.Close
    Set rs = Nothing
    Set db = Nothing

End Function
0
 
LVL 1

Author Comment

by:dingir
ID: 17930543
mpmccarthy! Impressive! I just changed the MaxNo to a Long instead of Integer. It works as expected, deleting the "oldest" of the duplicates". The function takes some time but fairly well, based on what effect it does! Very Nice! would double the points if I could :-).
0

Featured Post

Highfive + Dolby Voice = No More Audio Complaints!

Poor audio quality is one of the top reasons people don’t use video conferencing. Get the crispest, clearest audio powered by Dolby Voice in every meeting. Highfive and Dolby Voice deliver the best video conferencing and audio experience for every meeting and every room.

Join & Write a Comment

Experts-Exchange is a great place to come for help with solutions for your database issues, and many problems are resolved within minutes of being posted.  Others take a little more time and effort and often providing a sample database is very helpf…
Overview: This article:       (a) explains one principle method to cross-reference invoice items in Quickbooks®       (b) explores the reasons one might need to cross-reference invoice items       (c) provides a sample process for creating a M…
As developers, we are not limited to the functions provided by the VBA language. In addition, we can call the functions that are part of the Windows operating system. These functions are part of the Windows API (Application Programming Interface). U…
What’s inside an Access Desktop Database. Will look at the basic interface, Navigation Pane (Database Container), Tables, Queries, Forms, Report, Macro’s, and VBA code.

708 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

17 Experts available now in Live!

Get 1:1 Help Now