Improve company productivity with a Business Account.Sign Up

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 351
  • Last Modified:

remove duplicates based on specified field

Question that need to be solved: We need a function who can remove duplicates from lists based on a specified field.

To be clear..
* Let say we have a database with 100000 rows, companys.
* Some of those companys has been imported twice or more.
* Filter out all except those duplicates based on a specific field (f.e. "CorpID" a textfield who only takes integers as input mask).

We want to be able to delete (and count) every duplicates except one.
* It would be excellent if there is possible to control what duplicate to keep (f.e. keep the duplicate with last date in XX field).


This question is based on http://www.experts-exchange.com/Databases/MS_Access/Q_22034007.html who was solved with excellence.
0
dingir
Asked:
dingir
  • 10
  • 9
  • 3
  • +1
1 Solution
 
rockiroadsCommented:
U could try something like this, based on a query like this

SELECT CorpID, Count(CorpID)
FROM mytable
GROUP BY CorpID
HAVING COUNT(CorpID) > 1

e.g.

public function ReturnCount(byval sField as String) as Long

    Dim sSql as String
    Dim rs as DAO.Recordset

    sSql = "SELECT " & sField & ", Count(" & sField & ") as Cnt " & _
              "FROM mytable " & _
              "GROUP BY " & sField & " " & _
              "HAVING COUNT(" & sField & ") > 1"
    set rs = currentdb.openrecordset(rs)
    if rs.EOF = True then
        ReturnCount = 0
    Else
        ReturnCount = rs!Cnt
    End if
    rs.close
    set rs=nothing
end function


Now add this into a module somewhere and call it
e.g.
Dim lCnt as Long

lCnt = ReturnCount("CorpID")

0
 
peter57rCommented:
Hi dingir,

The most common route to deleting duplicates is to append the whole table to a new empty copy of the table containing the correct unique indexes.
This rejects all the duplicates.
If you sort the append query on date Descending it should then retain the latest record.

Pete
0
 
rockiroadsCommented:
Next step then I guess is the delete

If the count > 0 then u can perform another query and delete
e.g.

public function DelDups(byval sCriteria as String)

    Dim rs as DAO.Recordset
    Dim sSql as String

    sSql = "SELECT * FROM mytable WHERE " & sCriteria
    set rs=currentdb.openrecordset(sSql)
    if rs.eof = false then rs.movelast
    do while rs.bof = False
        rs.Delete
        rs.moveprevious
    loop
    rs.close
    set rs=nothing
end function


Now u can call this by doing this

DelDups ("CorpDate <> #01/0/1/2001#")

Simply call passing in criteria


Note, backup your table/db first!!!
0
Easily Design & Build Your Next Website

Squarespace’s all-in-one platform gives you everything you need to express yourself creatively online, whether it is with a domain, website, or online store. Get started with your free trial today, and when ready, take 10% off your first purchase with offer code 'EXPERTS'.

 
mpmccarthyCommented:
This will create a query of all duplicate values.  Lets call it dupQry.  As usual don't forget to back up table before running this.

SELECT OldList.*
WHERE (((OldList.CorpID) In (SELECT [CorpID] FROM [OldList] As Tmp GROUP BY [CorpID] HAVING Count(*)>1 )))
ORDER BY OldList.CorpID;

Now run query based on previous query results:

DELETE dupQry.*, dupQry.DateFieldName
FROM dupQry
WHERE (((dupQry.DateFieldName)<>DMax("DateFieldName","dupQry","[CorpID]=" & [CorpID])));
0
 
dingirAuthor Commented:
Hi! thank's all.

mpmccarthy.
Knowing sql inside out must feels like being god? :)
Your solution is really simple and (nearly) working! Like the previous question I need a little adjustment.

The Select query works fine when typed "SELECT * FROM OldList" instead of "SELECT OldList.*."
I've got a type mismatch error running the delete query. May it be the data type of CorpID? (Who's PeOrgnummer in real). Do you have any advice? The errors looks it looping for every matching duplicated row - i needed to close the database through ctrl + alt + del =p.
0
 
mpmccarthyCommented:
If CorpID is a text field then change to:

DELETE dupQry.*, dupQry.DateFieldName
FROM dupQry
WHERE (((dupQry.DateFieldName)<>DMax("DateFieldName","dupQry","[CorpID]='" & [CorpID] & "'")));


>Knowing sql inside out must feels like being god? :)
I wish I knew a lot more than I do.  I still make stupid mistakes.
 

0
 
dingirAuthor Commented:
Hi Mp!

Ah, same procedure as with DLookup of course.
What should I try if Microsoft Access hangs itself just after started the query?

The progressbar is filling up four dots and stopped there.
It doesn't have with time to run the query to do, I leaved the computer for lunch and it was still at four dots :). Same scenario on one of our servers, a LOT more CPU.
0
 
mpmccarthyCommented:
Essentially it's going to take time doing it this way.

You are running a query (the DMax) against every record in 1000000 records.

Only other option is to use recordsets like:

Public Function DeleteDuplicates()
Dim db As DAO.Database
Dim rs As RecordSet
Dim MaxDate As Date

Set db = CurrentDb
Set rs = db.OpenRecordset("dupQry")

rs.MoveFirst

Do Until rs.EOF
  MaxDate=DMax("DateFieldName","dupQry","[CorpID]='" & Me.[CorpID] & "'")
  If rs!DateFieldName<>MaxDate Then
    rs.Delete
  End If
  rs.MoveNext
Loop
 
rs.close
set rs = Nothing
set db = Nothing

End Function
0
 
dingirAuthor Commented:
Hi! I thought so.. Then we are near rockiroads solution.
I will test that recordset tomorrow or thursday.

Is this only because I want to control what of those duplicates to keep?
And if I don't care about what duplicate being deleted? so long just one post will be keeped. - As a opposite, because in some cases that doesn't matter.
0
 
mpmccarthyCommented:
It doesn't matter which record you are keeping you still have to run some query to identify it.
0
 
dingirAuthor Commented:
Well ehrm than i'm afraid that's not an answer on my question :-) :-/.
Two seconds to search/delete each post, 85000 posts. 85000 * 2 is about 47 hours - A night only have 8 hours :).


: Any other solutions?

rockiroad
I've also tried your easy typed function but that solution delete all duplicates and does not leave one left. If there are three identical posts (except counter, of course) only two of them should be removed (the duplicates). After those two are removed and just one is still left, there no duplicates any more.

0
 
peter57rCommented:
If you use the technique I described it will take about 10 seconds.

Pete
0
 
dingirAuthor Commented:
Would be very helped with a code sample
0
 
peter57rCommented:
There is no code involved.

Pete
0
 
mpmccarthyCommented:
What Pete is talking about (no offense intended Pete ...) is a good suggestion.

Create a copy (structure only) of the table.  Then in design view of the new table change the Indexed property of the CorpID text field to Yes (No Duplicates).

Now create an append query:

INSERT INTO NewTableName
SELECT OldTableName.*
FROM OldTableName
ORDER BY OldTableName.DateXX DESC;

This will throw an error saying some records can't be appended but that's what you want so continue.

If this works the points should go to Pete, it was his suggestion.
0
 
dingirAuthor Commented:
Hi mpmccarthy! Of course. You putted it little easier for my head, though.

Both of you:
 This solutions means that I can't remove the duplicates from an existing table? Did I read the syntax correct?
 Do I need to create a new table for the result to be saved?
 If this should be used my situation, the duplicates needs to be removed from the table where they already exist. All those records has values from joined tables.


Please correct me if im wrong..
0
 
mpmccarthyCommented:
BACKUP the table first!!!!
------------------------------

Try the following. First add a field to the table called temp with a true/false datatype. You can delete it later.


Public Function DeleteDuplicate()
Dim db As DAO.Database
Dim rs As RecordSet
Dim sCorpID As String

    Set db = CurrentDb
    Set rs = db.OpenRecordset("dupQry")

StartFile:

    rs.MoveFirst
    rs.FindFirst rs!temp = False
    If rs.NoMatch then
        GoTo EndFile
    End If

    sCorpID=rs!CorpID
    MaxDate=DMax("DateFieldName","dupQry","[CorpID]='" & sCorpID & "'")
    Do Until rs.EOF
        If rs!DateFieldName<>MaxDate Then
            rs.Delete
        Else
            rs!temp = True
        End If
        rs.MoveNext
    Loop

    GoTo StartFile
 
EndFile:

    rs.close
    set rs = Nothing
    set db = Nothing

End Function
0
 
mpmccarthyCommented:
Sorry

Replace duplicate query with the original table name in recordset and Dmax statement
0
 
dingirAuthor Commented:
Hi mpmccarthy

Got error 3251 and on an other try 3070. Can't retype the error message exactly because our messages are in Swedish. the Err.Number is same, though. It's about that the function doesn't support that type of object. The error 3070 says that Jet can't identify a True as a valid fieldname or expression. Both errors on row:

    rs.FindFirst rs!Temp = False

I've attached a yes/no named Temp. also replaced dupQry with the SQL Syntax for the table including the related table.
I tried with a 0 instead of False and also tried with ADODB. Unfortunately i don't handle the ADODB corresponding of findfirst and .nomatch
0
 
mpmccarthyCommented:
Try

rs.FindFirst rs!Temp = No  ...   and rs!temp = Yes further down the code

or

rs.FindFirst rs!Temp = "No"  ...   and rs!temp = "Yes" further down the code

Mary
0
 
dingirAuthor Commented:
Hi mpmccarthy!
Uhm no.. I don't think that error it's about the ye/no/true/false/0/-1 thing. I've tried those two anyway and, as I though, same error appear. As I understand of the error the methof FindFirst can't accept a expression. At least not with that format, but I'm sure that I'm the one who now less :-).
0
 
dingirAuthor Commented:
Hey mcmccarthy!

I solved that problem with the syntax    
  rs.FindFirst "[Temp] = " & 0 & ""
  rs.FindFirst "[Temp] = " & -1 & ""

Then I got something to work. This code doesn't tell me how it's selecting if both date are same, also a datefield is slower then a integer field. I replaced the datefield and checking the index-field instead. Keeping the one with highest number, was my though. I past my exact code below (with correct fieldnames).

I've debug.printed all peorgnummers going through the delete routine. That process tooked 5 seconds, after that a time glass appear for at least 20 minutes. I breaked the code and after that my table has 1 post left. Nice, not! :(. Lucky to have test environments :p.


Public Function DeleteDuplicate()
   
    Dim sCorpID As String
    Dim MaxDate As Integer
   
    Dim db As DAO.Database
    Dim rs As Recordset

    Set db = CurrentDb
    Set rs = db.OpenRecordset("SELECT * FROM Nystartade", dbOpenDynaset)
       
StartFile:

    rs.MoveFirst
    rs.FindFirst "[Temp] = " & 0 & ""


    If rs.NoMatch Then
        GoTo EndFile
    End If

    sCorpID = rs!peorgnummer
    MaxDate = DMax("Nr", "Nystartade", "[PeorgNummer]='" & sCorpID & "'")

    Do Until rs.EOF
        If rs!Nr <> MaxDate Then
            Debug.Print rs!peorgnummer
            rs.Delete
        Else
            rs.FindFirst "[Temp] = " & -1 & ""
        End If
        rs.MoveNext
    Loop

    GoTo StartFile

EndFile:

    rs.Close
    Set rs = Nothing
    Set db = Nothing

End Function


0
 
mpmccarthyCommented:
Change the Yes/No Field to a True/False field.

Is 'Nr' field a number datatype?

Public Function DeleteDuplicate()
   
    Dim sCorpID As String
    Dim MaxNo As Integer
    Dim dupRec As Boolean
    Dim db As DAO.Database
    Dim rs As Recordset

    Set db = CurrentDb
    Set rs = db.OpenRecordset("SELECT * FROM Nystartade ORDER BY [Nr];", dbOpenDynaset) ' may speed it up
       
StartFile:

    rs.MoveFirst
    rs.FindFirst "[Temp]=False"

    If rs.NoMatch Then
        GoTo EndFile
    End If

    sCorpID = rs!peorgnummer
    MaxNo = DMax("Nr", "Nystartade", "[PeorgNummer]='" & sCorpID & "'")

    If rs!Nr <> MaxNo Then
        Debug.Print rs!peorgnummer
        rs.Delete
    Else
        rs!Temp=True
    End If

    dupRec=True
    Do Until dupRec=False
        rs.FindNext "[PeorgNummer]='" & sCorpID & "'"
        If rs.NoMatch Then
            dupRec=False
        Else
            dupRec=True
            If rs!Nr <> MaxNo Then
                Debug.Print rs!peorgnummer
                rs.Delete
            Else
                rs!Temp=True
            End If          
        End If
    Loop

    GoTo StartFile

EndFile:

    rs.Close
    Set rs = Nothing
    Set db = Nothing

End Function
0
 
dingirAuthor Commented:
mpmccarthy! Impressive! I just changed the MaxNo to a Long instead of Integer. It works as expected, deleting the "oldest" of the duplicates". The function takes some time but fairly well, based on what effect it does! Very Nice! would double the points if I could :-).
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

Join & Write a Comment

Featured Post

Get expert help—faster!

Need expert help—fast? Use the Help Bell for personalized assistance getting answers to your important questions.

  • 10
  • 9
  • 3
  • +1
Tackle projects and never again get stuck behind a technical roadblock.
Join Now