We help IT Professionals succeed at work.

We've partnered with Certified Experts, Carl Webster and Richard Faulkner, to bring you two Citrix podcasts. Learn about 2020 trends and get answers to your biggest Citrix questions!Listen Now

x

# remove duplicates based on specified field

on
Medium Priority
360 Views
Question that need to be solved: We need a function who can remove duplicates from lists based on a specified field.

To be clear..
* Let say we have a database with 100000 rows, companys.
* Some of those companys has been imported twice or more.
* Filter out all except those duplicates based on a specific field (f.e. "CorpID" a textfield who only takes integers as input mask).

We want to be able to delete (and count) every duplicates except one.
* It would be excellent if there is possible to control what duplicate to keep (f.e. keep the duplicate with last date in XX field).

This question is based on http://www.experts-exchange.com/Databases/MS_Access/Q_22034007.html who was solved with excellence.
Comment
Watch Question

## View Solution Only

CERTIFIED EXPERT
Top Expert 2006

Commented:
U could try something like this, based on a query like this

SELECT CorpID, Count(CorpID)
FROM mytable
GROUP BY CorpID
HAVING COUNT(CorpID) > 1

e.g.

public function ReturnCount(byval sField as String) as Long

Dim sSql as String
Dim rs as DAO.Recordset

sSql = "SELECT " & sField & ", Count(" & sField & ") as Cnt " & _
"FROM mytable " & _
"GROUP BY " & sField & " " & _
"HAVING COUNT(" & sField & ") > 1"
set rs = currentdb.openrecordset(rs)
if rs.EOF = True then
ReturnCount = 0
Else
ReturnCount = rs!Cnt
End if
rs.close
set rs=nothing
end function

Now add this into a module somewhere and call it
e.g.
Dim lCnt as Long

lCnt = ReturnCount("CorpID")

CERTIFIED EXPERT

Commented:
Hi dingir,

The most common route to deleting duplicates is to append the whole table to a new empty copy of the table containing the correct unique indexes.
This rejects all the duplicates.
If you sort the append query on date Descending it should then retain the latest record.

Pete
CERTIFIED EXPERT
Top Expert 2006

Commented:
Next step then I guess is the delete

If the count > 0 then u can perform another query and delete
e.g.

public function DelDups(byval sCriteria as String)

Dim rs as DAO.Recordset
Dim sSql as String

sSql = "SELECT * FROM mytable WHERE " & sCriteria
set rs=currentdb.openrecordset(sSql)
if rs.eof = false then rs.movelast
do while rs.bof = False
rs.Delete
rs.moveprevious
loop
rs.close
set rs=nothing
end function

Now u can call this by doing this

DelDups ("CorpDate <> #01/0/1/2001#")

Simply call passing in criteria

Commented:
This will create a query of all duplicate values.  Lets call it dupQry.  As usual don't forget to back up table before running this.

SELECT OldList.*
WHERE (((OldList.CorpID) In (SELECT [CorpID] FROM [OldList] As Tmp GROUP BY [CorpID] HAVING Count(*)>1 )))
ORDER BY OldList.CorpID;

Now run query based on previous query results:

DELETE dupQry.*, dupQry.DateFieldName
FROM dupQry
WHERE (((dupQry.DateFieldName)<>DMax("DateFieldName","dupQry","[CorpID]=" & [CorpID])));

Commented:
Hi! thank's all.

mpmccarthy.
Knowing sql inside out must feels like being god? :)
Your solution is really simple and (nearly) working! Like the previous question I need a little adjustment.

The Select query works fine when typed "SELECT * FROM OldList" instead of "SELECT OldList.*."
I've got a type mismatch error running the delete query. May it be the data type of CorpID? (Who's PeOrgnummer in real). Do you have any advice? The errors looks it looping for every matching duplicated row - i needed to close the database through ctrl + alt + del =p.

Commented:
If CorpID is a text field then change to:

DELETE dupQry.*, dupQry.DateFieldName
FROM dupQry
WHERE (((dupQry.DateFieldName)<>DMax("DateFieldName","dupQry","[CorpID]='" & [CorpID] & "'")));

>Knowing sql inside out must feels like being god? :)
I wish I knew a lot more than I do.  I still make stupid mistakes.

Commented:
Hi Mp!

Ah, same procedure as with DLookup of course.
What should I try if Microsoft Access hangs itself just after started the query?

The progressbar is filling up four dots and stopped there.
It doesn't have with time to run the query to do, I leaved the computer for lunch and it was still at four dots :). Same scenario on one of our servers, a LOT more CPU.

Commented:
Essentially it's going to take time doing it this way.

You are running a query (the DMax) against every record in 1000000 records.

Only other option is to use recordsets like:

Public Function DeleteDuplicates()
Dim db As DAO.Database
Dim rs As RecordSet
Dim MaxDate As Date

Set db = CurrentDb
Set rs = db.OpenRecordset("dupQry")

rs.MoveFirst

Do Until rs.EOF
MaxDate=DMax("DateFieldName","dupQry","[CorpID]='" & Me.[CorpID] & "'")
If rs!DateFieldName<>MaxDate Then
rs.Delete
End If
rs.MoveNext
Loop

rs.close
set rs = Nothing
set db = Nothing

End Function

Commented:
Hi! I thought so.. Then we are near rockiroads solution.
I will test that recordset tomorrow or thursday.

Is this only because I want to control what of those duplicates to keep?
And if I don't care about what duplicate being deleted? so long just one post will be keeped. - As a opposite, because in some cases that doesn't matter.

Commented:
It doesn't matter which record you are keeping you still have to run some query to identify it.

Commented:
Well ehrm than i'm afraid that's not an answer on my question :-) :-/.
Two seconds to search/delete each post, 85000 posts. 85000 * 2 is about 47 hours - A night only have 8 hours :).

: Any other solutions?

I've also tried your easy typed function but that solution delete all duplicates and does not leave one left. If there are three identical posts (except counter, of course) only two of them should be removed (the duplicates). After those two are removed and just one is still left, there no duplicates any more.

CERTIFIED EXPERT

Commented:
If you use the technique I described it will take about 10 seconds.

Pete

Commented:
Would be very helped with a code sample
CERTIFIED EXPERT

Commented:
There is no code involved.

Pete

Commented:
What Pete is talking about (no offense intended Pete ...) is a good suggestion.

Create a copy (structure only) of the table.  Then in design view of the new table change the Indexed property of the CorpID text field to Yes (No Duplicates).

Now create an append query:

INSERT INTO NewTableName
SELECT OldTableName.*
FROM OldTableName
ORDER BY OldTableName.DateXX DESC;

This will throw an error saying some records can't be appended but that's what you want so continue.

If this works the points should go to Pete, it was his suggestion.

Commented:
Hi mpmccarthy! Of course. You putted it little easier for my head, though.

Both of you:
This solutions means that I can't remove the duplicates from an existing table? Did I read the syntax correct?
Do I need to create a new table for the result to be saved?
If this should be used my situation, the duplicates needs to be removed from the table where they already exist. All those records has values from joined tables.

Please correct me if im wrong..

Commented:
BACKUP the table first!!!!
------------------------------

Try the following. First add a field to the table called temp with a true/false datatype. You can delete it later.

Public Function DeleteDuplicate()
Dim db As DAO.Database
Dim rs As RecordSet
Dim sCorpID As String

Set db = CurrentDb
Set rs = db.OpenRecordset("dupQry")

StartFile:

rs.MoveFirst
rs.FindFirst rs!temp = False
If rs.NoMatch then
GoTo EndFile
End If

sCorpID=rs!CorpID
MaxDate=DMax("DateFieldName","dupQry","[CorpID]='" & sCorpID & "'")
Do Until rs.EOF
If rs!DateFieldName<>MaxDate Then
rs.Delete
Else
rs!temp = True
End If
rs.MoveNext
Loop

GoTo StartFile

EndFile:

rs.close
set rs = Nothing
set db = Nothing

End Function

Commented:
Sorry

Replace duplicate query with the original table name in recordset and Dmax statement

Commented:
Hi mpmccarthy

Got error 3251 and on an other try 3070. Can't retype the error message exactly because our messages are in Swedish. the Err.Number is same, though. It's about that the function doesn't support that type of object. The error 3070 says that Jet can't identify a True as a valid fieldname or expression. Both errors on row:

rs.FindFirst rs!Temp = False

I've attached a yes/no named Temp. also replaced dupQry with the SQL Syntax for the table including the related table.
I tried with a 0 instead of False and also tried with ADODB. Unfortunately i don't handle the ADODB corresponding of findfirst and .nomatch

Commented:
Try

rs.FindFirst rs!Temp = No  ...   and rs!temp = Yes further down the code

or

rs.FindFirst rs!Temp = "No"  ...   and rs!temp = "Yes" further down the code

Mary

Commented:
Hi mpmccarthy!
Uhm no.. I don't think that error it's about the ye/no/true/false/0/-1 thing. I've tried those two anyway and, as I though, same error appear. As I understand of the error the methof FindFirst can't accept a expression. At least not with that format, but I'm sure that I'm the one who now less :-).

Commented:
Hey mcmccarthy!

I solved that problem with the syntax
rs.FindFirst "[Temp] = " & 0 & ""
rs.FindFirst "[Temp] = " & -1 & ""

Then I got something to work. This code doesn't tell me how it's selecting if both date are same, also a datefield is slower then a integer field. I replaced the datefield and checking the index-field instead. Keeping the one with highest number, was my though. I past my exact code below (with correct fieldnames).

I've debug.printed all peorgnummers going through the delete routine. That process tooked 5 seconds, after that a time glass appear for at least 20 minutes. I breaked the code and after that my table has 1 post left. Nice, not! :(. Lucky to have test environments :p.

Public Function DeleteDuplicate()

Dim sCorpID As String
Dim MaxDate As Integer

Dim db As DAO.Database
Dim rs As Recordset

Set db = CurrentDb
Set rs = db.OpenRecordset("SELECT * FROM Nystartade", dbOpenDynaset)

StartFile:

rs.MoveFirst
rs.FindFirst "[Temp] = " & 0 & ""

If rs.NoMatch Then
GoTo EndFile
End If

sCorpID = rs!peorgnummer
MaxDate = DMax("Nr", "Nystartade", "[PeorgNummer]='" & sCorpID & "'")

Do Until rs.EOF
If rs!Nr <> MaxDate Then
Debug.Print rs!peorgnummer
rs.Delete
Else
rs.FindFirst "[Temp] = " & -1 & ""
End If
rs.MoveNext
Loop

GoTo StartFile

EndFile:

rs.Close
Set rs = Nothing
Set db = Nothing

End Function

Commented:
Change the Yes/No Field to a True/False field.

Is 'Nr' field a number datatype?

Public Function DeleteDuplicate()

Dim sCorpID As String
Dim MaxNo As Integer
Dim dupRec As Boolean
Dim db As DAO.Database
Dim rs As Recordset

Set db = CurrentDb
Set rs = db.OpenRecordset("SELECT * FROM Nystartade ORDER BY [Nr];", dbOpenDynaset) ' may speed it up

StartFile:

rs.MoveFirst
rs.FindFirst "[Temp]=False"

If rs.NoMatch Then
GoTo EndFile
End If

sCorpID = rs!peorgnummer
MaxNo = DMax("Nr", "Nystartade", "[PeorgNummer]='" & sCorpID & "'")

If rs!Nr <> MaxNo Then
Debug.Print rs!peorgnummer
rs.Delete
Else
rs!Temp=True
End If

dupRec=True
Do Until dupRec=False
rs.FindNext "[PeorgNummer]='" & sCorpID & "'"
If rs.NoMatch Then
dupRec=False
Else
dupRec=True
If rs!Nr <> MaxNo Then
Debug.Print rs!peorgnummer
rs.Delete
Else
rs!Temp=True
End If
End If
Loop

GoTo StartFile

EndFile:

rs.Close
Set rs = Nothing
Set db = Nothing

End Function

Not the solution you were looking for? Getting a personalized solution is easy.

Commented:
mpmccarthy! Impressive! I just changed the MaxNo to a Long instead of Integer. It works as expected, deleting the "oldest" of the duplicates". The function takes some time but fairly well, based on what effect it does! Very Nice! would double the points if I could :-).
##### Thanks for using Experts Exchange.

• View three pieces of content (articles, solutions, posts, and videos)
• Ask the experts questions (counted toward content limit)
• Customize your dashboard and profile

OR

8+ characters (letters, numbers, and a symbol)

By clicking, you agree to the