?
Solved

Finding duplicate records using a cursor

Posted on 2003-03-14
8
Medium Priority
?
382 Views
Last Modified: 2012-08-13
Basic problem:  Get all fields (or some fields) of ALL records where a selected set of fields is duplicated.

I have a database which will have about 300,000 records (88 fields each) which I can naturally break down into groups of about 10,000 records, if that is helpful.  I need to get the partially duplicated records.  By "partially duplicated" I mean "pick any 2-4 fields to be the key (such as LastName+FirstName+PostalCode) and return all fields of all records (not just 1 per group) where that "costructed key" is duplicated".  There is a unique-per-record field called Seq, if that helps, but it will never be part of my "constructed key".

So, to give an example, my data might be (in Canada)

Seq First Last   Postal Birth    Colour
2   John  Smith  K2S1A3 03/01/01 Red
8   Mary  Smith  J3Y0K6 60/04/17 Blue
27  John  Smith  K2S1A3 55/02/02 Green
45  Marty Black  K2S1A3 98/10/21 Green

and if I decide my key is Last+First+Postal, then I would like the output
2   John  Smith  K2S1A3 03/01/01 Red
27  John  Smith  K2S1A3 55/02/02 Blue

or if my key is Colour+Postal, then the output would be
27  John  Smith  K2S1A3 55/02/02 Green
45  Marty Black  K2S1A3 98/10/21 Green

Now I have done this with the LEFT OUTER JOIN
   SELECT *
   FROM myDatabase
   WHERE (Seq IN (SELECT DISTINCT t1.Seq
      FROM myDatabase AS t1 LEFT OUTER JOIN myDatabase AS t2
      ON t1.Last = t2.Last AND t1.First = t2.First AND
         t1.Postal = t2.Postal WHERE NOT ( t1.Seq = t2.Seq ) ) )
   ORDER BY Last_Name, First_Name
but over 10K records that is so slow it often times out.

What I would like to do is have a simple SELECT statement like
   SELECT * FROM myDatabase
   WHERE <restrictions here to reduce set size below 10K>
   ORDER BY <constructed key>
and then somehow use the following pseudocode to output all the records whose contructed key is duplicated:

// Get the first record
prevRecord = <first record in cursor>

// Assume the first record will not be duplicated
keepPrev = FALSE

// Get the second record
currentRecord = <next record in cursor>

WHILE ( currentRecord IS NOT <past end of cursor> AND
        we have not "kept" more than <50> records)
{
    // Do we need to output the current record in the next round?
    keepCurrent = ( currectRecord-constructed-key ==
                             prevRecord-constructed-key )  

    // Either previous match or current match requires putting out the record
    IF ( keepPrev OR keepCurrent )
        KEEP/OUTPUT prevRecord
    else
        DISCARD/DO NOT OUTPUT prevRecord

    // Current record is about to become the previous record
    keepPrev = keepCurrent

    // This may not be necessary if I can just do
    // "get last record" for the OUTPUT statement above
    prevRecord = currentRecord

    // Move on to the next record
    currentRecord = <next record in cursor>
}

// OUTPUT the last record we reviewed if necessary
IF ( <not more than 50 record kept> AND keepPrev )
    KEEP/OUTPUT prevRecord
else
    DISCARD/DO NOT OUTPUT prevRecord
DISCARD/DO NOT OUTPUT all the rest of the records


So my request is, please give me the SQL code for doing the above, if it is possible using an SQL cursor or something similar)?  (I have been trying to get the LEFT OUTER JOIN and similar other solutions to work all week, but cannot get them efficient enough.)

I use MS SQL Server but I have one restriction. I can only send a single string to the server because the request is sent through an intermediary process which submits the request and sends back the set of database records returned.

So, can the above be done by one long statement with semi-colons?

Thanks!
   Mark Batten-Carew (PKI Security is my specialty)
   markbc@paulmartin.ca


0
Comment
Question by:markbc
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
8 Comments
 
LVL 7

Expert Comment

by:TroyK
ID: 8139675
Mark;

I briefly scanned your question, and it looks like you may be doing some data scrubbing for a mailing or something similar.

Check out www.dataflux.com or www.melissadata.com. They have software that may be of use to you.

I doubt that a cursor is going to give you better performance than a set-based solution. If you can constrain your "constructed key" columns list, you may be able to implement some covering indexes to speed performance.

HTH,
TroyK, MCSD
0
 
LVL 69

Accepted Solution

by:
Scott Pletcher earned 1000 total points
ID: 8140244
The sub-query method is probably very inefficient for this, especially with LOJ.  Maybe try the IJ method:


SELECT *
  FROM myTable
  INNER JOIN (
    SELECT last, first, postal
    FROM myTable
    GROUP BY last, first, postal
    HAVING COUNT(*) > 1
) AS myTableDups ON myTable.last = myTableDups.last AND myTable.first = myTableDups.first AND
  myTable.postal = myTableDups.postal
0
 
LVL 54

Expert Comment

by:nico5038
ID: 8142224
I would use the PassThrough query possibility.
This will send the SQL (that must be in the database SQL format!) to the database and only the results will be returned.

Performing the query in access will force access to fetch all rows to be able to select the needed ones.

You can fill the PassThrough query from code to get the needed field flexibility.

Need more info ?

Nic;o)
0
Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
LVL 4

Expert Comment

by:higginspi
ID: 8152382
You are wanting an output of duplicates?

Seq First Last   Postal Birth    Colour

select A.Seq, A.First, A.Last, A.Postal, A.Birth, A.Colour
from YourTable as A
join
(select First, Last, Postal
from YourTable
group by First, Last, Postal
having count(*)>1) as B
on A.First=B.First
and A.Last=B.Last
and A.Postal=B.Postal

should give you:
2   John  Smith  K2S1A3 03/01/01 Red
27  John  Smith  K2S1A3 55/02/02 Blue

select A.Seq, A.First, A.Last, A.Postal, A.Birth, A.Colour
from YourTable as A
join
(select Colour, Postal
from YourTable
group by Colour, Postal
having count(*)>1) as B
on A.Colour=B.Colour
and A.Postal=B.Postal

should give you
27  John  Smith  K2S1A3 55/02/02 Green
45  Marty Black  K2S1A3 98/10/21 Green
0
 
LVL 4

Expert Comment

by:higginspi
ID: 8152468
Tested above queries and they give the output you desired.
0
 
LVL 4

Expert Comment

by:higginspi
ID: 8152484
Oh, duh, just realized Scott's queries gets the duplicates.  At first glance I thought he was unduplicated people.  This is going to be a long day. :-(
0
 

Author Comment

by:markbc
ID: 8153628
TroyK, thanks for the info about data cleaning products but the types of problems in our data do not lend themselves to automatic fixing.  We really need the list of duplicates for human reconciliation.  About using indices that match the likely constrained keys, thanks - this may be a very useful way to speed up the searches since there aren't that many field combinations that will be used.

ScottPletcher, I know you didn't answer the question as asked but I will be giving you the points.  I am certain that I had already tried INNER JOIN but found that there was no way *through matching sequence numbers* to get every duplicated record.  That was my mistake.  Your inner join to count(*)>1 did the trick.  And it is at least a factor of 20 faster than the LEFT OUTER JOIN I was using (worst case inner join is 3 minutes, compared to moderate case outer join which I killed after 20 minutes, unfinished).  I have found that the length of time is VERY sensitive to the number of fields being returned per record and not so sensitive to the total number of records being searched.

Nico, I researched what PassThrough meant and I don't think I could use that solution because I have to be able to create a single string command to be passed through an intermediary application which is submitting the query and returning the results to me.  PassThrough seems to require a lot of active control over the SQL connection.

HigginsPI, thanks.  INNER JOIN to count(*)>1 was what I needed but as you noticed, Scott beat you to it.

Thanks very much everyone.  I lost 3-4 days last week trying to solve this by myself and you guys solved it effectively overnight.  Very much appreciated.

Mark Batten-Carew

0
 

Author Comment

by:markbc
ID: 8153641
Thanks, this solved my problem perfectly.
Morale:  The question being asked is not always what needs answering. -:)
0

Featured Post

What does it mean to be "Always On"?

Is your cloud always on? With an Always On cloud you won't have to worry about downtime for maintenance or software application code updates, ensuring that your bottom line isn't affected.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Ever needed a SQL 2008 Database replicated/mirrored/log shipped on another server but you can't take the downtime inflicted by initial snapshot or disconnect while T-logs are restored or mirror applied? You can use SQL Server Initialize from Backup…
I have a large data set and a SSIS package. How can I load this file in multi threading?
Via a live example, show how to set up a backup for SQL Server using a Maintenance Plan and how to schedule the job into SQL Server Agent.
Via a live example, show how to setup several different housekeeping processes for a SQL Server.
Suggested Courses

764 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question