?
Solved

SQL Query to find duplicates with different scenarios

Posted on 2010-03-26
41
Medium Priority
?
618 Views
Last Modified: 2012-05-09
I have a SQL table with many fields. I need to find duplicates in them but using different scenarios. I would like to use a single query to check all dupe conditions at once. I am not sure how to do that. The query should check one by one condition and give me the number of dupes.

Please find the attached spreadsheet where I have laid out my sample data with dupe check conditions.

Hope my explanation is clear.

Your help is greatly appreciated.

Thanks,
MM_OK
Dupe-check.xls
0
Comment
Question by:MM_OK
  • 20
  • 15
  • 6
41 Comments
 
LVL 32

Expert Comment

by:Brendt Hess
ID: 28696901
Here is a start.  It assumes that there is an ID field on the table, so that the non-equality check can be made easily.  If this is not true, you need to identify a certain method that ensures that you are not comparing a row to itself.

This does not yet handle that oddball strip alpha version of the bk/pg check.

SELECT *
FROM MyTable t1
INNER JOIN My Table t2
    ON t1.St = t2.St
    AND t1.Country = t2.Country
    AND t1.File = t2.File
WHERE (
    t1.bk = t2.bk
    AND t1.pg = t2.pg
    )
    OR
    (t1.DocNo = t2.DocNo)
    OR
    (
    t1.DocType = t2.DocType
    AND t1.pin = t2.pin
    )
   AND t1.id <> t2.id   ---
0
 

Author Comment

by:MM_OK
ID: 28699262
Thanks for your response.

I tried to apply the your against my table with primary key and it is taking for ever to run.

Also I tweaked some fields to run the same against 'Views' and again it's running for ever. In the WHERE clause I even asked the query to look at week's worth of data. For the amount of data in the database, it should not take such long time. Am I missing something? Also I need the result in a report form.

Please help.
0
 

Author Comment

by:MM_OK
ID: 28701037
I tried using the query to test each condition separately, it works fine. Only when seeing 'OR', it does not like.  Any clues?
0
Get expert help—faster!

Need expert help—fast? Use the Help Bell for personalized assistance getting answers to your important questions.

 
LVL 8

Accepted Solution

by:
WesWilson earned 2000 total points
ID: 28702417
The self-join is expensive. Here's another approach, using the same assumptions as bhess1's solution. (A unique ID exists.) This will give you a list of IDs where there is at least one duplicate of this record.

I haven't included the strip alpha case. If this works well for you, I'll work on that case.

SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File, BK,PG
HAVING count(1) > 1
UNION
SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File, DocNo
HAVING count(1) > 1
UNION
SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File, DocType, PIN
HAVING count(1) > 1

Open in new window

0
 
LVL 32

Expert Comment

by:Brendt Hess
ID: 28707492
To make this efficient, you'll probably need to add indexes to the tables for the fields being compared.  AT A MINIMUM, you need an index on St, County, and File (the three common fields).  Including the other data elements in the index would also help.  An index such as this:

CREATE INDEX ix_dupchk ON MyTable (St, County, File) INCLUDE (bk, pg, docno, doctype, pin)

would almost certainly speed the processing significantly.
0
 
LVL 32

Expert Comment

by:Brendt Hess
ID: 28707856
Why is it slow?  an OR condition is inherently slow IF the data to be checked is not well indexed, and/or the total data exceeeds allocated memory.  More than any other, an OR tries to hold items in memory to sort/compare.  If you run out of free memory, paging starts, and things slow down massively.

WesWilson's approach is certainly viable, but I would output a bit more data from it, such as the type of match.  With that simple added information, you can run a query to handle the duplicates, e.g.:

SELECT MAX(id) Id,
    'Bk-Pg' as Matchtype
FROM MyTable
GROUP BY St, County, File, BK,PG
HAVING count(1) > 1
UNION
SELECT MAX(id) Id,
    'DocNo' as Matchtype
FROM MyTable
GROUP BY St, County, File, DocNo
HAVING count(1) > 1
UNION
SELECT MAX(id) Id,
    'DocType-Pin' as Matchtype
FROM MyTable
GROUP BY St, County, File, DocType, PIN
HAVING count(1) > 1
0
 

Author Comment

by:MM_OK
ID: 28711678
Thanks WesWilson. The SQL works. Can you please help me with 'stripping alpha' from 'Bk' field.?

Thanks bhess1 as well. I am  not allowed to use 'Create' function (as you mentioned to create Index).
0
 
LVL 8

Expert Comment

by:WesWilson
ID: 28715457
There has to be a better way than this nested REPLACE function to eliminate alphabetic characters, but this works.

Now the first case covers both the first and last case in your spreadsheet.

SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File, replace(replace(replace(replace(replace(
		replace(replace(replace(replace(replace(
		replace(replace(replace(replace(replace(
		replace(replace(replace(replace(replace(
		replace(replace(replace(replace(replace(
		replace(BK,'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G','')
		,'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P','')
		,'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')
,PG
HAVING count(1) > 1
UNION
SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File, DocNo
HAVING count(1) > 1
UNION
SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File, DocType, PIN
HAVING count(1) > 1

Open in new window

0
 

Author Comment

by:MM_OK
ID: 28715911
This Query produces the same result as before and the reason being after eliminating the alpha character we introduce space and the other field to which we check has no preceding space.
0
 
LVL 8

Expert Comment

by:WesWilson
ID: 28716178
If you just want a count of duplicates, you could turn this SELECT into a SELECT INTO query to place the ids in a temp table, then select the count of rows in the temp table and return that.
0
 
LVL 8

Expert Comment

by:WesWilson
ID: 28716408
Then you can just nest one more REPLACE for a space.

Note that we are replacing the characters with an empty string, not with a space. I assume there are some spaces in the data that doesn't have alphabetic characters.
0
 

Author Comment

by:MM_OK
ID: 28716552
Is there any other way to skip alpha than creating temp table? I have no experience using SELECT INTO function or Creating Temp table. Please help. TIA.
0
 
LVL 8

Expert Comment

by:WesWilson
ID: 28716944
We can skip the alpha characters without selecting into a temp table. For now, let me know if this does what you need. If it does and you want to get just a count instead of the IDs of the duplicates, we can take that as the next step.

SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File, replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(BK,' ',''),'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G','')
                ,'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P','')
                ,'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')
,PG
HAVING count(1) > 1
UNION
SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File, DocNo
HAVING count(1) > 1
UNION
SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File, DocType, PIN
HAVING count(1) > 1

Open in new window

0
 

Author Comment

by:MM_OK
ID: 28717107
>> WesWilson:Then you can just nest one more REPLACE for a space.

Note that we are replacing the characters with an empty string, not with a space. I assume there are some spaces in the data that doesn't have alphabetic characters.

Example:
BK               BK
A100           100

So, the Replace function will skip A in A100 to get 100 and will be compared with the other 100 to dupe out.
But I am not seeing this in my result.
0
 

Author Comment

by:MM_OK
ID: 28717721
The latest SQL using additional REPLACE function is also not giving me the expected result. I get the same result as before.

I am fine with getting the IDs as Duplicates.
0
 
LVL 8

Expert Comment

by:WesWilson
ID: 28717824
Hmmm... I have a test working for me on SQL 2008. I haven't checked against 2005 yet. Which version are you using?
0
 

Author Comment

by:MM_OK
ID: 28718130
I am using SQL Server 2008.
0
 
LVL 8

Expert Comment

by:WesWilson
ID: 28718790
Please post the data this isn't working for. Maybe that will help.
0
 

Author Comment

by:MM_OK
ID: 28719964
Please find the attached spreadsheet. Thanks for your time.
Sample-for-dupe-check.xls
0
 
LVL 32

Expert Comment

by:Brendt Hess
ID: 28720693
Question:  Will there only be one leading character in all cases?  or is this undefined, and there could be any number of leading or trailing characters?
0
 

Author Comment

by:MM_OK
ID: 28720993
Currently it is one leading alpha, but not sure of future data, so for safer side I could say 'any number of leading characters'. Would the query differ for both the situations?
0
 
LVL 32

Expert Comment

by:Brendt Hess
ID: 28721416
If there was only one leading character in all of these cases, we would not need to check for each individual REPLACE statement. Instead, we could simply chop off the leading character for the compare. Not having trailing characters also makes this simpler in some ways.  Try this version:

SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File,
    CASE
        WHEN IsNumeric(BK) = 0    -- leading character
            THEN Substring(BK, PATINDEX('%[0123456789]%', BK), Len(BK))
        ELSE BK
    END
    ,PG
HAVING count(1) > 1

If this still does not match then - using your sample data file and the IDs you provided in your sample:

SELECT ':' + BK + ':'
FROM MyTable
WHERE ID In (61410, 61451, 79961)
0
 
LVL 8

Expert Comment

by:WesWilson
ID: 28721620
I created a table with an int id column and varchar(50) columns for everything else. I inserted the data in your spreadsheet, and it works correctly in identifying the duplicate.

I did have to put brackets around File, but otherwise had no problems.

I don't know why this isn't working for you. What's the data type for Bk column?
0
 

Author Comment

by:MM_OK
ID: 28721833
WesWilson: 'Bk' field is Varchar type.

bhess1: I get the same result using your query.
0
 
LVL 32

Expert Comment

by:Brendt Hess
ID: 28721980
Can you post the output of the query:

SELECT ':' + BK + ':' as BK,
    Len(BK) as lbk
FROM MyTable
WHERE ID In (61410, 61451, 79961)

(assuming the IDs are correct, matching ones from your data)

We need to see if there are any leading or trailing characters, hidden or not....
0
 

Author Comment

by:MM_OK
ID: 28722130
BK      lbk
:D8708:      5
:D8708:      5
:8708:      4
0
 

Author Comment

by:MM_OK
ID: 28731022
>>WesWilson:I created a table with an int id column and varchar(50) columns for everything else. I inserted the data in your spreadsheet, and it works correctly in identifying the duplicate.

I did have to put brackets around File, but otherwise had no problems.

I don't know why this isn't working for you. What's the data type for Bk column?

>>
I followed your query to the 'T' but no luck. I know it is weekend and it is insane to bother you.
When you get time, please help. Thanks again.
0
 
LVL 8

Expert Comment

by:WesWilson
ID: 28789700
Please run this and let me know what you get back. Maybe it will show what the difference is that the query is finding:

SELECT MAX(id), St, County, [File], replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(BK,' ',''),'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G','')
                ,'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P','')
                ,'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')
,PG
FROM MyTable
GROUP BY St, County, [File], replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(BK,' ',''),'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G','')
                ,'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P','')
                ,'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')
,PG
HAVING count(1) > 1

Open in new window

0
 

Author Comment

by:MM_OK
ID: 28988920
Sorry for late response.

Yes, This query does strip off the alpha in BK field and showing me just the number.

Once again my scenario is like this:

ID ST COUNTY FILE BK  PG




0
 
LVL 8

Expert Comment

by:WesWilson
ID: 28989192
So that query is showing two entries that appear identical, correct?

Please verify that the other fields are actually identical--no extra spaces for example.
0
 

Author Comment

by:MM_OK
ID: 28989313
ID               ST             COUNTY                 FILE                BK                  PG
61410        MI              IONIA                        DE                  8708              210
61451        MI               IONIA                        DE                 D8708            210
79961        MI               IONIA                        DE                 D8708            210
 
Expected result:
Dupe ID: 79961  and 61410
0
 

Author Comment

by:MM_OK
ID: 28989669
My careless mistake.

It is showing me only one entry with 'alpha' stripping. Not the other one with no alpha.
0
 
LVL 8

Expert Comment

by:WesWilson
ID: 28990204
I see. That is what it would do--just return 79961. It will only return one id per set of duplicates. Once you remove 79961, then it would show you 61451. Is this acceptable behavior?

If not, would adding a field for number of duplicates be sufficient?

Or do we need to actually list all of the ids?
0
 

Author Comment

by:MM_OK
ID: 28990599
I think probably adding a field for number of duplicates would be sufficient? In that case would it show 2 dupes for the latest scenario I mentioned?
0
 
LVL 8

Expert Comment

by:WesWilson
ID: 28992255
Here's the last query with a duplicate count added.

SELECT MAX(id), St, County, [File], replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(BK,' ',''),'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G','')
                ,'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P','')
                ,'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')
,PG, COUNT(1) as DuplicateCount
FROM MyTable
GROUP BY St, County, [File], replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(BK,' ',''),'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G','')
                ,'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P','')
                ,'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')
,PG
HAVING count(1) > 1

Open in new window

0
 
LVL 8

Expert Comment

by:WesWilson
ID: 28992549
I should note that DuplicateCount includes all matching records, including the original. If you don't want to include the original in the count, change it to:

COUNT(1)-1 as DuplicateCount
0
 

Author Comment

by:MM_OK
ID: 28996545
Thanks Wilson.

Can this (last query) be integrated with the first query that you helped me with all the 3 scenarios? I tried but pulling the same results as before. I will run these separately. Thanks for all your support and time.

I will close the thread if it fine with you.
0
 
LVL 8

Assisted Solution

by:WesWilson
WesWilson earned 2000 total points
ID: 29002512
Let me know if there are any problems with this.

SELECT MAX(id), COUNT(1)-1 AS DuplicateCount
FROM MyTable
GROUP BY St, County, [File], replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(BK,' ',''),'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G','')
                ,'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P','')
                ,'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')
,PG
HAVING count(1) > 1
UNION
SELECT MAX(id), COUNT(1)-1 AS DuplicateCount
FROM MyTable
GROUP BY St, County, [File], DocNo
HAVING count(1) > 1
UNION
SELECT MAX(id), COUNT(1)-1 AS DuplicateCount
FROM MyTable
GROUP BY St, County, [File], DocType, PIN
HAVING count(1) > 1

Open in new window

0
 

Author Comment

by:MM_OK
ID: 29006282
1) I ran this query as such and see both IDs appearing on the result. But the duplicate count against the ID 61410 is 2 and the dupe count against ID 79961 is 1.

2) As I am dealing with huge volume of data, I introduced WHERE Clause in all the Select Statements and
I am seeing only one of three IDs.
0
 
LVL 8

Expert Comment

by:WesWilson
ID: 29044710
Since we are running three separate queries and doing a union to bring them together, some could appear in more than one query, and thus be listed more than once.
0
 

Author Closing Comment

by:MM_OK
ID: 31707590
Thanks a bunch for helping me to get through this SQL. Thanks for your time.
0

Featured Post

Get expert help—faster!

Need expert help—fast? Use the Help Bell for personalized assistance getting answers to your important questions.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

INTRODUCTION: While tying your database objects into builds and your enterprise source control system takes a third-party product (like Visual Studio Database Edition or Red-Gate's SQL Source Control), you can achieve some protection using a sing…
In this article, we will show how to detach and attach a database and then show how to repair a corrupt database and attach it, If it has some errors. We will show how to detach and attach using SSMS or using T-SQL sentences.
SQL Database Recovery Software repairs the MDF & NDF Files, corrupted due to hardware related issues or software related errors. Provides preview of recovered database objects and allows saving in either MSSQL, CSV, HTML or XLS format. Ensures recov…
Stellar Phoenix SQL Database Repair software easily fixes the suspect mode issue of SQL Server database. It is a simple process to bring the database from suspect mode to normal mode. Check out the video and fix the SQL database suspect mode problem.
Suggested Courses

601 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question