SQL Query to find duplicates with different scenarios

I have a SQL table with many fields. I need to find duplicates in them but using different scenarios. I would like to use a single query to check all dupe conditions at once. I am not sure how to do that. The query should check one by one condition and give me the number of dupes.

Please find the attached spreadsheet where I have laid out my sample data with dupe check conditions.

Hope my explanation is clear.

Your help is greatly appreciated.

Thanks,
MM_OK
Dupe-check.xls
MM_OKAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Brendt HessSenior DBACommented:
Here is a start.  It assumes that there is an ID field on the table, so that the non-equality check can be made easily.  If this is not true, you need to identify a certain method that ensures that you are not comparing a row to itself.

This does not yet handle that oddball strip alpha version of the bk/pg check.

SELECT *
FROM MyTable t1
INNER JOIN My Table t2
    ON t1.St = t2.St
    AND t1.Country = t2.Country
    AND t1.File = t2.File
WHERE (
    t1.bk = t2.bk
    AND t1.pg = t2.pg
    )
    OR
    (t1.DocNo = t2.DocNo)
    OR
    (
    t1.DocType = t2.DocType
    AND t1.pin = t2.pin
    )
   AND t1.id <> t2.id   ---
0
MM_OKAuthor Commented:
Thanks for your response.

I tried to apply the your against my table with primary key and it is taking for ever to run.

Also I tweaked some fields to run the same against 'Views' and again it's running for ever. In the WHERE clause I even asked the query to look at week's worth of data. For the amount of data in the database, it should not take such long time. Am I missing something? Also I need the result in a report form.

Please help.
0
MM_OKAuthor Commented:
I tried using the query to test each condition separately, it works fine. Only when seeing 'OR', it does not like.  Any clues?
0
Cloud Class® Course: Certified Penetration Testing

This CPTE Certified Penetration Testing Engineer course covers everything you need to know about becoming a Certified Penetration Testing Engineer. Career Path: Professional roles include Ethical Hackers, Security Consultants, System Administrators, and Chief Security Officers.

WesWilsonCommented:
The self-join is expensive. Here's another approach, using the same assumptions as bhess1's solution. (A unique ID exists.) This will give you a list of IDs where there is at least one duplicate of this record.

I haven't included the strip alpha case. If this works well for you, I'll work on that case.

SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File, BK,PG
HAVING count(1) > 1
UNION
SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File, DocNo
HAVING count(1) > 1
UNION
SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File, DocType, PIN
HAVING count(1) > 1

Open in new window

0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
Brendt HessSenior DBACommented:
To make this efficient, you'll probably need to add indexes to the tables for the fields being compared.  AT A MINIMUM, you need an index on St, County, and File (the three common fields).  Including the other data elements in the index would also help.  An index such as this:

CREATE INDEX ix_dupchk ON MyTable (St, County, File) INCLUDE (bk, pg, docno, doctype, pin)

would almost certainly speed the processing significantly.
0
Brendt HessSenior DBACommented:
Why is it slow?  an OR condition is inherently slow IF the data to be checked is not well indexed, and/or the total data exceeeds allocated memory.  More than any other, an OR tries to hold items in memory to sort/compare.  If you run out of free memory, paging starts, and things slow down massively.

WesWilson's approach is certainly viable, but I would output a bit more data from it, such as the type of match.  With that simple added information, you can run a query to handle the duplicates, e.g.:

SELECT MAX(id) Id,
    'Bk-Pg' as Matchtype
FROM MyTable
GROUP BY St, County, File, BK,PG
HAVING count(1) > 1
UNION
SELECT MAX(id) Id,
    'DocNo' as Matchtype
FROM MyTable
GROUP BY St, County, File, DocNo
HAVING count(1) > 1
UNION
SELECT MAX(id) Id,
    'DocType-Pin' as Matchtype
FROM MyTable
GROUP BY St, County, File, DocType, PIN
HAVING count(1) > 1
0
MM_OKAuthor Commented:
Thanks WesWilson. The SQL works. Can you please help me with 'stripping alpha' from 'Bk' field.?

Thanks bhess1 as well. I am  not allowed to use 'Create' function (as you mentioned to create Index).
0
WesWilsonCommented:
There has to be a better way than this nested REPLACE function to eliminate alphabetic characters, but this works.

Now the first case covers both the first and last case in your spreadsheet.

SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File, replace(replace(replace(replace(replace(
		replace(replace(replace(replace(replace(
		replace(replace(replace(replace(replace(
		replace(replace(replace(replace(replace(
		replace(replace(replace(replace(replace(
		replace(BK,'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G','')
		,'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P','')
		,'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')
,PG
HAVING count(1) > 1
UNION
SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File, DocNo
HAVING count(1) > 1
UNION
SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File, DocType, PIN
HAVING count(1) > 1

Open in new window

0
MM_OKAuthor Commented:
This Query produces the same result as before and the reason being after eliminating the alpha character we introduce space and the other field to which we check has no preceding space.
0
WesWilsonCommented:
If you just want a count of duplicates, you could turn this SELECT into a SELECT INTO query to place the ids in a temp table, then select the count of rows in the temp table and return that.
0
WesWilsonCommented:
Then you can just nest one more REPLACE for a space.

Note that we are replacing the characters with an empty string, not with a space. I assume there are some spaces in the data that doesn't have alphabetic characters.
0
MM_OKAuthor Commented:
Is there any other way to skip alpha than creating temp table? I have no experience using SELECT INTO function or Creating Temp table. Please help. TIA.
0
WesWilsonCommented:
We can skip the alpha characters without selecting into a temp table. For now, let me know if this does what you need. If it does and you want to get just a count instead of the IDs of the duplicates, we can take that as the next step.

SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File, replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(BK,' ',''),'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G','')
                ,'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P','')
                ,'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')
,PG
HAVING count(1) > 1
UNION
SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File, DocNo
HAVING count(1) > 1
UNION
SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File, DocType, PIN
HAVING count(1) > 1

Open in new window

0
MM_OKAuthor Commented:
>> WesWilson:Then you can just nest one more REPLACE for a space.

Note that we are replacing the characters with an empty string, not with a space. I assume there are some spaces in the data that doesn't have alphabetic characters.

Example:
BK               BK
A100           100

So, the Replace function will skip A in A100 to get 100 and will be compared with the other 100 to dupe out.
But I am not seeing this in my result.
0
MM_OKAuthor Commented:
The latest SQL using additional REPLACE function is also not giving me the expected result. I get the same result as before.

I am fine with getting the IDs as Duplicates.
0
WesWilsonCommented:
Hmmm... I have a test working for me on SQL 2008. I haven't checked against 2005 yet. Which version are you using?
0
MM_OKAuthor Commented:
I am using SQL Server 2008.
0
WesWilsonCommented:
Please post the data this isn't working for. Maybe that will help.
0
MM_OKAuthor Commented:
Please find the attached spreadsheet. Thanks for your time.
Sample-for-dupe-check.xls
0
Brendt HessSenior DBACommented:
Question:  Will there only be one leading character in all cases?  or is this undefined, and there could be any number of leading or trailing characters?
0
MM_OKAuthor Commented:
Currently it is one leading alpha, but not sure of future data, so for safer side I could say 'any number of leading characters'. Would the query differ for both the situations?
0
Brendt HessSenior DBACommented:
If there was only one leading character in all of these cases, we would not need to check for each individual REPLACE statement. Instead, we could simply chop off the leading character for the compare. Not having trailing characters also makes this simpler in some ways.  Try this version:

SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File,
    CASE
        WHEN IsNumeric(BK) = 0    -- leading character
            THEN Substring(BK, PATINDEX('%[0123456789]%', BK), Len(BK))
        ELSE BK
    END
    ,PG
HAVING count(1) > 1

If this still does not match then - using your sample data file and the IDs you provided in your sample:

SELECT ':' + BK + ':'
FROM MyTable
WHERE ID In (61410, 61451, 79961)
0
WesWilsonCommented:
I created a table with an int id column and varchar(50) columns for everything else. I inserted the data in your spreadsheet, and it works correctly in identifying the duplicate.

I did have to put brackets around File, but otherwise had no problems.

I don't know why this isn't working for you. What's the data type for Bk column?
0
MM_OKAuthor Commented:
WesWilson: 'Bk' field is Varchar type.

bhess1: I get the same result using your query.
0
Brendt HessSenior DBACommented:
Can you post the output of the query:

SELECT ':' + BK + ':' as BK,
    Len(BK) as lbk
FROM MyTable
WHERE ID In (61410, 61451, 79961)

(assuming the IDs are correct, matching ones from your data)

We need to see if there are any leading or trailing characters, hidden or not....
0
MM_OKAuthor Commented:
BK      lbk
:D8708:      5
:D8708:      5
:8708:      4
0
MM_OKAuthor Commented:
>>WesWilson:I created a table with an int id column and varchar(50) columns for everything else. I inserted the data in your spreadsheet, and it works correctly in identifying the duplicate.

I did have to put brackets around File, but otherwise had no problems.

I don't know why this isn't working for you. What's the data type for Bk column?

>>
I followed your query to the 'T' but no luck. I know it is weekend and it is insane to bother you.
When you get time, please help. Thanks again.
0
WesWilsonCommented:
Please run this and let me know what you get back. Maybe it will show what the difference is that the query is finding:

SELECT MAX(id), St, County, [File], replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(BK,' ',''),'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G','')
                ,'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P','')
                ,'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')
,PG
FROM MyTable
GROUP BY St, County, [File], replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(BK,' ',''),'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G','')
                ,'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P','')
                ,'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')
,PG
HAVING count(1) > 1

Open in new window

0
MM_OKAuthor Commented:
Sorry for late response.

Yes, This query does strip off the alpha in BK field and showing me just the number.

Once again my scenario is like this:

ID ST COUNTY FILE BK  PG




0
WesWilsonCommented:
So that query is showing two entries that appear identical, correct?

Please verify that the other fields are actually identical--no extra spaces for example.
0
MM_OKAuthor Commented:
ID               ST             COUNTY                 FILE                BK                  PG
61410        MI              IONIA                        DE                  8708              210
61451        MI               IONIA                        DE                 D8708            210
79961        MI               IONIA                        DE                 D8708            210
 
Expected result:
Dupe ID: 79961  and 61410
0
MM_OKAuthor Commented:
My careless mistake.

It is showing me only one entry with 'alpha' stripping. Not the other one with no alpha.
0
WesWilsonCommented:
I see. That is what it would do--just return 79961. It will only return one id per set of duplicates. Once you remove 79961, then it would show you 61451. Is this acceptable behavior?

If not, would adding a field for number of duplicates be sufficient?

Or do we need to actually list all of the ids?
0
MM_OKAuthor Commented:
I think probably adding a field for number of duplicates would be sufficient? In that case would it show 2 dupes for the latest scenario I mentioned?
0
WesWilsonCommented:
Here's the last query with a duplicate count added.

SELECT MAX(id), St, County, [File], replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(BK,' ',''),'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G','')
                ,'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P','')
                ,'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')
,PG, COUNT(1) as DuplicateCount
FROM MyTable
GROUP BY St, County, [File], replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(BK,' ',''),'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G','')
                ,'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P','')
                ,'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')
,PG
HAVING count(1) > 1

Open in new window

0
WesWilsonCommented:
I should note that DuplicateCount includes all matching records, including the original. If you don't want to include the original in the count, change it to:

COUNT(1)-1 as DuplicateCount
0
MM_OKAuthor Commented:
Thanks Wilson.

Can this (last query) be integrated with the first query that you helped me with all the 3 scenarios? I tried but pulling the same results as before. I will run these separately. Thanks for all your support and time.

I will close the thread if it fine with you.
0
WesWilsonCommented:
Let me know if there are any problems with this.

SELECT MAX(id), COUNT(1)-1 AS DuplicateCount
FROM MyTable
GROUP BY St, County, [File], replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(BK,' ',''),'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G','')
                ,'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P','')
                ,'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')
,PG
HAVING count(1) > 1
UNION
SELECT MAX(id), COUNT(1)-1 AS DuplicateCount
FROM MyTable
GROUP BY St, County, [File], DocNo
HAVING count(1) > 1
UNION
SELECT MAX(id), COUNT(1)-1 AS DuplicateCount
FROM MyTable
GROUP BY St, County, [File], DocType, PIN
HAVING count(1) > 1

Open in new window

0
MM_OKAuthor Commented:
1) I ran this query as such and see both IDs appearing on the result. But the duplicate count against the ID 61410 is 2 and the dupe count against ID 79961 is 1.

2) As I am dealing with huge volume of data, I introduced WHERE Clause in all the Select Statements and
I am seeing only one of three IDs.
0
WesWilsonCommented:
Since we are running three separate queries and doing a union to bring them together, some could appear in more than one query, and thus be listed more than once.
0
MM_OKAuthor Commented:
Thanks a bunch for helping me to get through this SQL. Thanks for your time.
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Microsoft SQL Server 2008

From novice to tech pro — start learning today.