Link to home
Start Free TrialLog in
Avatar of MM_OK
MM_OK

asked on

SQL Query to find duplicates with different scenarios

I have a SQL table with many fields. I need to find duplicates in them but using different scenarios. I would like to use a single query to check all dupe conditions at once. I am not sure how to do that. The query should check one by one condition and give me the number of dupes.

Please find the attached spreadsheet where I have laid out my sample data with dupe check conditions.

Hope my explanation is clear.

Your help is greatly appreciated.

Thanks,
MM_OK
Dupe-check.xls
Avatar of Brendt Hess
Brendt Hess
Flag of United States of America image

Here is a start.  It assumes that there is an ID field on the table, so that the non-equality check can be made easily.  If this is not true, you need to identify a certain method that ensures that you are not comparing a row to itself.

This does not yet handle that oddball strip alpha version of the bk/pg check.

SELECT *
FROM MyTable t1
INNER JOIN My Table t2
    ON t1.St = t2.St
    AND t1.Country = t2.Country
    AND t1.File = t2.File
WHERE (
    t1.bk = t2.bk
    AND t1.pg = t2.pg
    )
    OR
    (t1.DocNo = t2.DocNo)
    OR
    (
    t1.DocType = t2.DocType
    AND t1.pin = t2.pin
    )
   AND t1.id <> t2.id   ---
Avatar of MM_OK
MM_OK

ASKER

Thanks for your response.

I tried to apply the your against my table with primary key and it is taking for ever to run.

Also I tweaked some fields to run the same against 'Views' and again it's running for ever. In the WHERE clause I even asked the query to look at week's worth of data. For the amount of data in the database, it should not take such long time. Am I missing something? Also I need the result in a report form.

Please help.
Avatar of MM_OK

ASKER

I tried using the query to test each condition separately, it works fine. Only when seeing 'OR', it does not like.  Any clues?
ASKER CERTIFIED SOLUTION
Avatar of WesWilson
WesWilson
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
To make this efficient, you'll probably need to add indexes to the tables for the fields being compared.  AT A MINIMUM, you need an index on St, County, and File (the three common fields).  Including the other data elements in the index would also help.  An index such as this:

CREATE INDEX ix_dupchk ON MyTable (St, County, File) INCLUDE (bk, pg, docno, doctype, pin)

would almost certainly speed the processing significantly.
Why is it slow?  an OR condition is inherently slow IF the data to be checked is not well indexed, and/or the total data exceeeds allocated memory.  More than any other, an OR tries to hold items in memory to sort/compare.  If you run out of free memory, paging starts, and things slow down massively.

WesWilson's approach is certainly viable, but I would output a bit more data from it, such as the type of match.  With that simple added information, you can run a query to handle the duplicates, e.g.:

SELECT MAX(id) Id,
    'Bk-Pg' as Matchtype
FROM MyTable
GROUP BY St, County, File, BK,PG
HAVING count(1) > 1
UNION
SELECT MAX(id) Id,
    'DocNo' as Matchtype
FROM MyTable
GROUP BY St, County, File, DocNo
HAVING count(1) > 1
UNION
SELECT MAX(id) Id,
    'DocType-Pin' as Matchtype
FROM MyTable
GROUP BY St, County, File, DocType, PIN
HAVING count(1) > 1
Avatar of MM_OK

ASKER

Thanks WesWilson. The SQL works. Can you please help me with 'stripping alpha' from 'Bk' field.?

Thanks bhess1 as well. I am  not allowed to use 'Create' function (as you mentioned to create Index).
There has to be a better way than this nested REPLACE function to eliminate alphabetic characters, but this works.

Now the first case covers both the first and last case in your spreadsheet.

SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File, replace(replace(replace(replace(replace(
		replace(replace(replace(replace(replace(
		replace(replace(replace(replace(replace(
		replace(replace(replace(replace(replace(
		replace(replace(replace(replace(replace(
		replace(BK,'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G','')
		,'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P','')
		,'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')
,PG
HAVING count(1) > 1
UNION
SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File, DocNo
HAVING count(1) > 1
UNION
SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File, DocType, PIN
HAVING count(1) > 1

Open in new window

Avatar of MM_OK

ASKER

This Query produces the same result as before and the reason being after eliminating the alpha character we introduce space and the other field to which we check has no preceding space.
If you just want a count of duplicates, you could turn this SELECT into a SELECT INTO query to place the ids in a temp table, then select the count of rows in the temp table and return that.
Then you can just nest one more REPLACE for a space.

Note that we are replacing the characters with an empty string, not with a space. I assume there are some spaces in the data that doesn't have alphabetic characters.
Avatar of MM_OK

ASKER

Is there any other way to skip alpha than creating temp table? I have no experience using SELECT INTO function or Creating Temp table. Please help. TIA.
We can skip the alpha characters without selecting into a temp table. For now, let me know if this does what you need. If it does and you want to get just a count instead of the IDs of the duplicates, we can take that as the next step.

SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File, replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(BK,' ',''),'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G','')
                ,'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P','')
                ,'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')
,PG
HAVING count(1) > 1
UNION
SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File, DocNo
HAVING count(1) > 1
UNION
SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File, DocType, PIN
HAVING count(1) > 1

Open in new window

Avatar of MM_OK

ASKER

>> WesWilson:Then you can just nest one more REPLACE for a space.

Note that we are replacing the characters with an empty string, not with a space. I assume there are some spaces in the data that doesn't have alphabetic characters.

Example:
BK               BK
A100           100

So, the Replace function will skip A in A100 to get 100 and will be compared with the other 100 to dupe out.
But I am not seeing this in my result.
Avatar of MM_OK

ASKER

The latest SQL using additional REPLACE function is also not giving me the expected result. I get the same result as before.

I am fine with getting the IDs as Duplicates.
Hmmm... I have a test working for me on SQL 2008. I haven't checked against 2005 yet. Which version are you using?
Avatar of MM_OK

ASKER

I am using SQL Server 2008.
Please post the data this isn't working for. Maybe that will help.
Avatar of MM_OK

ASKER

Please find the attached spreadsheet. Thanks for your time.
Sample-for-dupe-check.xls
Question:  Will there only be one leading character in all cases?  or is this undefined, and there could be any number of leading or trailing characters?
Avatar of MM_OK

ASKER

Currently it is one leading alpha, but not sure of future data, so for safer side I could say 'any number of leading characters'. Would the query differ for both the situations?
If there was only one leading character in all of these cases, we would not need to check for each individual REPLACE statement. Instead, we could simply chop off the leading character for the compare. Not having trailing characters also makes this simpler in some ways.  Try this version:

SELECT MAX(id)
FROM MyTable
GROUP BY St, County, File,
    CASE
        WHEN IsNumeric(BK) = 0    -- leading character
            THEN Substring(BK, PATINDEX('%[0123456789]%', BK), Len(BK))
        ELSE BK
    END
    ,PG
HAVING count(1) > 1

If this still does not match then - using your sample data file and the IDs you provided in your sample:

SELECT ':' + BK + ':'
FROM MyTable
WHERE ID In (61410, 61451, 79961)
I created a table with an int id column and varchar(50) columns for everything else. I inserted the data in your spreadsheet, and it works correctly in identifying the duplicate.

I did have to put brackets around File, but otherwise had no problems.

I don't know why this isn't working for you. What's the data type for Bk column?
Avatar of MM_OK

ASKER

WesWilson: 'Bk' field is Varchar type.

bhess1: I get the same result using your query.
Can you post the output of the query:

SELECT ':' + BK + ':' as BK,
    Len(BK) as lbk
FROM MyTable
WHERE ID In (61410, 61451, 79961)

(assuming the IDs are correct, matching ones from your data)

We need to see if there are any leading or trailing characters, hidden or not....
Avatar of MM_OK

ASKER

BK      lbk
:D8708:      5
:D8708:      5
:8708:      4
Avatar of MM_OK

ASKER

>>WesWilson:I created a table with an int id column and varchar(50) columns for everything else. I inserted the data in your spreadsheet, and it works correctly in identifying the duplicate.

I did have to put brackets around File, but otherwise had no problems.

I don't know why this isn't working for you. What's the data type for Bk column?

>>
I followed your query to the 'T' but no luck. I know it is weekend and it is insane to bother you.
When you get time, please help. Thanks again.
Please run this and let me know what you get back. Maybe it will show what the difference is that the query is finding:

SELECT MAX(id), St, County, [File], replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(BK,' ',''),'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G','')
                ,'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P','')
                ,'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')
,PG
FROM MyTable
GROUP BY St, County, [File], replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(BK,' ',''),'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G','')
                ,'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P','')
                ,'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')
,PG
HAVING count(1) > 1

Open in new window

Avatar of MM_OK

ASKER

Sorry for late response.

Yes, This query does strip off the alpha in BK field and showing me just the number.

Once again my scenario is like this:

ID ST COUNTY FILE BK  PG




So that query is showing two entries that appear identical, correct?

Please verify that the other fields are actually identical--no extra spaces for example.
Avatar of MM_OK

ASKER

ID               ST             COUNTY                 FILE                BK                  PG
61410        MI              IONIA                        DE                  8708              210
61451        MI               IONIA                        DE                 D8708            210
79961        MI               IONIA                        DE                 D8708            210
 
Expected result:
Dupe ID: 79961  and 61410
Avatar of MM_OK

ASKER

My careless mistake.

It is showing me only one entry with 'alpha' stripping. Not the other one with no alpha.
I see. That is what it would do--just return 79961. It will only return one id per set of duplicates. Once you remove 79961, then it would show you 61451. Is this acceptable behavior?

If not, would adding a field for number of duplicates be sufficient?

Or do we need to actually list all of the ids?
Avatar of MM_OK

ASKER

I think probably adding a field for number of duplicates would be sufficient? In that case would it show 2 dupes for the latest scenario I mentioned?
Here's the last query with a duplicate count added.

SELECT MAX(id), St, County, [File], replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(BK,' ',''),'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G','')
                ,'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P','')
                ,'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')
,PG, COUNT(1) as DuplicateCount
FROM MyTable
GROUP BY St, County, [File], replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(replace(replace(replace(
                replace(replace(BK,' ',''),'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G','')
                ,'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P','')
                ,'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')
,PG
HAVING count(1) > 1

Open in new window

I should note that DuplicateCount includes all matching records, including the original. If you don't want to include the original in the count, change it to:

COUNT(1)-1 as DuplicateCount
Avatar of MM_OK

ASKER

Thanks Wilson.

Can this (last query) be integrated with the first query that you helped me with all the 3 scenarios? I tried but pulling the same results as before. I will run these separately. Thanks for all your support and time.

I will close the thread if it fine with you.
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of MM_OK

ASKER

1) I ran this query as such and see both IDs appearing on the result. But the duplicate count against the ID 61410 is 2 and the dupe count against ID 79961 is 1.

2) As I am dealing with huge volume of data, I introduced WHERE Clause in all the Select Statements and
I am seeing only one of three IDs.
Since we are running three separate queries and doing a union to bring them together, some could appear in more than one query, and thus be listed more than once.
Avatar of MM_OK

ASKER

Thanks a bunch for helping me to get through this SQL. Thanks for your time.