Solved

SQL Query to Find Duplicate Customers

Posted on 2012-03-23
12
381 Views
Last Modified: 2012-03-23
I have a table that contains all customers. Over the years we have noticed an increased number of duplicates. I would like to run a query to find these duplicates.

Example Table Column I want to query:

ID - NAME
1 - John Doe Jr Trucking
2 - john doe jr trucking
3 - John Doe Jr. Trucking
4 - ABC Trucking
5 - Avery Shipping, LLC
6 - Avery Shipping LLC

My thinking is I need to strip out all commas, periods and spaces. Then uppercase or lowercase to have the possibility of receiving a duplicate.I would like to have a query that returns ID's 1,2,3,5 and 6. Cuz all of those actually contain duplicates.

Would this be possible ?
0
Comment
Question by:nbtnotes
  • 7
  • 4
12 Comments
 
LVL 6

Expert Comment

by:netjgrnaut
ID: 37757156
Question: is the server/database collation case-insensitive?  Makes a difference in how much work we have to do to find duplicates of differing cases.

Assuming no case sensitivity, this will get you a list of duplicate customer names along with a count of how many times.

select replace(replace(Name, ',', ''), '.', ''), count(*) from CustomerTable
group by replace(replace(Name, ',', ''), '.', '') having count(*) > 1

Open in new window


So then building the list is easy...

select * from CustomerTable where replace(replace(Name, ',', ''), '.', '') in
(select replace(replace(Name, ',', ''), '.', ''), count(*) from CustomerTable
group by replace(replace(Name, ',', ''), '.', '') having count(*) > 1)

Open in new window

0
 
LVL 6

Expert Comment

by:netjgrnaut
ID: 37757172
The tricky part is knowing which ones are OK to remove.

Assuming that this isn't the only table in your database, other tables that reference customers will (should?) do so using the ID - not the name.

So you'll need to update all such tables to use the ID you're going to keep (to tie all the customer data to a single entry in the customer table) before dropping any duplicates from the customer table.

CAUTION (and be sure to have a backup)
0
 

Author Comment

by:nbtnotes
ID: 37757198
I won't be removing the duplicates. I will flag those as being archive because they are duplicates. All references will be retained by the ID.
0
Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
LVL 6

Expert Comment

by:netjgrnaut
ID: 37757203
Good plan.

Let me know if the SELECT I provided needs tweaking.  I had to make some assumptions about table and column names, but I think you get the gist...
0
 
LVL 6

Expert Comment

by:netjgrnaut
ID: 37757214
select replace(replace(Name, ',', ''), '.', ''), * from CustomerTable 
where replace(replace(Name, ',', ''), '.', '') in
(select replace(replace(Name, ',', ''), '.', ''), count(*) from CustomerTable
group by replace(replace(Name, ',', ''), '.', '') having count(*) > 1)
order by replace(replace(Name, ',', ''), '.', ''),  ID

Open in new window


That'll be a bit more readable, and you'll see the "normalized" key used to identify the duplicates.
0
 

Author Comment

by:nbtnotes
ID: 37757256
I'm new to SQL and want to make sure. The replace function only replace the values for this query and not the actual tables values ?
0
 
LVL 6

Expert Comment

by:netjgrnaut
ID: 37757288
Yes.

To alter data in a table, you must use UPDATE (or INSERT).

SELECT alone does not make changes.
0
 

Author Comment

by:nbtnotes
ID: 37757321
The following Query received an error:

SELECT     TOP (100) PERCENT REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS Expr1, CustomerNumber
FROM         dbo.Customer
WHERE     (REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') IN
                          (SELECT     REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS Expr1, COUNT(CustomerNumber) AS Expr2
                            FROM          dbo.Customer AS Customer_1
                            GROUP BY REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '')
                            HAVING      (COUNT(CustomerNumber) > 1)))
ORDER BY Expr1, CustomerNumber

Open in new window


SQL Query Error
0
 
LVL 7

Assisted Solution

by:Lalit Chandra
Lalit Chandra earned 75 total points
ID: 37757390
You should not return two column within the subQuery as sql is aspecting a single value,but the inner query is returning two row... which is the root couse for your Error.

SELECT     TOP (100) PERCENT REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS Expr1, CustomerNumber
FROM         dbo.Customer
WHERE     (REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') IN
                          (SELECT     REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS Expr1, COUNT(CustomerNumber) AS Expr2
                            FROM          dbo.Customer AS Customer_1
                            GROUP BY REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '')
                            HAVING      (COUNT(CustomerNumber) > 1)))
ORDER BY Expr1, CustomerNumber

You want to use two  cols in the subquery then you must use EXISTS keyword in the WHERE Clause.

Hope this will resolve your issue.
0
 
LVL 6

Accepted Solution

by:
netjgrnaut earned 425 total points
ID: 37757425
Whoops... forgot to nest a SELECT to get the DupMatch

I've replaced the SQL auto-values with names that parse better (to me, at least)...

SELECT REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS DupMatch, CustomerNumber, Name 
FROM  dbo.Customer
WHERE REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') IN
  (SELECT DupMatch FROM
    (SELECT     REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS DupMatch, 
     COUNT(*) AS NumDups
    FROM dbo.Customer
    GROUP BY REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '')
    HAVING (COUNT(*) > 1)))
ORDER BY REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', ''), Name

Open in new window


I notice you're replacing empty spaces now, too.  I'd be careful with that.  It will replace *all* the empty spaces.  Are you perhaps trying to replace '  ' with ' ' (two empty spaces with one)?
0
 
LVL 6

Expert Comment

by:netjgrnaut
ID: 37757432
...also notice that I've added the actual Name to the results, so you can visually verify that the duplicate match makes sense.
0
 

Author Closing Comment

by:nbtnotes
ID: 37758842
Thanks for the help. Everything is working now.
0

Featured Post

Forrester Webinar: xMatters Delivers 261% ROI

Guest speaker Dean Davison, Forrester Principal Consultant, explains how a Fortune 500 communication company using xMatters found these results: Achieved a 261% ROI, Experienced $753,280 in net present value benefits over 3 years and Reduced MTTR by 91% for tier 1 incidents.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

For both online and offline retail, the cross-channel business is the most recent pattern in the B2C trade space.
This article shows gives you an overview on SQL Server 2016 row level security. You will also get to know the usages of row-level-security and how it works
This video shows, step by step, how to configure Oracle Heterogeneous Services via the Generic Gateway Agent in order to make a connection from an Oracle session and access a remote SQL Server database table.
Via a live example, show how to setup several different housekeeping processes for a SQL Server.

730 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question