Solved

SQL Query to Find Duplicate Customers

Posted on 2012-03-23
12
366 Views
Last Modified: 2012-03-23
I have a table that contains all customers. Over the years we have noticed an increased number of duplicates. I would like to run a query to find these duplicates.

Example Table Column I want to query:

ID - NAME
1 - John Doe Jr Trucking
2 - john doe jr trucking
3 - John Doe Jr. Trucking
4 - ABC Trucking
5 - Avery Shipping, LLC
6 - Avery Shipping LLC

My thinking is I need to strip out all commas, periods and spaces. Then uppercase or lowercase to have the possibility of receiving a duplicate.I would like to have a query that returns ID's 1,2,3,5 and 6. Cuz all of those actually contain duplicates.

Would this be possible ?
0
Comment
Question by:nbtnotes
  • 7
  • 4
12 Comments
 
LVL 6

Expert Comment

by:netjgrnaut
ID: 37757156
Question: is the server/database collation case-insensitive?  Makes a difference in how much work we have to do to find duplicates of differing cases.

Assuming no case sensitivity, this will get you a list of duplicate customer names along with a count of how many times.

select replace(replace(Name, ',', ''), '.', ''), count(*) from CustomerTable
group by replace(replace(Name, ',', ''), '.', '') having count(*) > 1

Open in new window


So then building the list is easy...

select * from CustomerTable where replace(replace(Name, ',', ''), '.', '') in
(select replace(replace(Name, ',', ''), '.', ''), count(*) from CustomerTable
group by replace(replace(Name, ',', ''), '.', '') having count(*) > 1)

Open in new window

0
 
LVL 6

Expert Comment

by:netjgrnaut
ID: 37757172
The tricky part is knowing which ones are OK to remove.

Assuming that this isn't the only table in your database, other tables that reference customers will (should?) do so using the ID - not the name.

So you'll need to update all such tables to use the ID you're going to keep (to tie all the customer data to a single entry in the customer table) before dropping any duplicates from the customer table.

CAUTION (and be sure to have a backup)
0
 

Author Comment

by:nbtnotes
ID: 37757198
I won't be removing the duplicates. I will flag those as being archive because they are duplicates. All references will be retained by the ID.
0
 
LVL 6

Expert Comment

by:netjgrnaut
ID: 37757203
Good plan.

Let me know if the SELECT I provided needs tweaking.  I had to make some assumptions about table and column names, but I think you get the gist...
0
 
LVL 6

Expert Comment

by:netjgrnaut
ID: 37757214
select replace(replace(Name, ',', ''), '.', ''), * from CustomerTable 
where replace(replace(Name, ',', ''), '.', '') in
(select replace(replace(Name, ',', ''), '.', ''), count(*) from CustomerTable
group by replace(replace(Name, ',', ''), '.', '') having count(*) > 1)
order by replace(replace(Name, ',', ''), '.', ''),  ID

Open in new window


That'll be a bit more readable, and you'll see the "normalized" key used to identify the duplicates.
0
 

Author Comment

by:nbtnotes
ID: 37757256
I'm new to SQL and want to make sure. The replace function only replace the values for this query and not the actual tables values ?
0
Microsoft Certification Exam 74-409

Veeam® is happy to provide the Microsoft community with a study guide prepared by MVP and MCT, Orin Thomas. This guide will take you through each of the exam objectives, helping you to prepare for and pass the examination.

 
LVL 6

Expert Comment

by:netjgrnaut
ID: 37757288
Yes.

To alter data in a table, you must use UPDATE (or INSERT).

SELECT alone does not make changes.
0
 

Author Comment

by:nbtnotes
ID: 37757321
The following Query received an error:

SELECT     TOP (100) PERCENT REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS Expr1, CustomerNumber
FROM         dbo.Customer
WHERE     (REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') IN
                          (SELECT     REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS Expr1, COUNT(CustomerNumber) AS Expr2
                            FROM          dbo.Customer AS Customer_1
                            GROUP BY REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '')
                            HAVING      (COUNT(CustomerNumber) > 1)))
ORDER BY Expr1, CustomerNumber

Open in new window


SQL Query Error
0
 
LVL 7

Assisted Solution

by:Lalit Chandra
Lalit Chandra earned 75 total points
ID: 37757390
You should not return two column within the subQuery as sql is aspecting a single value,but the inner query is returning two row... which is the root couse for your Error.

SELECT     TOP (100) PERCENT REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS Expr1, CustomerNumber
FROM         dbo.Customer
WHERE     (REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') IN
                          (SELECT     REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS Expr1, COUNT(CustomerNumber) AS Expr2
                            FROM          dbo.Customer AS Customer_1
                            GROUP BY REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '')
                            HAVING      (COUNT(CustomerNumber) > 1)))
ORDER BY Expr1, CustomerNumber

You want to use two  cols in the subquery then you must use EXISTS keyword in the WHERE Clause.

Hope this will resolve your issue.
0
 
LVL 6

Accepted Solution

by:
netjgrnaut earned 425 total points
ID: 37757425
Whoops... forgot to nest a SELECT to get the DupMatch

I've replaced the SQL auto-values with names that parse better (to me, at least)...

SELECT REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS DupMatch, CustomerNumber, Name 
FROM  dbo.Customer
WHERE REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') IN
  (SELECT DupMatch FROM
    (SELECT     REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS DupMatch, 
     COUNT(*) AS NumDups
    FROM dbo.Customer
    GROUP BY REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '')
    HAVING (COUNT(*) > 1)))
ORDER BY REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', ''), Name

Open in new window


I notice you're replacing empty spaces now, too.  I'd be careful with that.  It will replace *all* the empty spaces.  Are you perhaps trying to replace '  ' with ' ' (two empty spaces with one)?
0
 
LVL 6

Expert Comment

by:netjgrnaut
ID: 37757432
...also notice that I've added the actual Name to the results, so you can visually verify that the duplicate match makes sense.
0
 

Author Closing Comment

by:nbtnotes
ID: 37758842
Thanks for the help. Everything is working now.
0

Featured Post

[Webinar] Disaster Recovery and Cloud Management

Learn from Unigma and CloudBerry industry veterans which providers are best for certain use cases and how to lower cloud costs, how to grow your Managed Services practice in IaaS clouds, and how to utilize public cloud for Disaster Recovery

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Ever needed a SQL 2008 Database replicated/mirrored/log shipped on another server but you can't take the downtime inflicted by initial snapshot or disconnect while T-logs are restored or mirror applied? You can use SQL Server Initialize from Backup…
This article shows gives you an overview on SQL Server 2016 row level security. You will also get to know the usages of row-level-security and how it works
Viewers will learn how the fundamental information of how to create a table.
Viewers will learn how to use the SELECT statement in SQL to return specific rows and columns, with various degrees of sorting and limits in place.

863 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

27 Experts available now in Live!

Get 1:1 Help Now