Solved

SQL Query to Find Duplicate Customers

Posted on 2012-03-23
Last Modified: 2012-03-23
I have a table that contains all customers. Over the years we have noticed an increased number of duplicates. I would like to run a query to find these duplicates.

Example Table Column I want to query:

ID - NAME
1 - John Doe Jr Trucking
2 - john doe jr trucking
3 - John Doe Jr. Trucking
4 - ABC Trucking
5 - Avery Shipping, LLC
6 - Avery Shipping LLC

My thinking is that I need to strip out all commas, periods, and spaces, then convert everything to uppercase or lowercase so that duplicates can match. I would like a query that returns IDs 1, 2, 3, 5, and 6, because all of those have duplicates.

Would this be possible?
Question by:nbtnotes

Expert Comment

by:netjgrnaut
Question: is the server/database collation case-insensitive?  Makes a difference in how much work we have to do to find duplicates of differing cases.

Assuming no case sensitivity, this will get you a list of duplicate customer names along with a count of how many times each one occurs.

select replace(replace(Name, ',', ''), '.', ''), count(*) from CustomerTable
group by replace(replace(Name, ',', ''), '.', '') having count(*) > 1
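
If the collation turns out to be case-sensitive, or if spaces should be ignored as well (as the question suggests), the normalizing expression can be extended. A minimal sketch, assuming SQL Server 2008 or later (for the multi-row VALUES) and a hypothetical table variable @Customer loaded with the sample rows from the question:

```sql
-- Hypothetical sample data copied from the question; @Customer is illustrative only.
DECLARE @Customer TABLE (ID int, Name varchar(100));

INSERT INTO @Customer (ID, Name) VALUES
    (1, 'John Doe Jr Trucking'),
    (2, 'john doe jr trucking'),
    (3, 'John Doe Jr. Trucking'),
    (4, 'ABC Trucking'),
    (5, 'Avery Shipping, LLC'),
    (6, 'Avery Shipping LLC');

-- Strip commas, periods, and spaces, then upper-case so the comparison
-- does not depend on the collation's case sensitivity.
SELECT UPPER(REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '')) AS DupMatch,
       COUNT(*) AS NumDups
FROM @Customer
GROUP BY UPPER(REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', ''))
HAVING COUNT(*) > 1;
```

On a case-insensitive collation the UPPER call is redundant but harmless. With the sample rows, this should group IDs 1, 2, and 3 under one key and IDs 5 and 6 under another, leaving ID 4 out.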

So then building the list is easy...

select * from CustomerTable where replace(replace(Name, ',', ''), '.', '') in
(select replace(replace(Name, ',', ''), '.', ''), count(*) from CustomerTable
group by replace(replace(Name, ',', ''), '.', '') having count(*) > 1)

Expert Comment

by:netjgrnaut
The tricky part is knowing which ones are OK to remove.

Assuming that this isn't the only table in your database, other tables that reference customers will (should?) do so using the ID - not the name.

So you'll need to update all such tables to use the ID you're going to keep (to tie all the customer data to a single entry in the customer table) before dropping any duplicates from the customer table.

CAUTION (and be sure to have a backup)

Author Comment

by:nbtnotes
I won't be removing the duplicates. I will flag those as being archive because they are duplicates. All references will be retained by the ID.
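
One way the flagging could look, as a rough sketch only: it assumes a hypothetical bit column IsArchived on the customer table (not mentioned in this thread), reuses the assumed CustomerTable, ID, and Name names from the earlier comments, and arbitrarily keeps the lowest ID in each group as the active record. ROW_NUMBER requires SQL Server 2005 or later.

```sql
BEGIN TRANSACTION;

-- Number the rows within each normalized-name group; row 1 is the one kept active.
WITH Ranked AS (
    SELECT ID,
           IsArchived,  -- hypothetical flag column
           ROW_NUMBER() OVER (
               PARTITION BY REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '')
               ORDER BY ID
           ) AS RowNum
    FROM CustomerTable
)
UPDATE Ranked
SET IsArchived = 1
WHERE RowNum > 1;  -- every duplicate after the first in its group

-- Inspect the result before making it permanent.
SELECT ID, Name, IsArchived FROM CustomerTable ORDER BY Name, ID;

ROLLBACK TRANSACTION;  -- switch to COMMIT TRANSACTION once the output looks right
```

Which row in each group should stay active is a business decision; ordering by ID is just a placeholder for that rule.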

Expert Comment

by:netjgrnaut
Good plan.

Let me know if the SELECT I provided needs tweaking.  I had to make some assumptions about table and column names, but I think you get the gist...

Expert Comment

by:netjgrnaut
select replace(replace(Name, ',', ''), '.', ''), * from CustomerTable 
where replace(replace(Name, ',', ''), '.', '') in
(select replace(replace(Name, ',', ''), '.', ''), count(*) from CustomerTable
group by replace(replace(Name, ',', ''), '.', '') having count(*) > 1)
order by replace(replace(Name, ',', ''), '.', ''),  ID

That'll be a bit more readable, and you'll see the "normalized" key used to identify the duplicates.

Author Comment

by:nbtnotes
I'm new to SQL and want to make sure: the REPLACE function only replaces the values for this query, not the actual table's values?

Expert Comment

by:netjgrnaut
Yes.

To alter data in a table, you must use a data-modification statement such as UPDATE (or INSERT or DELETE).

SELECT alone does not make changes.
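
A small sketch that illustrates the point, using a hypothetical table variable @Customer so nothing real is touched:

```sql
-- Hypothetical one-row table, for illustration only.
DECLARE @Customer TABLE (ID int, Name varchar(100));
INSERT INTO @Customer (ID, Name) VALUES (5, 'Avery Shipping, LLC');

-- REPLACE computes a new value for the result set only.
SELECT ID, REPLACE(Name, ',', '') AS Cleaned FROM @Customer;

-- The stored value is unchanged: Name still contains the comma.
SELECT ID, Name FROM @Customer;

-- Only a data-modification statement would change what is stored, e.g.:
-- UPDATE @Customer SET Name = REPLACE(Name, ',', '');
```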

Author Comment

by:nbtnotes
The following Query received an error:

SELECT     TOP (100) PERCENT REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS Expr1, CustomerNumber
FROM         dbo.Customer
WHERE     (REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') IN
                          (SELECT     REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS Expr1, COUNT(CustomerNumber) AS Expr2
                            FROM          dbo.Customer AS Customer_1
                            GROUP BY REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '')
                            HAVING      (COUNT(CustomerNumber) > 1)))
ORDER BY Expr1, CustomerNumber

SQL Query Error

Assisted Solution

by:Lalit Chandra
Lalit Chandra earned 75 total points
You should not return two columns from the subquery. With IN, SQL expects the subquery to return a single column, but your inner query returns two, which is the root cause of your error.

SELECT     TOP (100) PERCENT REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS Expr1, CustomerNumber
FROM         dbo.Customer
WHERE     (REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') IN
                          (SELECT     REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS Expr1
                            FROM          dbo.Customer AS Customer_1
                            GROUP BY REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '')
                            HAVING      (COUNT(CustomerNumber) > 1)))
ORDER BY Expr1, CustomerNumber

If you want to use two columns in the subquery, then you must use the EXISTS keyword in the WHERE clause.

Hope this will resolve your issue.
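
A hedged sketch of what an EXISTS version might look like, assuming CustomerNumber uniquely identifies a row in dbo.Customer (the aliases c and d are illustrative):

```sql
-- Return every customer for which some OTHER customer has the same normalized name.
SELECT REPLACE(REPLACE(REPLACE(c.Name, ',', ''), '.', ''), ' ', '') AS DupMatch,
       c.CustomerNumber,
       c.Name
FROM dbo.Customer AS c
WHERE EXISTS (
    SELECT 1  -- the select list is irrelevant to EXISTS; only row existence matters
    FROM dbo.Customer AS d
    WHERE REPLACE(REPLACE(REPLACE(d.Name, ',', ''), '.', ''), ' ', '')
        = REPLACE(REPLACE(REPLACE(c.Name, ',', ''), '.', ''), ' ', '')
      AND d.CustomerNumber <> c.CustomerNumber
)
ORDER BY DupMatch, c.CustomerNumber;
```

Because EXISTS only tests whether the subquery returns any row, the subquery may list any number of columns without triggering the error that IN produces.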

Accepted Solution

by:netjgrnaut
netjgrnaut earned 425 total points
Whoops... forgot to nest a SELECT to get the DupMatch

I've replaced the SQL auto-values with names that parse better (to me, at least)...

SELECT REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS DupMatch, CustomerNumber, Name 
FROM  dbo.Customer
WHERE REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') IN
  (SELECT DupMatch FROM
    -- derived table: one row per normalized name that occurs more than once
    (SELECT     REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS DupMatch, 
     COUNT(*) AS NumDups
    FROM dbo.Customer
    GROUP BY REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '')
    HAVING (COUNT(*) > 1)) AS Dups)  -- SQL Server requires an alias on a derived table
ORDER BY REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', ''), Name

I notice you're replacing empty spaces now, too.  I'd be careful with that.  It will replace *all* the empty spaces.  Are you perhaps trying to replace '  ' with ' ' (two empty spaces with one)?
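
Since the query above repeats the normalizing expression several times, an alternative is to compute it once and count with a window function. This is only a sketch of a different technique, not the query used in this thread; it assumes SQL Server 2005 or later (for common table expressions and COUNT(*) OVER), and the CTE names Normalized and Counted are made up for illustration:

```sql
WITH Normalized AS (
    -- Compute the comparison key once per row.
    SELECT CustomerNumber,
           Name,
           REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS DupMatch
    FROM dbo.Customer
),
Counted AS (
    -- Attach to each row the number of rows sharing its key.
    SELECT CustomerNumber,
           Name,
           DupMatch,
           COUNT(*) OVER (PARTITION BY DupMatch) AS NumDups
    FROM Normalized
)
SELECT DupMatch, CustomerNumber, Name, NumDups
FROM Counted
WHERE NumDups > 1
ORDER BY DupMatch, Name;
```

If the normalization rule changes later (for example, collapsing double spaces instead of removing all spaces), only the single expression in Normalized needs to be edited.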

Expert Comment

by:netjgrnaut
...also notice that I've added the actual Name to the results, so you can visually verify that the duplicate match makes sense.

Author Closing Comment

by:nbtnotes
Thanks for the help. Everything is working now.