Solved

SQL Query to Find Duplicate Customers

Posted on 2012-03-23
12
371 Views
Last Modified: 2012-03-23
I have a table that contains all customers. Over the years we have noticed an increased number of duplicates. I would like to run a query to find these duplicates.

Example Table Column I want to query:

ID - NAME
1 - John Doe Jr Trucking
2 - john doe jr trucking
3 - John Doe Jr. Trucking
4 - ABC Trucking
5 - Avery Shipping, LLC
6 - Avery Shipping LLC

My thinking is I need to strip out all commas, periods and spaces. Then uppercase or lowercase to have the possibility of receiving a duplicate.I would like to have a query that returns ID's 1,2,3,5 and 6. Cuz all of those actually contain duplicates.

Would this be possible ?
0
Comment
Question by:nbtnotes
  • 7
  • 4
12 Comments
 
LVL 6

Expert Comment

by:netjgrnaut
ID: 37757156
Question: is the server/database collation case-insensitive?  Makes a difference in how much work we have to do to find duplicates of differing cases.

Assuming no case sensitivity, this will get you a list of duplicate customer names along with a count of how many times.

select replace(replace(Name, ',', ''), '.', ''), count(*) from CustomerTable
group by replace(replace(Name, ',', ''), '.', '') having count(*) > 1

Open in new window


So then building the list is easy...

select * from CustomerTable where replace(replace(Name, ',', ''), '.', '') in
(select replace(replace(Name, ',', ''), '.', ''), count(*) from CustomerTable
group by replace(replace(Name, ',', ''), '.', '') having count(*) > 1)

Open in new window

0
 
LVL 6

Expert Comment

by:netjgrnaut
ID: 37757172
The tricky part is knowing which ones are OK to remove.

Assuming that this isn't the only table in your database, other tables that reference customers will (should?) do so using the ID - not the name.

So you'll need to update all such tables to use the ID you're going to keep (to tie all the customer data to a single entry in the customer table) before dropping any duplicates from the customer table.

CAUTION (and be sure to have a backup)
0
 

Author Comment

by:nbtnotes
ID: 37757198
I won't be removing the duplicates. I will flag those as being archive because they are duplicates. All references will be retained by the ID.
0
Best Practices: Disaster Recovery Testing

Besides backup, any IT division should have a disaster recovery plan. You will find a few tips below relating to the development of such a plan and to what issues one should pay special attention in the course of backup planning.

 
LVL 6

Expert Comment

by:netjgrnaut
ID: 37757203
Good plan.

Let me know if the SELECT I provided needs tweaking.  I had to make some assumptions about table and column names, but I think you get the gist...
0
 
LVL 6

Expert Comment

by:netjgrnaut
ID: 37757214
select replace(replace(Name, ',', ''), '.', ''), * from CustomerTable 
where replace(replace(Name, ',', ''), '.', '') in
(select replace(replace(Name, ',', ''), '.', ''), count(*) from CustomerTable
group by replace(replace(Name, ',', ''), '.', '') having count(*) > 1)
order by replace(replace(Name, ',', ''), '.', ''),  ID

Open in new window


That'll be a bit more readable, and you'll see the "normalized" key used to identify the duplicates.
0
 

Author Comment

by:nbtnotes
ID: 37757256
I'm new to SQL and want to make sure. The replace function only replace the values for this query and not the actual tables values ?
0
 
LVL 6

Expert Comment

by:netjgrnaut
ID: 37757288
Yes.

To alter data in a table, you must use UPDATE (or INSERT).

SELECT alone does not make changes.
0
 

Author Comment

by:nbtnotes
ID: 37757321
The following Query received an error:

SELECT     TOP (100) PERCENT REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS Expr1, CustomerNumber
FROM         dbo.Customer
WHERE     (REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') IN
                          (SELECT     REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS Expr1, COUNT(CustomerNumber) AS Expr2
                            FROM          dbo.Customer AS Customer_1
                            GROUP BY REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '')
                            HAVING      (COUNT(CustomerNumber) > 1)))
ORDER BY Expr1, CustomerNumber

Open in new window


SQL Query Error
0
 
LVL 7

Assisted Solution

by:Lalit Chandra
Lalit Chandra earned 75 total points
ID: 37757390
You should not return two column within the subQuery as sql is aspecting a single value,but the inner query is returning two row... which is the root couse for your Error.

SELECT     TOP (100) PERCENT REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS Expr1, CustomerNumber
FROM         dbo.Customer
WHERE     (REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') IN
                          (SELECT     REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS Expr1, COUNT(CustomerNumber) AS Expr2
                            FROM          dbo.Customer AS Customer_1
                            GROUP BY REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '')
                            HAVING      (COUNT(CustomerNumber) > 1)))
ORDER BY Expr1, CustomerNumber

You want to use two  cols in the subquery then you must use EXISTS keyword in the WHERE Clause.

Hope this will resolve your issue.
0
 
LVL 6

Accepted Solution

by:
netjgrnaut earned 425 total points
ID: 37757425
Whoops... forgot to nest a SELECT to get the DupMatch

I've replaced the SQL auto-values with names that parse better (to me, at least)...

SELECT REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS DupMatch, CustomerNumber, Name 
FROM  dbo.Customer
WHERE REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') IN
  (SELECT DupMatch FROM
    (SELECT     REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS DupMatch, 
     COUNT(*) AS NumDups
    FROM dbo.Customer
    GROUP BY REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '')
    HAVING (COUNT(*) > 1)))
ORDER BY REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', ''), Name

Open in new window


I notice you're replacing empty spaces now, too.  I'd be careful with that.  It will replace *all* the empty spaces.  Are you perhaps trying to replace '  ' with ' ' (two empty spaces with one)?
0
 
LVL 6

Expert Comment

by:netjgrnaut
ID: 37757432
...also notice that I've added the actual Name to the results, so you can visually verify that the duplicate match makes sense.
0
 

Author Closing Comment

by:nbtnotes
ID: 37758842
Thanks for the help. Everything is working now.
0

Featured Post

Three Reasons Why Backup is Strategic

Backup is strategic to your business because your data is strategic to your business. Without backup, your business will fail. This white paper explains why it is vital for you to design and immediately execute a backup strategy to protect 100 percent of your data.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
convert in derived column 7 27
SSIS Conditional Split 7 31
SQL Server: Unable to remove duplicate sets in Header/Detail 6 22
SQL Query assistance 16 21
Everyone has problem when going to load data into Data warehouse (EDW). They all need to confirm that data quality is good but they don't no how to proceed. Microsoft has provided new task within SSIS 2008 called "Data Profiler Task". It solve th…
Ever wondered why sometimes your SQL Server is slow or unresponsive with connections spiking up but by the time you go in, all is well? The following article will show you how to install and configure a SQL job that will send you email alerts includ…
Via a live example, show how to extract information from SQL Server on Database, Connection and Server properties
Viewers will learn how to use the SELECT statement in SQL to return specific rows and columns, with various degrees of sorting and limits in place.

785 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question