• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 415
  • Last Modified:

SQL Query to Find Duplicate Customers

I have a table that contains all customers. Over the years we have noticed an increased number of duplicates. I would like to run a query to find these duplicates.

Example Table Column I want to query:

ID - NAME
1 - John Doe Jr Trucking
2 - john doe jr trucking
3 - John Doe Jr. Trucking
4 - ABC Trucking
5 - Avery Shipping, LLC
6 - Avery Shipping LLC

My thinking is I need to strip out all commas, periods and spaces. Then uppercase or lowercase to have the possibility of receiving a duplicate.I would like to have a query that returns ID's 1,2,3,5 and 6. Cuz all of those actually contain duplicates.

Would this be possible ?
0
nbtnotes
Asked:
nbtnotes
  • 7
  • 4
2 Solutions
 
netjgrnautCommented:
Question: is the server/database collation case-insensitive?  Makes a difference in how much work we have to do to find duplicates of differing cases.

Assuming no case sensitivity, this will get you a list of duplicate customer names along with a count of how many times.

select replace(replace(Name, ',', ''), '.', ''), count(*) from CustomerTable
group by replace(replace(Name, ',', ''), '.', '') having count(*) > 1

Open in new window


So then building the list is easy...

select * from CustomerTable where replace(replace(Name, ',', ''), '.', '') in
(select replace(replace(Name, ',', ''), '.', ''), count(*) from CustomerTable
group by replace(replace(Name, ',', ''), '.', '') having count(*) > 1)

Open in new window

0
 
netjgrnautCommented:
The tricky part is knowing which ones are OK to remove.

Assuming that this isn't the only table in your database, other tables that reference customers will (should?) do so using the ID - not the name.

So you'll need to update all such tables to use the ID you're going to keep (to tie all the customer data to a single entry in the customer table) before dropping any duplicates from the customer table.

CAUTION (and be sure to have a backup)
0
 
nbtnotesAuthor Commented:
I won't be removing the duplicates. I will flag those as being archive because they are duplicates. All references will be retained by the ID.
0
Concerto's Cloud Advisory Services

Want to avoid the missteps to gaining all the benefits of the cloud? Learn more about the different assessment options from our Cloud Advisory team.

 
netjgrnautCommented:
Good plan.

Let me know if the SELECT I provided needs tweaking.  I had to make some assumptions about table and column names, but I think you get the gist...
0
 
netjgrnautCommented:
select replace(replace(Name, ',', ''), '.', ''), * from CustomerTable 
where replace(replace(Name, ',', ''), '.', '') in
(select replace(replace(Name, ',', ''), '.', ''), count(*) from CustomerTable
group by replace(replace(Name, ',', ''), '.', '') having count(*) > 1)
order by replace(replace(Name, ',', ''), '.', ''),  ID

Open in new window


That'll be a bit more readable, and you'll see the "normalized" key used to identify the duplicates.
0
 
nbtnotesAuthor Commented:
I'm new to SQL and want to make sure. The replace function only replace the values for this query and not the actual tables values ?
0
 
netjgrnautCommented:
Yes.

To alter data in a table, you must use UPDATE (or INSERT).

SELECT alone does not make changes.
0
 
nbtnotesAuthor Commented:
The following Query received an error:

SELECT     TOP (100) PERCENT REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS Expr1, CustomerNumber
FROM         dbo.Customer
WHERE     (REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') IN
                          (SELECT     REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS Expr1, COUNT(CustomerNumber) AS Expr2
                            FROM          dbo.Customer AS Customer_1
                            GROUP BY REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '')
                            HAVING      (COUNT(CustomerNumber) > 1)))
ORDER BY Expr1, CustomerNumber

Open in new window


SQL Query Error
0
 
Lalit ChandraCommented:
You should not return two column within the subQuery as sql is aspecting a single value,but the inner query is returning two row... which is the root couse for your Error.

SELECT     TOP (100) PERCENT REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS Expr1, CustomerNumber
FROM         dbo.Customer
WHERE     (REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') IN
                          (SELECT     REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS Expr1, COUNT(CustomerNumber) AS Expr2
                            FROM          dbo.Customer AS Customer_1
                            GROUP BY REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '')
                            HAVING      (COUNT(CustomerNumber) > 1)))
ORDER BY Expr1, CustomerNumber

You want to use two  cols in the subquery then you must use EXISTS keyword in the WHERE Clause.

Hope this will resolve your issue.
0
 
netjgrnautCommented:
Whoops... forgot to nest a SELECT to get the DupMatch

I've replaced the SQL auto-values with names that parse better (to me, at least)...

SELECT REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS DupMatch, CustomerNumber, Name 
FROM  dbo.Customer
WHERE REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') IN
  (SELECT DupMatch FROM
    (SELECT     REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '') AS DupMatch, 
     COUNT(*) AS NumDups
    FROM dbo.Customer
    GROUP BY REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', '')
    HAVING (COUNT(*) > 1)))
ORDER BY REPLACE(REPLACE(REPLACE(Name, ',', ''), '.', ''), ' ', ''), Name

Open in new window


I notice you're replacing empty spaces now, too.  I'd be careful with that.  It will replace *all* the empty spaces.  Are you perhaps trying to replace '  ' with ' ' (two empty spaces with one)?
0
 
netjgrnautCommented:
...also notice that I've added the actual Name to the results, so you can visually verify that the duplicate match makes sense.
0
 
nbtnotesAuthor Commented:
Thanks for the help. Everything is working now.
0

Featured Post

Free Tool: Path Explorer

An intuitive utility to help find the CSS path to UI elements on a webpage. These paths are used frequently in a variety of front-end development and QA automation tasks.

One of a set of tools we're offering as a way of saying thank you for being a part of the community.

  • 7
  • 4
Tackle projects and never again get stuck behind a technical roadblock.
Join Now