Solved

Merge and cleanse addresses

Posted on 2014-12-09
27
66 Views
Last Modified: 2015-01-02
Hi,

I have three different data sources with customer information that I've imported into three tables in my sql server database and called:
customer_1
customerID
firstname
lastname
address
zip
city
country

customer_2
customerID
Name
address
zip
city
country

customer_3
customerID
Name
address
zip
city
country

Data from each table looks like this:
customer_1
firstname   lastname    address    zip       city              country
John            Smith          theway 4  12345  Stockholm Sverige

customer_2
Name             address    zip        city                country
Smith, John   theway 4  12345   Stockholm  Sverige

customer_3
Name           address      zip        city               country
John Smith   theway 4   12345  Stockholm   Sverige

How can I merge the three and see so that it get cleansed so that only one instance of the same customer shows up in the result?

Thanks for help!

Peter
0
Comment
Question by:peternordberg
  • 12
  • 7
  • 6
  • +1
27 Comments
 
LVL 45

Expert Comment

by:Vitor Montalvão
Comment Utility
You need to normalize those tables. From those 3, which one you want to keep?
The customer ID it's the same for the 3 tables? i.e. should John Smith has the same customerID in customer_1, customer_2 and customer_3 table?
0
 
LVL 24

Accepted Solution

by:
Phillip Burton earned 500 total points
Comment Utility
Here's the set-up for my example:

Create table customer_1
(firstname nvarchar(20),
lastname nvarchar(20),
address nvarchar(20),
zip nvarchar(20),
[city] nvarchar(20),
country nvarchar(20))

insert into customer_1 values('John','Smith','theway 4','12345','Stockholm','Sverige')

Create table customer_2
([Name] nvarchar(20),
address nvarchar(20),
zip nvarchar(20),
[city] nvarchar(20),
country nvarchar(20))

insert into customer_2 values('Smith, John','theway 4','12345','Stockholm','Sverige')

Create table customer_3
([Name] nvarchar(20),
address nvarchar(20),
zip nvarchar(20),
[city] nvarchar(20),
country nvarchar(20))

insert into customer_3 values('John Smith','theway 4','12345','Stockholm','Sverige')

Open in new window


and here's the query:

select rtrim(ltrim(concat(firstname,' ',lastname))) as [Name], address, zip, [city], country
from customer_1
union
select rtrim(ltrim(substring([Name], charindex(',',[Name])+1,9999))) + ' ' + rtrim(ltrim(substring([Name], 1, charindex(',',[Name])-1))), address, zip, [city], country
from customer_2
union
select [Name], address, zip, [city], country
from customer_3

Open in new window

0
 
LVL 65

Expert Comment

by:Jim Horn
Comment Utility
And while you're at it, define in excrutiatingly clear terms what 'it get cleansed' means, as Data Quality behind demographic info can involve hundreds of validations.
0
 

Author Comment

by:peternordberg
Comment Utility
Hi,

With cleansed I mean no doubles.
I get to keep data from customer_1.
The customerID is not important in this setting. I just want to try and remove those that seem to be the same person and address from the result so that they only occur one time.

Peter
0
 

Author Comment

by:peternordberg
Comment Utility
Hi,

I tried this:

SELECT Name, Adress, Postadress FROM dbo.customer_1 AS c
union
select rtrim(ltrim(substring([Name], charindex(',',[Name])+1,9999))) + ' ' + rtrim(ltrim(substring([Name], 1, charindex(',',[Name])-1))), Adr, Postnr + ' ' + Ort AS postadress
from customer_3
union
select Name, Postadress, Postnr + ' ' + Ort AS postadress
from customer_2

Open in new window


But I get this error:

Msg 537, Level 16, State 3, Line 1
Invalid length parameter passed to the LEFT or SUBSTRING function.

Peter
0
 
LVL 65

Expert Comment

by:Jim Horn
Comment Utility
If all you're trying to do is 'remove duplicates, keeping the customer_1 row whenever possible', then Phillip's final query in his above comment does that as far as SELECTing the rows.

Now tell us if that's not good enough, what you're trying to do:
  Delete all of the rows from all of the tables determined to be a duplicate?
  Create another table that contains all of the unique customer rows?
0
 
LVL 24

Expert Comment

by:Phillip Burton
Comment Utility
Do you have items in the table customer_3 which do NOT have a comma in the Name field?

That might cause that (but you didn't say that in your question).
0
 

Author Comment

by:peternordberg
Comment Utility
Hi,

This is my query and result of how the data looks in customer_2 (sorry I confused you with customer_3 and customer_2 in last post but now it is customer_2 as in my first post)

Sql
I found out that it is the second expression the error comes. The result of the first substring is this:
sql 2
And here is the second with the error:
sql 3
If you can figure out what the error is I would be most grateful.

Peter
0
 

Author Comment

by:peternordberg
Comment Utility
Hi again,

I changed the second expression a little like this:
sql4I then found out that in some of the rows a comma isn't included like you can see in the above image. How can I compensate for that?

Thanks so much for help!

Peter
0
 
LVL 45

Expert Comment

by:Vitor Montalvão
Comment Utility
Without a JOIN field you need to compare all those fields. In the example below I'm assuming that ZIP is a varchar field:
 SELECT 'Customer_2' AS TableName, c2.*
 FROM Customer_1 c1, Customer_2 c2
 WHERE PATINDEX('%'+c1.firstname+'%',  c2.name)>0
   AND PATINDEX('%'+c1.lastname+'%',  c2.name)>0
   AND PATINDEX('%'+c1.address+'%',  c2.address)>0
   AND PATINDEX('%'+c1.zip+'%',  c2.zip)>0
   AND PATINDEX('%'+c1.city+'%',  c2.city)>0
   AND PATINDEX('%'+c1.country+'%',  c2.country)>0
 UNION ALL
 SELECT 'Customer_3' AS TableName, c3.*
 FROM Customer_1 c1, Customer_2 c3
 WHERE PATINDEX('%'+c1.firstname+'%',  c3.name)>0 
   AND PATINDEX('%'+c1.lastname+'%',  c3.name)>0
   AND PATINDEX('%'+c1.address+'%',  c3.address)>0
   AND PATINDEX('%'+c1.zip+'%',  c2.zip)>0
   AND PATINDEX('%'+c1.city+'%',  c3.city)>0
   AND PATINDEX('%'+c1.country+'%',  c3.country)>0

Open in new window

0
 
LVL 24

Expert Comment

by:Phillip Burton
Comment Utility
Change line 3 in your code, Peter, to this

select rtrim(ltrim(substring([Name], charindex(',',[Name])+1,9999))) + ' ' + rtrim(ltrim(substring([Name], 1, charindex(',',[Name])-1))), Adr, Postnr + ' ' + Ort AS postadress

Open in new window


See if that solves the problem. If not, could you please post (in spreadsheet form) the contents of the Name field (none of the others, for confidentiality reasons).
0
 

Author Comment

by:peternordberg
Comment Utility
Hi and thanks for answer,

I tried this query but got the following message:
 SELECT 'Customer_2' AS TableName, c2.*
 FROM Customer_1 c1, Customer_2 c2
 WHERE PATINDEX('%'+c1.firstname+'%',  c2.name)>0
   AND PATINDEX('%'+c1.lastname+'%',  c2.name)>0
   AND PATINDEX('%'+c1.address+'%',  c2.address)>0
   AND PATINDEX('%'+c1.zip+'%',  c2.zip)>0
   AND PATINDEX('%'+c1.city+'%',  c2.city)>0
  
 UNION ALL
 SELECT 'Customer_3' AS TableName, c3.*
 FROM Customer_1 c1, Customer_2 c3
 WHERE PATINDEX('%'+c1.firstname+'%',  c3.name)>0 
   AND PATINDEX('%'+c1.lastname+'%',  c3.name)>0
   AND PATINDEX('%'+c1.address+'%',  c3.address)>0
   AND PATINDEX('%'+c1.zip+'%',  c2.zip)>0
   AND PATINDEX('%'+c1.city+'%',  c3.city)>0

Open in new window



Msg 4104, Level 16, State 1, Line 15
The multi-part identifier "c2.zip" could not be bound.

What can be wrong?

Peter
0
 

Author Comment

by:peternordberg
Comment Utility
Philip Burton,

I changed to this:
select rtrim(ltrim(concat(firstname,' ',lastname))) as [Name], address, zip + ' ' + [city] AS postadress
from customer_1
union
select rtrim(ltrim(substring([Name], charindex(',',[Name])+1,9999))) + ' ' + rtrim(ltrim(substring([Name], 1, charindex(',',[Name])-1))), address, zip + ' ' + city AS postadress
from customer_2
union
select Name, address, zip + ' ' + city AS postadress
from customer_3

Open in new window


Still get this error:

Msg 537, Level 16, State 3, Line 1
Invalid length parameter passed to the LEFT or SUBSTRING function.

Thanks for help!

Peter
0
Control application downtime with dependency maps

Visualize the interdependencies between application components better with Applications Manager's automated application discovery and dependency mapping feature. Resolve performance issues faster by quickly isolating problematic components.

 
LVL 24

Expert Comment

by:Phillip Burton
Comment Utility
Sorry - my copy and paste must have got the wrong line.

I meant to say, try this:

select rtrim(ltrim(substring([Name], charindex(',',concat([Name],','))+1,9999) + ' ' + substring([Name], 1, charindex(',',concat([Name],','))-1))), Adr, Postnr + ' ' + Ort AS postadress

Open in new window

0
 

Author Comment

by:peternordberg
Comment Utility
Hi Philip Burton,

That worked!

But I have now the problem that in one table the zip is looking like 123 45 and in the other 12345. Is it possible to limit the comparison to only name and city or remove spaces from the zip?

Peter
0
 
LVL 24

Expert Comment

by:Phillip Burton
Comment Utility
Then change all occurrences of Postnr to

Replace(Postnr,' ','') as Postnr
0
 

Author Comment

by:peternordberg
Comment Utility
Hi and thanks,

Worked just fine. If I want to remove those zips that contain characters how would I do that?

Peter
0
 
LVL 24

Expert Comment

by:Phillip Burton
Comment Utility
With some difficulty - you probably need a function for that, and that is properly another question, rather than a continuation of this one.
0
 
LVL 45

Expert Comment

by:Vitor Montalvão
Comment Utility
Peter, did you tried my solution?
0
 

Author Comment

by:peternordberg
Comment Utility
Hi Vitor Montalvão,

I did and I answered it above. There was an error I didn't know how to fix.

Peter
0
 
LVL 45

Expert Comment

by:Vitor Montalvão
Comment Utility
Sorry, I missed that.
It was a copy/past issue. Here's the corrected one:
 SELECT 'Customer_2' AS TableName, c2.*
 FROM Customer_1 c1, Customer_2 c2
 WHERE PATINDEX('%'+c1.firstname+'%',  c2.name)>0
   AND PATINDEX('%'+c1.lastname+'%',  c2.name)>0
   AND PATINDEX('%'+c1.address+'%',  c2.address)>0
   AND PATINDEX('%'+c1.zip+'%',  c2.zip)>0
   AND PATINDEX('%'+c1.city+'%',  c2.city)>0
  
 UNION ALL
 SELECT 'Customer_3' AS TableName, c3.*
 FROM Customer_1 c1, Customer_2 c3
 WHERE PATINDEX('%'+c1.firstname+'%',  c3.name)>0 
   AND PATINDEX('%'+c1.lastname+'%',  c3.name)>0
   AND PATINDEX('%'+c1.address+'%',  c3.address)>0
   AND PATINDEX('%'+c1.zip+'%',  c3.zip)>0
   AND PATINDEX('%'+c1.city+'%',  c3.city)>0

Open in new window

0
 

Author Comment

by:peternordberg
Comment Utility
Hi Vitor Montalvão,

I tested it and it executed successfully but without any result???

Peter
0
 
LVL 45

Expert Comment

by:Vitor Montalvão
Comment Utility
Can you post a subset of data that you are working with? I only tested with those 3 records you provided in the main post.
Also I added all columns but you might don't need all of them. For example, do you really need do compare the ZIP code, since address + city + country should give you an accurate information, right?
0
 

Author Comment

by:peternordberg
Comment Utility
Hi

Here is customer_1
firstname      lastname      address      zip      city
Eiford      Wirzén      Vinkelvägen 19      57251      OSKARSHAMN
Kurt      Jönsson      Allévägen 19      29471      SÖLVESBORG
Stig      Radsjö      Skiftesvägen 37 L      90354      UMEÅ
Runa      Gustafsson      Svensgärdegatan 15      50730      BRÄMHULT
Björn      Wahllöf      Sturegatan 16      50342      BORÅS
Peder      Teglund      Box 264      50310      BORÅS
Berit       Johansson      Kyrkvägen 13 A      50731      BRÄMHULT
Desirée      Jonsson      Västra Allén 9      66732      FORSHAGA
Rolf      Andersson      Södra Hällsjövägen 12      77290      GRÄNGESBERG

Customer_2
Name      address      address2      zip      city
Källner, Elisabeth      Härjestad Skattegård 1      NULL      578 91      ANEBY
Gottfridsson, Lorentz      Ärentunavägen 5 P, nb      NULL      743 30      STORVRETA
Sundström, Britt-Marie      Förmansvägen 14, 4 tr      NULL      151 47      SÖDERTÄLJE
Blom, Marit      Tvärgatan 6 B      NULL      782 31      MALUNG
Fredriksson, Rut      Bläsinge Mellangata 5      NULL      380 74      LÖTTORP
Gustavsson, Majvi      Rödakorsgatan 11      NULL      282 34      TYRINGE
Lindstrand, Britt-Marie      Ellefors 7474      NULL      343 96      HALLARYD
Hammar, Per-Olof      Solavägen 28      NULL      571 72      ORMARYD
Semnos Förlag      Att: Joakim      Björkgatan 9      331 51      VÄRNAMO

customer_3
Name      address      address2      Besöksadress      zip      city
Rita Larsson      Aspfjärilgatan 5      NULL      NULL      233 36      Svedala
Rita Möller      Svinnegarn Huseby Säteri 8      NULL      NULL      745 93      Enköping
Rita Nordström      Norrbodavägen 8      NULL      NULL      571 34      Nässjö
Rita Ruff Molnar      Hålsjögatan 41      NULL      NULL      217 66      Malmö
Rita Rågelid      Timmervägen 20      NULL      NULL      737 49      Fagersta
Rita Störsjö      Smidesvägen 4      NULL      NULL      311 36      Falkenberg
Ritja Bloom      Spelmansvägen 3      NULL      NULL      794 32      Orsa
Ritva Nordström      Vårlöksstigen 14      NULL      NULL      561 40      Huskvarna
Ritva Väisänen      Lindgatan 6 A      NULL      NULL      512 52      Svenljunga

Also, you are correct that it would be enough with comparing name, address and city.

Peter
0
 
LVL 24

Expert Comment

by:Phillip Burton
Comment Utility
I've just tested my solution with the new data, and believe it works.
0
 
LVL 45

Expert Comment

by:Vitor Montalvão
Comment Utility
Peter, with those data I can't do nothing since there's no matching records.
Anyway, try to test my query without the ZIP code comparison since you won't need it:
SELECT 'Customer_2' AS TableName, c2.*
FROM Customer_1 c1, Customer_2 c2
WHERE PATINDEX('%'+c1.firstname+'%',  c2.name)>0
   AND PATINDEX('%'+c1.lastname+'%',  c2.name)>0
   AND PATINDEX('%'+c1.address+'%',  c2.address)>0
   AND PATINDEX('%'+c1.city+'%',  c2.city)>0
   AND PATINDEX('%'+c1.country+'%',  c2.country)>0
UNION ALL
SELECT 'Customer_3' AS TableName, c3.*
FROM Customer_1 c1, Customer_2 c3
WHERE PATINDEX('%'+c1.firstname+'%',  c3.name)>0 
   AND PATINDEX('%'+c1.lastname+'%',  c3.name)>0
   AND PATINDEX('%'+c1.address+'%',  c3.address)>0
   AND PATINDEX('%'+c1.city+'%',  c3.city)>0
   AND PATINDEX('%'+c1.country+'%',  c3.country)>0

Open in new window

0
 

Author Closing Comment

by:peternordberg
Comment Utility
I'm sorry for the late award points. I went on vacation for christmas without being able to update it.

Thanks for help!

Peter
0

Featured Post

IT, Stop Being Called Into Every Meeting

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

Join & Write a Comment

If you find yourself in this situation “I have used SELECT DISTINCT but I’m getting duplicates” then I'm sorry to say you are using the wrong SQL technique as it only does one thing which is: produces whole rows that are unique. If the results you a…
Slowly Changing Dimension Transformation component in data task flow is very useful for us to manage and control how data changes in SSIS.
Via a live example, show how to extract information from SQL Server on Database, Connection and Server properties
Via a live example, show how to setup several different housekeeping processes for a SQL Server.

743 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

15 Experts available now in Live!

Get 1:1 Help Now