Solved

Compairing Data using T-SQL

Posted on 2008-06-18
16
326 Views
Last Modified: 2010-04-21
Hi,
i would like to be able to compair 2 tables against each other to find similar address matches.
I first need to standerdise my data in each table and insert this data into  temp tables.
Like :
(2)  change all data to same case (upper or lower, doesn't matter which).
(3)  strip out all punctuation and all extra spaces
(4)  parse the space-delimited elements and standardize all descriptive words such as "NORTH" to "N", "STREET" to "ST",     substitute soundex values for uncommon words, etc.  For this task I used the Postal Systems's standard abbreviations.

Can some one help me as to how i can achieve this?
I am using t-SQL on SQL SERVER 2005..
Thanks
0
Comment
Question by:Putoch
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 6
  • 6
  • 4
16 Comments
 
LVL 13

Expert Comment

by:rickchild
ID: 21814223
To try another angle, depending on the size of your organisation you may be able to buy in address matching software, as matching addresses successfully can be a very specialised task.

I have used Capscan Matchcode Batch in the past when I was working in Direct Marketing (Junk Mail!), and I am fairly sure they aupport Irish EIRE data in addition to UK Royal Mail addresses.

Might be worth a trial, they are extremely good pieces of software for address standardisation and matching.

http://www.capscan.com/matchcode_batch.aspx
http://www.qas.co.uk/products/clean-and-suppress-address-data/batch.htm
0
 
LVL 13

Expert Comment

by:rickchild
ID: 21814301
To go into further detail those types of software can process the data and standardise the addresses to how they are stored in the postal service database. This is often used to make sure mail reaches the address successfully.  If that wasn't enough to help you match then they also assign a Unique ID to each address which you can match on.

You can also extract separate address elements, such as Town, County, House Name, House Number.

They will flag the data on how successful the match to the postal service database file was, either with 1/0, or sometimes a 0 and a percentage of how confident it is in the match.  (For example you could have 2 streets with the same name in a post town, and if the postcode is incomplete it will have to make assumptions.)
0
 

Author Comment

by:Putoch
ID: 21814309
Thank you rick, but i am afraid i don't think they will pay for that.

I was hoping there was some way of Formatting this, maybe by using Functions or SP

Has anyone every used anything like this before using t-sql?

thanks,
0
Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
LVL 13

Expert Comment

by:rickchild
ID: 21814352
How is the data currently stored?  As one long address line, or separate columns for address lines?
If it is currently in one long line with say commas your first step is to separate this out into individual address lines, as you would see on an envelope.

Unfortunately I have found in the past when unable to use matching software, the best way to go about it is just match on Postcode and House Number/Name/Address Line 1, and forget those small percentage which fall through.
0
 
LVL 13

Expert Comment

by:rickchild
ID: 21814476
Sorry to get back to your question anyway!  I just feel strongly about these packages, as I know if you are serious about good address matching it can be weeks/months of work, and then the £2000 software looks a bit more reasonable...


(2)  change all data to same case (upper or lower, doesn't matter which).

SQL has functions UPPER() and LOWER() which you could use in a select or an update.
Note this is not required if your server collation is set to CI (Case-Insensitive)


(3)  strip out all punctuation and all extra spaces

REPLACE() function can be used here.  So for example a common one is to replace double spaces with single spaces.   REPLACE(Column1, '  ', ' ')  Again you use this in a SELECT or UPDATE

(4)  parse the space-delimited elements and standardize all descriptive words such as "NORTH" to "N", "STREET" to "ST",     substitute soundex values for uncommon words, etc.  For this task I used the Postal Systems's standard abbreviations.

You can again use REPLACE() so say REPLACE(Column1, 'STREET', 'ST')

For the soundex you can see some documentation here.
http://msdn.microsoft.com/en-us/library/ms187384.aspx
0
 

Author Comment

by:Putoch
ID: 21817060
Thanks Rick for your advice and help, i will give this a go and see how i fair out.
0
 
LVL 9

Expert Comment

by:Tone' Shelby
ID: 21818286
If You are running SQL 2005 with SMS Bids then you should have the SSIS tool ( SQL Server Integration Services). This  ETL feature of 2005 is amazing once you start using it. The data transformation part of this tool will do most everything you mention above. I think it's worth the time to learn it and the learning curve is pretty low. I created my first packages and was transforming data the same day ...

You might want to check out the info on it. Here's a first step tutorial link that can get you started.

http://msdn.microsoft.com/en-us/library/ms169917.aspx
0
 

Author Comment

by:Putoch
ID: 21818347
Hi tshel,
i do indeed have the ssis tool, i have created packages before for extracting data from differnet sources into tables into mssql, and for loading tables with data. but never to do this.
Have you used it for this specifically when you tried it out ( i mean to clean up the data )

thanks for you respnse also!!
0
 

Author Comment

by:Putoch
ID: 21818402
Did you use Fuzzy lookups?
0
 
LVL 9

Expert Comment

by:Tone' Shelby
ID: 21821135
Yes  Absolutely, In fact that's the main reason we use it and I have for some pretty major conversion projects.

When Using the Data Transformation portion of BIDS  in the Data Flow task when creating packages I have used many of the transformations to cleanse the data ie: Regular expressions for email matching, changing phone numbers to all numbers vs dashes and parenthesis, Grabbing like titles with Fuzzy lookup & dumping the errors to a seperate files etc etc...... It is a very powerful conversion tool once you get the knack of it...  Here is a pretty good starting step by step article on using the data flow in ssis...

http://dotnetslackers.com/articles/sql/UsingSQLServerIntegrationServicesDataFlowSQLServer2005.aspx

Enjoy!
Thanks
0
 

Author Comment

by:Putoch
ID: 21821736
Ok tshel, thanks for the link too, i will have a go at this and let u know how i get on.
0
 
LVL 9

Accepted Solution

by:
Tone' Shelby earned 300 total points
ID: 21827699
My pleasure -- PS the link above gets you atrted on a simple package but if you need How to: specifics MSDN can walk you through - example for how to on fuzzy grouping: http://msdn.microsoft.com/en-us/library/ms137786.aspx
when you get there you will see all kinds of how tos on the left.

and here are all the other available  Integration Services Transformations: http://msdn.microsoft.com/en-us/library/ms141713.aspx

Have Fun!!


0
 
LVL 9

Expert Comment

by:Tone' Shelby
ID: 21827723
woops my apologies: the how to on fuzzy grouping is ... http://msdn.microsoft.com/en-us/library/ms142155.aspx
again once you get there you will see all the other how to's on just about anything you may need to trnasform... take care.
0
 
LVL 9

Expert Comment

by:Tone' Shelby
ID: 22196230
Hey checking in on you. How are you getting along on this? ...need any more assistance...thanks
0
 
LVL 9

Expert Comment

by:Tone' Shelby
ID: 22465673
Just checking up on how you again are coming along & if you need any additional assistance on this or is your issue resolved? --- thanks!
0
 

Author Closing Comment

by:Putoch
ID: 31468414
thanks Tshel, i ended up using some nested queies and pulled back what ever results i could.
The business needed the results very quickly and i did not have the time to develop a technique to pulling and looking through the data. thanks so much for your advise along the way.
regards,
putoch
0

Featured Post

Free eBook: Backup on AWS

Everything you need to know about backup and disaster recovery with AWS, for FREE!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

I'm trying, I really am. But I've seen so many wrong approaches involving date(time) boundaries I despair about my inability to explain it. I've seen quite a few recently that define a non-leap year as 364 days, or 366 days and the list goes on. …
I have a large data set and a SSIS package. How can I load this file in multi threading?
Via a live example, show how to shrink a transaction log file down to a reasonable size.
Viewers will learn how to use the INSERT statement to insert data into their tables. It will also introduce the NULL statement, to show them what happens when no value is giving for any given column.

740 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question