Link to home
Start Free TrialLog in
Avatar of Putoch
PutochFlag for Afghanistan

asked on

Compairing Data using T-SQL

Hi,
i would like to be able to compair 2 tables against each other to find similar address matches.
I first need to standerdise my data in each table and insert this data into  temp tables.
Like :
(2)  change all data to same case (upper or lower, doesn't matter which).
(3)  strip out all punctuation and all extra spaces
(4)  parse the space-delimited elements and standardize all descriptive words such as "NORTH" to "N", "STREET" to "ST",     substitute soundex values for uncommon words, etc.  For this task I used the Postal Systems's standard abbreviations.

Can some one help me as to how i can achieve this?
I am using t-SQL on SQL SERVER 2005..
Thanks
Avatar of rickchild
rickchild
Flag of United Kingdom of Great Britain and Northern Ireland image

To try another angle, depending on the size of your organisation you may be able to buy in address matching software, as matching addresses successfully can be a very specialised task.

I have used Capscan Matchcode Batch in the past when I was working in Direct Marketing (Junk Mail!), and I am fairly sure they aupport Irish EIRE data in addition to UK Royal Mail addresses.

Might be worth a trial, they are extremely good pieces of software for address standardisation and matching.

http://www.capscan.com/matchcode_batch.aspx
http://www.qas.co.uk/products/clean-and-suppress-address-data/batch.htm
To go into further detail those types of software can process the data and standardise the addresses to how they are stored in the postal service database. This is often used to make sure mail reaches the address successfully.  If that wasn't enough to help you match then they also assign a Unique ID to each address which you can match on.

You can also extract separate address elements, such as Town, County, House Name, House Number.

They will flag the data on how successful the match to the postal service database file was, either with 1/0, or sometimes a 0 and a percentage of how confident it is in the match.  (For example you could have 2 streets with the same name in a post town, and if the postcode is incomplete it will have to make assumptions.)
Avatar of Putoch

ASKER

Thank you rick, but i am afraid i don't think they will pay for that.

I was hoping there was some way of Formatting this, maybe by using Functions or SP

Has anyone every used anything like this before using t-sql?

thanks,
How is the data currently stored?  As one long address line, or separate columns for address lines?
If it is currently in one long line with say commas your first step is to separate this out into individual address lines, as you would see on an envelope.

Unfortunately I have found in the past when unable to use matching software, the best way to go about it is just match on Postcode and House Number/Name/Address Line 1, and forget those small percentage which fall through.
Sorry to get back to your question anyway!  I just feel strongly about these packages, as I know if you are serious about good address matching it can be weeks/months of work, and then the £2000 software looks a bit more reasonable...


(2)  change all data to same case (upper or lower, doesn't matter which).

SQL has functions UPPER() and LOWER() which you could use in a select or an update.
Note this is not required if your server collation is set to CI (Case-Insensitive)


(3)  strip out all punctuation and all extra spaces

REPLACE() function can be used here.  So for example a common one is to replace double spaces with single spaces.   REPLACE(Column1, '  ', ' ')  Again you use this in a SELECT or UPDATE

(4)  parse the space-delimited elements and standardize all descriptive words such as "NORTH" to "N", "STREET" to "ST",     substitute soundex values for uncommon words, etc.  For this task I used the Postal Systems's standard abbreviations.

You can again use REPLACE() so say REPLACE(Column1, 'STREET', 'ST')

For the soundex you can see some documentation here.
http://msdn.microsoft.com/en-us/library/ms187384.aspx
Avatar of Putoch

ASKER

Thanks Rick for your advice and help, i will give this a go and see how i fair out.
If You are running SQL 2005 with SMS Bids then you should have the SSIS tool ( SQL Server Integration Services). This  ETL feature of 2005 is amazing once you start using it. The data transformation part of this tool will do most everything you mention above. I think it's worth the time to learn it and the learning curve is pretty low. I created my first packages and was transforming data the same day ...

You might want to check out the info on it. Here's a first step tutorial link that can get you started.

http://msdn.microsoft.com/en-us/library/ms169917.aspx
Avatar of Putoch

ASKER

Hi tshel,
i do indeed have the ssis tool, i have created packages before for extracting data from differnet sources into tables into mssql, and for loading tables with data. but never to do this.
Have you used it for this specifically when you tried it out ( i mean to clean up the data )

thanks for you respnse also!!
Avatar of Putoch

ASKER

Did you use Fuzzy lookups?
Yes  Absolutely, In fact that's the main reason we use it and I have for some pretty major conversion projects.

When Using the Data Transformation portion of BIDS  in the Data Flow task when creating packages I have used many of the transformations to cleanse the data ie: Regular expressions for email matching, changing phone numbers to all numbers vs dashes and parenthesis, Grabbing like titles with Fuzzy lookup & dumping the errors to a seperate files etc etc...... It is a very powerful conversion tool once you get the knack of it...  Here is a pretty good starting step by step article on using the data flow in ssis...

http://dotnetslackers.com/articles/sql/UsingSQLServerIntegrationServicesDataFlowSQLServer2005.aspx

Enjoy!
Thanks
Avatar of Putoch

ASKER

Ok tshel, thanks for the link too, i will have a go at this and let u know how i get on.
ASKER CERTIFIED SOLUTION
Avatar of Tone' Shelby
Tone' Shelby
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
woops my apologies: the how to on fuzzy grouping is ... http://msdn.microsoft.com/en-us/library/ms142155.aspx
again once you get there you will see all the other how to's on just about anything you may need to trnasform... take care.
Hey checking in on you. How are you getting along on this? ...need any more assistance...thanks
Just checking up on how you again are coming along & if you need any additional assistance on this or is your issue resolved? --- thanks!
Avatar of Putoch

ASKER

thanks Tshel, i ended up using some nested queies and pulled back what ever results i could.
The business needed the results very quickly and i did not have the time to develop a technique to pulling and looking through the data. thanks so much for your advise along the way.
regards,
putoch