Putoch
asked on
Compairing Data using T-SQL
Hi,
i would like to be able to compair 2 tables against each other to find similar address matches.
I first need to standerdise my data in each table and insert this data into temp tables.
Like :
(2) change all data to same case (upper or lower, doesn't matter which).
(3) strip out all punctuation and all extra spaces
(4) parse the space-delimited elements and standardize all descriptive words such as "NORTH" to "N", "STREET" to "ST", substitute soundex values for uncommon words, etc. For this task I used the Postal Systems's standard abbreviations.
Can some one help me as to how i can achieve this?
I am using t-SQL on SQL SERVER 2005..
Thanks
i would like to be able to compair 2 tables against each other to find similar address matches.
I first need to standerdise my data in each table and insert this data into temp tables.
Like :
(2) change all data to same case (upper or lower, doesn't matter which).
(3) strip out all punctuation and all extra spaces
(4) parse the space-delimited elements and standardize all descriptive words such as "NORTH" to "N", "STREET" to "ST", substitute soundex values for uncommon words, etc. For this task I used the Postal Systems's standard abbreviations.
Can some one help me as to how i can achieve this?
I am using t-SQL on SQL SERVER 2005..
Thanks
To go into further detail those types of software can process the data and standardise the addresses to how they are stored in the postal service database. This is often used to make sure mail reaches the address successfully. If that wasn't enough to help you match then they also assign a Unique ID to each address which you can match on.
You can also extract separate address elements, such as Town, County, House Name, House Number.
They will flag the data on how successful the match to the postal service database file was, either with 1/0, or sometimes a 0 and a percentage of how confident it is in the match. (For example you could have 2 streets with the same name in a post town, and if the postcode is incomplete it will have to make assumptions.)
You can also extract separate address elements, such as Town, County, House Name, House Number.
They will flag the data on how successful the match to the postal service database file was, either with 1/0, or sometimes a 0 and a percentage of how confident it is in the match. (For example you could have 2 streets with the same name in a post town, and if the postcode is incomplete it will have to make assumptions.)
ASKER
Thank you rick, but i am afraid i don't think they will pay for that.
I was hoping there was some way of Formatting this, maybe by using Functions or SP
Has anyone every used anything like this before using t-sql?
thanks,
I was hoping there was some way of Formatting this, maybe by using Functions or SP
Has anyone every used anything like this before using t-sql?
thanks,
How is the data currently stored? As one long address line, or separate columns for address lines?
If it is currently in one long line with say commas your first step is to separate this out into individual address lines, as you would see on an envelope.
Unfortunately I have found in the past when unable to use matching software, the best way to go about it is just match on Postcode and House Number/Name/Address Line 1, and forget those small percentage which fall through.
If it is currently in one long line with say commas your first step is to separate this out into individual address lines, as you would see on an envelope.
Unfortunately I have found in the past when unable to use matching software, the best way to go about it is just match on Postcode and House Number/Name/Address Line 1, and forget those small percentage which fall through.
Sorry to get back to your question anyway! I just feel strongly about these packages, as I know if you are serious about good address matching it can be weeks/months of work, and then the £2000 software looks a bit more reasonable...
(2) change all data to same case (upper or lower, doesn't matter which).
SQL has functions UPPER() and LOWER() which you could use in a select or an update.
Note this is not required if your server collation is set to CI (Case-Insensitive)
(3) strip out all punctuation and all extra spaces
REPLACE() function can be used here. So for example a common one is to replace double spaces with single spaces. REPLACE(Column1, ' ', ' ') Again you use this in a SELECT or UPDATE
(4) parse the space-delimited elements and standardize all descriptive words such as "NORTH" to "N", "STREET" to "ST", substitute soundex values for uncommon words, etc. For this task I used the Postal Systems's standard abbreviations.
You can again use REPLACE() so say REPLACE(Column1, 'STREET', 'ST')
For the soundex you can see some documentation here.
http://msdn.microsoft.com/en-us/library/ms187384.aspx
(2) change all data to same case (upper or lower, doesn't matter which).
SQL has functions UPPER() and LOWER() which you could use in a select or an update.
Note this is not required if your server collation is set to CI (Case-Insensitive)
(3) strip out all punctuation and all extra spaces
REPLACE() function can be used here. So for example a common one is to replace double spaces with single spaces. REPLACE(Column1, ' ', ' ') Again you use this in a SELECT or UPDATE
(4) parse the space-delimited elements and standardize all descriptive words such as "NORTH" to "N", "STREET" to "ST", substitute soundex values for uncommon words, etc. For this task I used the Postal Systems's standard abbreviations.
You can again use REPLACE() so say REPLACE(Column1, 'STREET', 'ST')
For the soundex you can see some documentation here.
http://msdn.microsoft.com/en-us/library/ms187384.aspx
ASKER
Thanks Rick for your advice and help, i will give this a go and see how i fair out.
If You are running SQL 2005 with SMS Bids then you should have the SSIS tool ( SQL Server Integration Services). This ETL feature of 2005 is amazing once you start using it. The data transformation part of this tool will do most everything you mention above. I think it's worth the time to learn it and the learning curve is pretty low. I created my first packages and was transforming data the same day ...
You might want to check out the info on it. Here's a first step tutorial link that can get you started.
http://msdn.microsoft.com/en-us/library/ms169917.aspx
You might want to check out the info on it. Here's a first step tutorial link that can get you started.
http://msdn.microsoft.com/en-us/library/ms169917.aspx
ASKER
Hi tshel,
i do indeed have the ssis tool, i have created packages before for extracting data from differnet sources into tables into mssql, and for loading tables with data. but never to do this.
Have you used it for this specifically when you tried it out ( i mean to clean up the data )
thanks for you respnse also!!
i do indeed have the ssis tool, i have created packages before for extracting data from differnet sources into tables into mssql, and for loading tables with data. but never to do this.
Have you used it for this specifically when you tried it out ( i mean to clean up the data )
thanks for you respnse also!!
ASKER
Did you use Fuzzy lookups?
Yes Absolutely, In fact that's the main reason we use it and I have for some pretty major conversion projects.
When Using the Data Transformation portion of BIDS in the Data Flow task when creating packages I have used many of the transformations to cleanse the data ie: Regular expressions for email matching, changing phone numbers to all numbers vs dashes and parenthesis, Grabbing like titles with Fuzzy lookup & dumping the errors to a seperate files etc etc...... It is a very powerful conversion tool once you get the knack of it... Here is a pretty good starting step by step article on using the data flow in ssis...
http://dotnetslackers.com/articles/sql/UsingSQLServerIntegrationServicesDataFlowSQLServer2005.aspx
Enjoy!
Thanks
When Using the Data Transformation portion of BIDS in the Data Flow task when creating packages I have used many of the transformations to cleanse the data ie: Regular expressions for email matching, changing phone numbers to all numbers vs dashes and parenthesis, Grabbing like titles with Fuzzy lookup & dumping the errors to a seperate files etc etc...... It is a very powerful conversion tool once you get the knack of it... Here is a pretty good starting step by step article on using the data flow in ssis...
http://dotnetslackers.com/articles/sql/UsingSQLServerIntegrationServicesDataFlowSQLServer2005.aspx
Enjoy!
Thanks
ASKER
Ok tshel, thanks for the link too, i will have a go at this and let u know how i get on.
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
woops my apologies: the how to on fuzzy grouping is ... http://msdn.microsoft.com/en-us/library/ms142155.aspx
again once you get there you will see all the other how to's on just about anything you may need to trnasform... take care.
again once you get there you will see all the other how to's on just about anything you may need to trnasform... take care.
Hey checking in on you. How are you getting along on this? ...need any more assistance...thanks
Just checking up on how you again are coming along & if you need any additional assistance on this or is your issue resolved? --- thanks!
ASKER
thanks Tshel, i ended up using some nested queies and pulled back what ever results i could.
The business needed the results very quickly and i did not have the time to develop a technique to pulling and looking through the data. thanks so much for your advise along the way.
regards,
putoch
The business needed the results very quickly and i did not have the time to develop a technique to pulling and looking through the data. thanks so much for your advise along the way.
regards,
putoch
I have used Capscan Matchcode Batch in the past when I was working in Direct Marketing (Junk Mail!), and I am fairly sure they aupport Irish EIRE data in addition to UK Royal Mail addresses.
Might be worth a trial, they are extremely good pieces of software for address standardisation and matching.
http://www.capscan.com/matchcode_batch.aspx
http://www.qas.co.uk/products/clean-and-suppress-address-data/batch.htm