?
Solved

Detecting junk data

Posted on 2007-11-15
8
Medium Priority
?
422 Views
Last Modified: 2012-05-05
hello all -
our database platform is MS SQL Server 2005, and we have a database containing between 8-10 million records, primarily user data (first name, last name, email, phone, address, etc...)

I need to come up with a way for us to return records that are obviously incorrect data. I'd like to start with the name fields (first/last name), and go on to possibly include address, phone, email, etc...

Does anyone know of any prebuilt SQL functions that test for this invalid name, email, etc...? Or any documentation on the topic? I googled on sql server junk records, sql server data validation, etc... but haven't had a lot of luck

0
Comment
Question by:friskyweasel
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 3
  • 3
  • 2
8 Comments
 
LVL 18

Accepted Solution

by:
Sham Haque earned 1800 total points
ID: 20290662
email is a fairly straightforward and well documented one.
I use this:


			CASE
			WHEN CHARINDEX(' ',LTRIM(RTRIM([Email]))) > 0 THEN 'UNKNOWN@UNKNOWN.COM'-- No embedded spaces
			WHEN LEFT(LTRIM([Email]),1) = '@' THEN 'UNKNOWN@UNKNOWN.COM'  -- '@' can't be the first character of an email address
			WHEN RIGHT(RTRIM([Email]),1) = '.' THEN 'UNKNOWN@UNKNOWN.COM' -- '.' can't be the last character of an email address
			WHEN CHARINDEX('.',[Email],CHARINDEX('@',[Email])) - CHARINDEX('@',[Email]) <=1 THEN 'UNKNOWN@UNKNOWN.COM' -- There must be a '.' after '@'
			WHEN LEN(LTRIM(RTRIM([Email]))) - LEN(REPLACE(LTRIM(RTRIM([Email])),'@','')) > 1 THEN 'UNKNOWN@UNKNOWN.COM'-- Only one '@' sign is allowed
			WHEN CHARINDEX('.',REVERSE(LTRIM(RTRIM([Email])))) < 3 THEN 'UNKNOWN@UNKNOWN.COM'-- Domain name should end with at least 2 character extension
			WHEN (CHARINDEX('.@',[Email]) > 0 AND CHARINDEX('..',[Email]) > 0) THEN 'UNKNOWN@UNKNOWN.COM'-- can't have patterns like '.@' and '..'
			WHEN CHARINDEX('@',[Email]) = 0 THEN 'UNKNOWN@UNKNOWN.COM'
			WHEN [Email] IS NULL THEN 'UNKNOWN@UNKNOWN.COM'
			ELSE [Email]
			END

Open in new window

0
 
LVL 18

Assisted Solution

by:Sham Haque
Sham Haque earned 1800 total points
ID: 20290691
there are several address cleaning services out there - doing matching against electoral registers, credit files etc to produce valid mailable addresses - that would be your best bet.
For UK addresses, I could point you to some companies I've used before...

You can do a check for numbers and invalid punctuation characters in name columns for example:

WHERE PATINDEX('%[0-9][*/,$£%&()"!]%',NameField) > 0
0
 

Author Comment

by:friskyweasel
ID: 20290992
thanks guys for the input - both suggestions sound real good

any ideas on ways to sniff out names that are obviously incorrect, but don't contain oddball characters? in other words, your example should allow me to easily detect:
"!e544$%%^"  as a bad name....

but what if it's: "asdfdfdf"

i'm not sure if there's any known method for detecting this sort of thing or not....



0
Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
LVL 16

Expert Comment

by:ThinkPaper
ID: 20291083
names would be hard to validate.. unless all you're looking for is character checking (i.e. doesn't contain numbers or symbols - besides the apostrophe or dash)..
for this type of thing, look up regular expressions.
0
 
LVL 16

Assisted Solution

by:ThinkPaper
ThinkPaper earned 200 total points
ID: 20291112
i haven't used this before.. but i would think the us post office has an accurate account of addresses..

you may be able to use their db to validate your addresses

http://www.usps.com/ncsc/addressinfo/addressinfomenu.htm?from=zclsearch&page=ais
0
 

Author Comment

by:friskyweasel
ID: 20291434
thanks thinkpaper for the suggestions - those will come in handy

gbshahaq - i wanted to ask if you could go into a little more detail/description of your solution for bad name data:

PATINDEX('%[0-9][*/,$£%&()"!]%',NameField) > 0

so i notice you have 2 sections in brackets [], one for numbers and one for stray characters...i assumed (wrongly i think) that this would evaluate to true if ANY of the conditions were met (contained numbers OR stray characters)....but when i run it as-is above, my results are much more limited than they are if i do this:

PATINDEX('%[0-9]%',NameField) > 0
OR
PATINDEX('%[*/,$£%&()"!]%',NameField) > 0

so i was wondering, is there a way to make this evaluate as an "OR" inside the PATINDEX function without actually having to split it out via sql as i have above? would it take a performance hit by doing it as i am above as opposed to doing everything in the PATINDEX function?
0
 

Author Comment

by:friskyweasel
ID: 20291568
this seems to have accomplished what i was asking above gbshahaq...
PATINDEX('%[0123456789*/,$£%&()"!]%',NameField) > 0

thanks for the help everyone - gbshahag i'm giving you 90% since i used your answers - throwing 10% to thinkpaper for the US postal service suggestion

thanks!
0
 
LVL 18

Expert Comment

by:Sham Haque
ID: 20291732
thanks for the points friskyweasel - glad you got the PATINDEX syntax correct...
do investigate Regular Expressions though - some clever bods have come up with a workable Reg Ex solution in T-SQL - and if you're using 2005, you can use CLR stored procedures - i found one in C# the other day that you could use.
They're useful for matching against patterns of data, such as :
is "asdfdfdf" a pronouncable word?

it gets very complicated though (too much for me)...
0

Featured Post

[Webinar] Lessons on Recovering from Petya

Skyport is working hard to help customers recover from recent attacks, like the Petya worm. This work has brought to light some important lessons. New malware attacks like this can take down your entire environment. Learn from others mistakes on how to prevent Petya like worms.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Microsoft Access is a place to store data within tables and represent this stored data using multiple database objects such as in form of macros, forms, reports, etc. After a MS Access database is created there is need to improve the performance and…
In today's business world, data is more important than ever for informing marketing campaigns. Accessing and using data, however, may not come naturally to some creative marketing professionals. Here are four tips for adapting to wield data for insi…
Using examples as well as descriptions, and references to Books Online, show the different Recovery Models available in SQL Server and explain, as well as show how full, differential and transaction log backups are performed
Viewers will learn how to use the INSERT statement to insert data into their tables. It will also introduce the NULL statement, to show them what happens when no value is giving for any given column.

764 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question