friskyweasel
asked on
Detecting junk data
hello all -
our database platform is MS SQL Server 2005, and we have a database containing between 8-10 million records, primarily user data (first name, last name, email, phone, address, etc...)
I need to come up with a way for us to return records that are obviously incorrect data. I'd like to start with the name fields (first/last name), and go on to possibly include address, phone, email, etc...
Does anyone know of any prebuilt SQL functions that test for this invalid name, email, etc...? Or any documentation on the topic? I googled on sql server junk records, sql server data validation, etc... but haven't had a lot of luck
our database platform is MS SQL Server 2005, and we have a database containing between 8-10 million records, primarily user data (first name, last name, email, phone, address, etc...)
I need to come up with a way for us to return records that are obviously incorrect data. I'd like to start with the name fields (first/last name), and go on to possibly include address, phone, email, etc...
Does anyone know of any prebuilt SQL functions that test for this invalid name, email, etc...? Or any documentation on the topic? I googled on sql server junk records, sql server data validation, etc... but haven't had a lot of luck
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
names would be hard to validate.. unless all you're looking for is character checking (i.e. doesn't contain numbers or symbols - besides the apostrophe or dash)..
for this type of thing, look up regular expressions.
for this type of thing, look up regular expressions.
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
thanks thinkpaper for the suggestions - those will come in handy
gbshahaq - i wanted to ask if you could go into a little more detail/description of your solution for bad name data:
PATINDEX('%[0-9][*/,$£% &()"!]%',N ameField) > 0
so i notice you have 2 sections in brackets [], one for numbers and one for stray characters...i assumed (wrongly i think) that this would evaluate to true if ANY of the conditions were met (contained numbers OR stray characters)....but when i run it as-is above, my results are much more limited than they are if i do this:
PATINDEX('%[0-9]%',NameFie ld) > 0
OR
PATINDEX('%[*/,$£%&()"! ]%',NameFi eld) > 0
so i was wondering, is there a way to make this evaluate as an "OR" inside the PATINDEX function without actually having to split it out via sql as i have above? would it take a performance hit by doing it as i am above as opposed to doing everything in the PATINDEX function?
gbshahaq - i wanted to ask if you could go into a little more detail/description of your solution for bad name data:
PATINDEX('%[0-9][*/,$£%
so i notice you have 2 sections in brackets [], one for numbers and one for stray characters...i assumed (wrongly i think) that this would evaluate to true if ANY of the conditions were met (contained numbers OR stray characters)....but when i run it as-is above, my results are much more limited than they are if i do this:
PATINDEX('%[0-9]%',NameFie
OR
PATINDEX('%[*/,$£%&()"!
so i was wondering, is there a way to make this evaluate as an "OR" inside the PATINDEX function without actually having to split it out via sql as i have above? would it take a performance hit by doing it as i am above as opposed to doing everything in the PATINDEX function?
ASKER
this seems to have accomplished what i was asking above gbshahaq...
PATINDEX('%[0123456789*/,$ £%&()"! ]%',NameFi eld) > 0
thanks for the help everyone - gbshahag i'm giving you 90% since i used your answers - throwing 10% to thinkpaper for the US postal service suggestion
thanks!
PATINDEX('%[0123456789*/,$
thanks for the help everyone - gbshahag i'm giving you 90% since i used your answers - throwing 10% to thinkpaper for the US postal service suggestion
thanks!
thanks for the points friskyweasel - glad you got the PATINDEX syntax correct...
do investigate Regular Expressions though - some clever bods have come up with a workable Reg Ex solution in T-SQL - and if you're using 2005, you can use CLR stored procedures - i found one in C# the other day that you could use.
They're useful for matching against patterns of data, such as :
is "asdfdfdf" a pronouncable word?
it gets very complicated though (too much for me)...
do investigate Regular Expressions though - some clever bods have come up with a workable Reg Ex solution in T-SQL - and if you're using 2005, you can use CLR stored procedures - i found one in C# the other day that you could use.
They're useful for matching against patterns of data, such as :
is "asdfdfdf" a pronouncable word?
it gets very complicated though (too much for me)...
ASKER
any ideas on ways to sniff out names that are obviously incorrect, but don't contain oddball characters? in other words, your example should allow me to easily detect:
"!e544$%%^" as a bad name....
but what if it's: "asdfdfdf"
i'm not sure if there's any known method for detecting this sort of thing or not....