JGH5
asked on
Replace Alpha and Numeric patterns with X and 9 dynamically based on a look up table for the entire database
I am trying to replace columns in a database with either a numeric when a number pattern is found and an alpha when an alpha pattern is found. I have several columns in the [TestDB] such as [LastName], [FirstName] and [TelephoneNum].
I need to add Something like Replace((Replace('X','[A-Z ]')), '9','[0-9]'))) where [LastName] and [FirstName] would result with XXXXX and 99999 for [TelephoneNum].
Furthermore, I have a list of columns in a table I need to find and then replace with an X or 9.
Is there a way to use the FindText function and query a table with targeted columns and replace with X or 9.
I use this Stored Procedure to find columns one at a time but need to come up with a better solution to find multiple columns based on a look up table and then replace the values based on the alpha or numeric pattern found in that particular column:
Example Input Parameter: exec sp_FindText '%LastName%', '[TEST_DB]'
I need to add Something like Replace((Replace('X','[A-Z
Furthermore, I have a list of columns in a table I need to find and then replace with an X or 9.
Is there a way to use the FindText function and query a table with targeted columns and replace with X or 9.
I use this Stored Procedure to find columns one at a time but need to come up with a better solution to find multiple columns based on a look up table and then replace the values based on the alpha or numeric pattern found in that particular column:
Example Input Parameter: exec sp_FindText '%LastName%', '[TEST_DB]'
IF EXISTS (SELECT name FROM sysobjects WHERE name = N'sp_FindText' AND type = 'P')
DROP PROCEDURE sp_FindText
go
CREATE PROCEDURE [dbo].sp_FindText
@text varchar(255)
, @dbname varchar(64) = NULL
AS
SET NOCOUNT ON
IF @dbname IS NULL
BEGIN
--enumerate all databases
DECLARE #db CURSOR FOR
SELECT [name] FROM master..sysdatabases
declare @c_dbname varchar(64)
OPEN #db FETCH #db INTO @c_dbname
WHILE @@FETCH_STATUS <> -1 --and @MyCount < 500
BEGIN
execute sp_FindText @text, @c_dbname
FETCH #db INTO @c_dbname
END
CLOSE #db DEALLOCATE #db
END --if @dbname is null
ELSE
BEGIN --@dbname is not null
declare @sql varchar(250)
--create the find like command
SELECT @sql = 'SELECT ''' + @dbname + ''' as db, o.name, m.definition '
SELECT @sql = @sql + ' FROM ' + @dbname + '.sys.sql_modules m '
SELECT @sql = @sql + ' INNER JOIN ' + @dbname + '..sysobjects o on m.object_id=o.id'
SELECT @sql = @sql + ' WHERE [definition] like ''%' + @text + '%'''
print @sql
EXEC (@sql)
END --@dbname is not null
go
Oh, one more comment to the code. I'm actually looking up user_type_ID datatypes as this is a better fit for our data. If you want a more generic lookup on the system datatypes, modify the code in the "create dynamic SQL Query to loop through each of the databases" section to link the system_type_id's rather than the user_type_id's in the sys.types and sys.columns tables.
I've read the original q three times and I'm still not sure what you want to do, but here's my best guess.
For a given list of columns in a table (or tables?? or entire db's tables??), you want to replace the current values in those columns as follows:
1) if the current value in a given byte location is A-Z (and a-z!?), replace it with X.
2) if the current value in a given byte location is 0-9, replace it with 9.
(so result will look kinda like a COBOL PIC clause :-)).
Thus, 'ABC1' would become 'XXX9', and '3/2B/4DE1' would become '9/9X/9XX9'.
Is that right?
For a given list of columns in a table (or tables?? or entire db's tables??), you want to replace the current values in those columns as follows:
1) if the current value in a given byte location is A-Z (and a-z!?), replace it with X.
2) if the current value in a given byte location is 0-9, replace it with 9.
(so result will look kinda like a COBOL PIC clause :-)).
Thus, 'ABC1' would become 'XXX9', and '3/2B/4DE1' would become '9/9X/9XX9'.
Is that right?
ASKER
Hi Scott,
Yes, this is needed to de-identify production data to use for testing.
Yes, this is needed to de-identify production data to use for testing.
ASKER
On a simpilar note something like this where i hard code the column and table:
UPDATE
[Customer]
SET
CustID =
(
Case when CustID like '%[A-Z]%' then 'X'
Case when CustID like '%[a-z]%' then 'x'
Case when CustID Like '%[0-1]%' then '2'
Case when CustID Like '%[2-3]%' then '4'
Case when CustID Like '%[4-5]%' then '6'
Case when CustID Like '%[5-6]%' then '7'
Case when CustID Like '%[7-8]%' then '9'
ELSE
CustID
END
)
UPDATE
[Customer]
SET
CustID =
(
Case when CustID like '%[A-Z]%' then 'X'
Case when CustID like '%[a-z]%' then 'x'
Case when CustID Like '%[0-1]%' then '2'
Case when CustID Like '%[2-3]%' then '4'
Case when CustID Like '%[4-5]%' then '6'
Case when CustID Like '%[5-6]%' then '7'
Case when CustID Like '%[7-8]%' then '9'
ELSE
CustID
END
)
Hi JGHS,
Do you need to replace ALL characters, or is it simply enough to replace any string with a single X, and any number with a single 9?
If so, this will do the trick:
I've modified the earlier procedure I put up. It now lists all the tables and columns from non-system databases that contain either number or string fields, based on their system_type_id's. The list at the top shows which ID's are included and what they are replaced with. You can modify the second dynamic SQL statement to include more or exclude some of these types by adding/removing them from the "IN" statement.
The procedure will loop through all your databases and list the table names, column names and data ID's for the selected data types. It will then generate update statements for each of those columns and replace number values with 9, and text/string values with X.
Make sure to replace the database name at the very top of the statement to your database name. I've also disabled the Exec (@SQL) command for the update statement to prevent accidental update. Instead it will print the generated command for review.
Let me know if this is what you need, or if further modifications are required.
Do you need to replace ALL characters, or is it simply enough to replace any string with a single X, and any number with a single 9?
If so, this will do the trick:
USE [ExpertExchange]
GO
/****** Object: StoredProcedure [dbo].[uSP_anomymize_data] Script Date: 11/05/2013 09:50:41 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
Alter Procedure [dbo].[uSP_anomymize_data]
/*
--Replaces string fields with X and number fields with 9
--Check is done by datatype:
name system_type_id Replaced_with
text 35 X
tinyint 48 9
smallint 52 9
int 56 9
float 62 9
ntext 99 X
decimal 106 9
numeric 108 9
bigint 127 9
varchar 167 X
char 175 X
nvarchar 231 X
nchar 239 X
sysname 231 X
*/
AS
Set nocount on
--Declare variables
Declare @SQL nvarchar(max)
, @Table_name nvarchar(100)
, @column_name nvarchar(100)
, @database_name nvarchar(100)
, @data_type int
--Declare table variable to hold database names
Declare @V_databases table
(Database_name nvarchar(100))
--Insert database names into table variable
Insert into @V_databases
Select name
from sys.databases
where Database_ID > 4 --Excludes system tables
--Create table to hold all table and column names for included data_type
Create table #DataToUpdate
( Database_name nvarchar(100)
, Table_name nvarchar(100)
, column_name nvarchar(100)
, data_type int
)
--Insert all table and column names for included data_type
--Declare cursor to check each database
Declare T_databases cursor for
Select Database_name
from @V_databases
Order by Database_name
Open T_databases
Fetch Next from T_databases
into
@database_name
While (@@FETCH_STATUS <> -1)
Begin
-- Create dynamic SQL query to loop through each of the databases and insert relevant records in temp table
Set @SQL =
'Insert into #DataToUpdate
Select ' + '''' + @database_name + ''' ,
o.name as ''Table'' ,
c.name as ''Column_name'' ,
t.system_type_id as ''Data_type''
from ' + '' + @Database_name + '.sys.types t
inner join ' + '' + @Database_name + '.sys.columns c
on t.system_type_id = c.system_type_id
inner join ' + '' + @Database_name + '.sys.objects o
on c.object_id = o.object_id
and o.type = ''U''
Where t.system_type_ID in (35,99,167,231,239,175,48,52,56,62,106,108,127)'
-- For testing, print @SQL
-- print @SQL
-- Execute the SQL query
Exec (@SQL)
-- Clear @SQL variable for next query
Set @SQL = ''
Fetch Next from T_databases
into
@database_name
End
Close T_databases
Deallocate T_databases
--Clear @database_name variable
Set @database_name = ''
--Create cursor which loops through all selected tables creates update statements for the columns
Declare T_Datacheck cursor for
Select Database_name
, Table_name
, Column_name
, Data_type
From #DataToUpdate
Order by Table_name
Open T_Datacheck
Fetch next from T_Datacheck
Into
@Database_name
, @Table_name
, @column_name
, @Data_type
While (@@FETCH_STATUS) <> -1
Begin
--Dynamic SQL statement to create the update statements
Set @SQL = 'Update ' + @database_name + '.dbo.' + @table_name + '
Set ' + @column_name + ' = Case
When ' + ltrim(rtrim(str(@data_type,4,0))) + ' in (48,52,56,62,106,108,127)
Then 9
When ' + ltrim(rtrim(str(@data_type,4,0))) + ' in (35,99,167,231,239,175)
Then ''X''
Else ' + @column_name + '
End'
-- For testing purposes, print @SQL
Print @SQL
--Execute the dynamic SQL to insert possible values into the final dataset
-- Exec (@SQL)
--Clear the dynamic SQL variable for next loop
Set @SQL = ''
Fetch next from T_Datacheck
Into
@Database_name
, @Table_name
, @column_name
, @Data_type
End
Close T_Datacheck
Deallocate T_Datacheck
--Drop temp table
Drop table #DataToUpdate
Set nocount off
GO
I've modified the earlier procedure I put up. It now lists all the tables and columns from non-system databases that contain either number or string fields, based on their system_type_id's. The list at the top shows which ID's are included and what they are replaced with. You can modify the second dynamic SQL statement to include more or exclude some of these types by adding/removing them from the "IN" statement.
The procedure will loop through all your databases and list the table names, column names and data ID's for the selected data types. It will then generate update statements for each of those columns and replace number values with 9, and text/string values with X.
Make sure to replace the database name at the very top of the statement to your database name. I've also disabled the Exec (@SQL) command for the update statement to prevent accidental update. Instead it will print the generated command for review.
Let me know if this is what you need, or if further modifications are required.
Oh, and of course the "ALTER Procedure" statement should be changed tp "CREATE Procedure"....
ASKER
I need to replace all of the characters or numbers in some of the tables and columns in the database.
When you say "some tables and columns", does that mean you have a fixed list? Does it mean you're not looking to replace ALL text and numbers in your database, but only some of it? If there are no obvious selection criteria like datatypes, how do you provide your query with that information? Some sample data would be useful.
ASKER
Need to change all data in these columns (small example set) and they have different data types and all of the information in the Column/Row has to be replaced with repeating X or 9 to be able to view the obfascated data with the same size after the databases changes are made.
Table Schema Column
Agent dbo LastName
Agent dbo FirstName
Agent dbo CorpName
Agent dbo DateOfBirth
Agent dbo SSN
Agent dbo TIN
AgentAddress dbo AddrLine1
AgentAddress dbo AddrLine2
AgentAddress dbo City
Table Schema Column
Agent dbo LastName
Agent dbo FirstName
Agent dbo CorpName
Agent dbo DateOfBirth
Agent dbo SSN
Agent dbo TIN
AgentAddress dbo AddrLine1
AgentAddress dbo AddrLine2
AgentAddress dbo City
First let's try the simplest method: a string of embedded REPLACEs. If that performs OK, we're good. If not, we'll have to try to do something else that can be made to run faster.
The basic statement would be:
SELECT
column_data,
REPLACE(REPLACE(REPLACE(RE PLACE(REPL ACE(REPLAC E(REPLACE( REPLACE(RE PLACE(REPL ACE(
REPLACE(REPLACE(REPLACE(RE PLACE(REPL ACE(REPLAC E(REPLACE( REPLACE(RE PLACE(REPL ACE(
REPLACE(REPLACE(REPLACE(RE PLACE(REPL ACE(REPLAC E(
REPLACE(REPLACE(REPLACE(RE PLACE(REPL ACE(REPLAC E(REPLACE( REPLACE(RE PLACE(REPL ACE(
column_data,
'a', 'X'),'b', 'X'),'c', 'X'),'d', 'X'),'e', 'X'),'f', 'X'),'g', 'X'),'h', 'X'),'i', 'X'),'j', 'X'),
'k', 'X'),'l', 'X'),'m', 'X'),'n', 'X'),'o', 'X'),'p', 'X'),'q', 'X'),'r', 'X'),'s', 'X'),'t', 'X'),
'u', 'X'),'v', 'X'),'w', 'X'),'x', 'X'),'y', 'X'),'z', 'X'),
'0', '9'),'1', '9'),'2', '9'),'3', '9'),'4', '9'),'5', '9'),'6', '9'),'7', '9'),'8', '9'),'9', '9')
FROM (
SELECT 'afl;kj0[-2/;3bnlkrjq _nrt98a[-w0jaZsv' AS column_data
) AS test_data
Then we just have to put that into a proc and have it UPDATE *every* applicable column on a table in *one* UPDATE statement. While that will generate more log activity at once, it will be vastly more efficient than multiple passes thru the same table, which will be a performance nightmare on large tables.
If you'd like sample code on doing that within a proc, just let me know.
The basic statement would be:
SELECT
column_data,
REPLACE(REPLACE(REPLACE(RE
REPLACE(REPLACE(REPLACE(RE
REPLACE(REPLACE(REPLACE(RE
REPLACE(REPLACE(REPLACE(RE
column_data,
'a', 'X'),'b', 'X'),'c', 'X'),'d', 'X'),'e', 'X'),'f', 'X'),'g', 'X'),'h', 'X'),'i', 'X'),'j', 'X'),
'k', 'X'),'l', 'X'),'m', 'X'),'n', 'X'),'o', 'X'),'p', 'X'),'q', 'X'),'r', 'X'),'s', 'X'),'t', 'X'),
'u', 'X'),'v', 'X'),'w', 'X'),'x', 'X'),'y', 'X'),'z', 'X'),
'0', '9'),'1', '9'),'2', '9'),'3', '9'),'4', '9'),'5', '9'),'6', '9'),'7', '9'),'8', '9'),'9', '9')
FROM (
SELECT 'afl;kj0[-2/;3bnlkrjq _nrt98a[-w0jaZsv' AS column_data
) AS test_data
Then we just have to put that into a proc and have it UPDATE *every* applicable column on a table in *one* UPDATE statement. While that will generate more log activity at once, it will be vastly more efficient than multiple passes thru the same table, which will be a performance nightmare on large tables.
If you'd like sample code on doing that within a proc, just let me know.
ASKER
i tried running the replace script on one table and it did not make the changes. could you show me a proc with this instead?
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
Thank you for your help, the Update script works.
I wrote below stored procedure a while ago to check for specific values throughout our database. I think it's pretty close to what you are looking for, although perhaps not a 100% fit. But it should probably get you started.
The procedure takes a datatype and a data value as input parameters. It will first check which tables contain columns that use that particular datatype and stores that in a temp table. After that a cursor is used to check for each of the columns with the correct datatype whether they contain the data value provided. This is stored in a second temp table which is the final result set.
I'm pretty sure you could rewrite this slightly to look up the more generic values you're looking for. You can use the final result set in the #DataExists table to run your replace statements.
Open in new window
Let me know if this is clear or whether more explanation is required. As an example, the final data set would look something like the attached screenshot.
ExampleData.png