Need to scramble data in a column SQL Server

bgarib
bgarib used Ask the Experts™
on
I'm looking for a function to scramble or randomize a column in my existing database. What I mean by this is that I have a few thousand records of patients. I want to randomize their last names so that the last names do not match the rest of the record so they can't be directly identified. I'll do a similar process with other columns (first name, date of birth, etc.) but what I need is a way to swap out the names so the last name is written to a random record.

Thanks!
Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
Is this so that you can get a sample data set which doesn't match the real world?

Post the table definition and we can come up with something. Does the table have a primary key (IDENTITY)? That will make things a bit easier to scramble.

Author

Commented:
Yes, we want to not give out protected health information that can uniquely identify a patient. It's for a demo database where we need to show some pseudo-real data. For all sakes and purposes the table structure simplified is:

PatientID                  numeric        (primary key)
PatientLastName     varchar

Thanks
Is patientID a numerically ascending number with no holes in it?

Does it have to be guaranteed that every single person has to have their surname swapped? Is it sufficient to just shift all the surnames down by say four records?

I just want to get the whole story before I write any code.
Ensure you’re charging the right price for your IT

Do you wonder if your IT business is truly profitable or if you should raise your prices? Learn how to calculate your overhead burden using our free interactive tool and use it to determine the right price for your IT services. Start calculating Now!

Author

Commented:
The PatientID is numerically ascending but I can't be guaranteed that there are no holes in it.

Yes, every person has to have the name swapped.

I also entertained shifting all the names down then just manually fixing the ones at the top. This sounds like not as good a way but I would use it if I had to.

Thanks!
Senior DBA
Most Valuable Expert 2018
Top Expert 2014
Commented:
I'm pretty sure this will do it, including making sure that an entry does not accidentally scramble back to itself (what a pain that was! :-) ).

It uses two work tables.  I haven't tried to improve it any performance-wise (for a few thousand rows it shouldn't matter much), just tried to get it working :-).

To try it with your real data, replace #patientTable with your actual table name (in the first INSERT and the last UPDATE).


SET NOCOUNT ON
-- create test version of main data table
IF OBJECT_ID('tempdb.dbo.#patientTable') IS NOT NULL
    DROP TABLE #patientTable
CREATE TABLE #patientTable (
    patientId NUMERIC,
    patientLastName VARCHAR(4)
    )
INSERT INTO #patientTable VALUES(1,'A')
INSERT INTO #patientTable VALUES(3,'B')
INSERT INTO #patientTable VALUES(11,'C')
INSERT INTO #patientTable VALUES(27,'D')
INSERT INTO #patientTable VALUES(59,'E')
INSERT INTO #patientTable VALUES(66,'F')
INSERT INTO #patientTable VALUES(82,'G')
INSERT INTO #patientTable VALUES(88,'H')
INSERT INTO #patientTable VALUES(91,'I')
INSERT INTO #patientTable VALUES(94,'J')

SET NOCOUNT OFF
PRINT '--Original data.'
SELECT * FROM #patientTable
SET NOCOUNT ON

-- create work tables to be used to do scramble
IF OBJECT_ID('tempdb.dbo.#patientIds') IS NOT NULL
    DROP TABLE #patientIds
IF OBJECT_ID('tempdb.dbo.#patientScramble1') IS NOT NULL
    DROP TABLE #patientScramble1
CREATE TABLE #patientIds (
    ident INT IDENTITY(1,1),
    patientId NUMERIC,
    UNIQUE CLUSTERED (ident)
    )
CREATE TABLE #patientScramble1 (
    ident INT IDENTITY(1,1),
    patientIdent INT,
    uniqIdent UNIQUEIDENTIFIER,
    UNIQUE NONCLUSTERED (ident)
    )

DECLARE @maxPatientIdent INT

INSERT INTO #patientIds
SELECT patientId
FROM #patientTable
ORDER BY patientId
SET @maxPatientIdent = SCOPE_IDENTITY()

INSERT INTO #patientScramble1
SELECT ident AS patientIdent, NEWID() AS uniqIdent
FROM #patientIds
ORDER BY uniqIdent
 
-- make sure no patient's scramble slot is pointing to its own original slot
WHILE EXISTS(SELECT 1 FROM #patientScramble1 WHERE ident = patientIdent)
BEGIN
    UPDATE patScram1
    SET patientIdent = CASE WHEN patScram1.ident = patIdInSameSlot.ident
        THEN (SELECT patientIdent FROM #patientScramble1 WHERE #patientScramble1.ident =
            CASE WHEN patIdInSameSlot.ident = @maxPatientIdent THEN 1 ELSE
                patIdInSameSlot.ident + 1 END)
        ELSE patIdInSameSlot.ident END
    FROM #patientScramble1 patScram1
    INNER JOIN (
        SELECT ident
        FROM #patientScramble1 ps1
        WHERE ident = patientIdent AND NOT EXISTS (
            SELECT 1
            FROM #patientScramble1 ps2
            WHERE ps2.ident = ps2.patientIdent AND ps2.ident = ps1.ident + 1 )
    ) AS patIdInSameSlot ON patScram1.ident IN (
        SELECT patIdInSameSlot.ident UNION ALL
        SELECT CASE WHEN patIdInSameSlot.ident = @maxPatientIdent THEN 1 ELSE
            (patIdInSameSlot.ident + 1) END)
END

-- scramble the data!
UPDATE old
SET patientLastName = new.patientLastName
FROM #patientScramble1 patScram1
INNER JOIN #patientIds patIdsOld ON patScram1.ident = patIdsOld.ident
INNER JOIN #patientIds patIdsNew ON patScram1.patientIdent = patIdsNew.ident
INNER JOIN #patientTable old ON old.patientId = patIdsOld.patientId
INNER JOIN #patientTable new ON new.patientId = patIdsNew.patientId

SET NOCOUNT OFF
PRINT '--Scrambled data (last name).'
SELECT * FROM #patientTable ORDER BY patientId

Author

Commented:
Sorry for the long delay. I couldn't get that query to work. I was able to sucessfully scramble data using an MS Access tool I found online if it helps anyone.

http://alexdyb.blogspot.com/2007/03/how-to-scramble-data.html

If I don't have to worry about "duplicate" data (i.e. key fields) then I found the following has worked well for me .  It has the added bonus of retaining the data formatting (i.e. Spaces, Capitals, Punctuation, etc) and it is realatively fast even against large data sets:

--Example Usage:
update myTable_
      set social_security_number_ = dbo.p_ScrambleText_(social_security_number_)
         ,name_last_ = dbo.p_ScrambleText_(name_last_)



if exists (select 1
            from  sysobjects
           where  id = object_id('v_RandomNumber_')
            and   type = 'V')
   drop view v_RandomNumber_
go
 
CREATE VIEW v_RandomNumber_
AS
SELECT RAND() as RandomNumber_
GO
 
if exists (select 1
          from sysobjects
          where  id = object_id('p_ScrambleText_')
          and type in ('IF', 'FN', 'TF'))
   drop function p_ScrambleText_
go
 
CREATE FUNCTION p_ScrambleText_(@Value_ varchar(255))
RETURNS varchar(255)
AS
BEGIN
 
	DECLARE @Length_ int
	DECLARE @RetVal_ varchar(255)
	DECLARE @Counter_ smallint
	DECLARE @RandomNumber_ float
	DECLARE @CurChar_ as char(1)
    DECLARE @ASCIIValue_ as smallint
	SET @Length_ = LEN(@Value_)
	SET @RetVal_ = ''
    SET @ASCIIValue_ = 0
 
	SET @Counter_ = 1
	WHILE @Counter_ < (@Length_ + 1)
	BEGIN
        --Process next available character 
		SET @CurChar_ = SUBSTRING(@Value_, @Counter_, 1)
        --Get the ASCII value of the character we are processing
        SET @ASCIIValue_ = ASCII(@CurChar_)
        --Generate a new random number (remeber this generates a number between 0 and 1)
        SET @RandomNumber_ = (select RandomNumber_ from v_RandomNumber_)
       --Using the ASCII value see if we are working with a number 
		IF @ASCIIValue_ between 48 and 57 
		BEGIN
            --Use the random number to get a value between 0 and 9 to be used as the random replacement number
			SET @CurChar_ = Convert(Char,Convert(tinyint, 9 * @RandomNumber_)) 
        END
        --if not a number check for an uppercase letter
        ELSE IF @ASCIIValue_ between 65 and 90
        BEGIN
            --Now use the random number to get a value between 1 and 26 which we will use to generate a random replacement letter of the same case
			SET @CurChar_ = CHAR(Convert(tinyint, (25 * @RandomNumber_ + 1)) + 63) 
        END
        --If not an uppercase letter check for a lowercase letter
		ELSE IF @ASCIIValue_ between 97 and 122 
        BEGIN
            --Now use the random number to get a value between 1 and 26 which we will use to generate a random replacement letter of the same case
			SET @CurChar_ = CHAR(Convert(tinyint, (25 * @RandomNumber_ + 1)) + 96) 
		END 
        --NOTE: If the character being processed is not a number value or a letter (A-Z or a-Z) then we keep that character
        --      This helps preserve formatting found with data like Phone Number, SSN's, emails etc.
        SET @RetVal_ = @RetVal_ + @CurChar_
		SET @Counter_ = @Counter_ + 1 
	END 
 
	RETURN @RetVal_ 
END
GO 
if exists (select 1
          from sysobjects
          where  id = object_id('p_ScrambleDate_')
          and type in ('IF', 'FN', 'TF'))
   drop function p_ScrambleDate_
go
 
CREATE FUNCTION p_ScrambleDate_(@Date_ datetime)
RETURNS datetime
AS
BEGIN
 
	DECLARE @RandomNumber_ tinyint
    --Generate a random number between 1 and 25 (no special reason for 25 but it is a reasonable number of years to decrement date.
    SET @RandomNumber_ = Convert(tinyint, ((24) * (SELECT RandomNumber_ FROM v_RandomNumber_) + 1)) 
    SELECT @date_ = DateAdd(d,@RandomNumber_,@date_) --Increment date by RandomNumber_ days
    SELECT @date_ = DateAdd(m,@RandomNumber_,@date_) --Increment date by RandomNumber_ months
    SELECT @date_ = DateAdd(yy,(@RandomNumber_ * -1),@date_) --Decrement date by RandomNumber_ years.
 
	RETURN @Date_ 
END
GO 

Open in new window

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial