Link to home
Start Free TrialLog in
Avatar of BarryWilliams
BarryWilliams

asked on

how can I anonymise name and address data

how can I anonymise name and address data.
I am using Oracle but a generic SQL solution would be acceptable.

Barry

Avatar of momi_sabag
momi_sabag
Flag of United States of America image

you can either use your own update queries to change the data, or you can use some tools that are on the market (thier are not free though).
I know a tool that is called JumbleDB,
it is pretty good and I find it useful
Avatar of Lee Wadwell
Hi BarryWilliams,

if you want to really 'mess' the data and keeping nothing meaningful (e.g. titles "MR', MRS' ... street types 'ST', 'AVE' etc) then you can do simple character replacement. e.g.
SELECT name,
       TRANSLATE(UPPER(name),'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
                                      DBMS_RANDOM.STRING('U', 26)) AS obsfucated_name
  FROM your_table

Similar things can be done with numbers.

lwadwell
You can also update the various columns to meaningless data by inserting subtrings from NEW_IDs (i.e. GUIDS).  
Another technique that I have used is to create a table of FirstNames and a table of LastNames and then a table of LastNames and FIrstnames.  Then you create a process/procedure that updates rows with randomly selected FirstNames and LastNames such that they don't match the existing FirstName and LastName in the row.  This takes a bit longer but results in more "human consumable" and more "realizstic" results.  (Note, this can also be extended to street names, cities, and states if need be.)  (I did this due to a requirement that the "development database" not have "prduction data" but have "production like" data and I should warn you that it can take a while, as will most of the obfuscation methods. ;-)
As with everything else in the IT world, it kind of depends on what you are trying to do.  
Avatar of Member_2_276102
Member_2_276102

You don't disclose what platform you're running Oracle on, but maybe it doesn't matter. Note that things _might_ be complicated if there are any referential constraints over any column that you want to anonymize. Names and addresses don't usually make for good referential columns, but it does happen.

Tom
Avatar of BarryWilliams

ASKER

Hi 8080

Thanks for your suggestion, which is exactly what I am doing.
I have names of 3,000 commercial organisations that I want to replace by realistic-looking names, but not real names.

I have extracted the separate words in the names of organisations.
For example, from 'Bank of America' I get 'Bank', 'of' and 'America.
Now I want to create false names from combinations of these words, taken at random.
Right nowl I am stumbling over the Oracle SQL to retrieve a random selection from the list of separate words.
I am using Oracle SAMPLE to get a random selection of words.

I think I have to use PL/SQL and use a cursor to loop through all my 3,000 organisation names, but my PL/SQL is very rusty.
Am I thinking along the right lines ?
I would certainly appreciate any help you can give me to get going with PL/SQL.

Thanks very much.

Barry
 
Barry,
You may want to consider creating an application instead of trying to do this in PL/SQL with a cursor.  I say this out of experience in dealing with this sort of thing. ;-)
One issue you will encounter is that, using your example words extracted from Bank Of America as a start point, you may well wind up "of", "the", and "a" as words in your list and then "The Of A" or (having extracted words from "First National Bank of Tulsa" and "The Dollar Store") "Of The First" as your psuedo-name.   In other words, the approach I mentioned works well for human names but quickly breaks down for commercial names unless you create an application that lets you tweak the "rules" for name creation.  
It has been a while since I wrote my name obfuscator and, when I did, I took something of a short cut in dealing with names that were illogical (e.g. "The Of A").  I would simply NULL out the "bad" names and then rerun the app so that it would generate new names for any that were NULL.   Of course, this meant that there was human intervention involved in the process because someonehad to review the newly created names for "questionable" ones.  (I put a copy of all newly generated names into a separate table for review and, at the start of the name generation process, I truncated that table. ;-)  
Hi 8080 Diver
Thanks a lot for your follow-up response.
PL/SQL is as far as I want to go in the direction of an applicaition ;-)
I have already planned for the kind of rpoblems ypou mention, and my solution will be pretty much in line with yours.
My strategy, of course, is to run a series of Scripts which I would run repetitively to generate, clean-up and finally publish the new anonymous names.
I wonder whether you have any SQL or PL/SQL that you would be abel to share with me to get me started ?

Regards

Barry

 
Barry,
Let me check my "repository".  Part of the process is going to depend on things like:
  1. How many distinct "words" have you extracted?
  2. How many "words do you want in your "Company Name" that you generate?
  3. Do you want to allow for "Company Names" that range from 1(or 2) "words" to a maximum number of words?
  4. If you have single word names (e.g. "Wachovia"), do you plan to allow that to appear somewhere or do you want to substitute some other "word" for, for instance "Wachovia" and then use that substituted word as a potential obfuscated name inclusion?
One approach might be to define a set of single word obfuscations (i.e. omit the articles and most adjectives from your extracted list) and make an initial pass that assigns those to the first X rows as the "company name".  Then add, for instance, a single adjective (not a single specific adjective but, rather, only one adjective per name)  to those names for the next Y rows.  Then proceed to 2 adjectives or, perhaps, adding an "of [something]" phrase.
I think I would probably add 3 or 4 columns to the extracted words table and have those autofill with GUIDS because GUIDS make reasonably good psuedo random numbers for ordering and selecting rows.  In fact, I might create a table of "PsuedoCompanyNames" that I would pump the results into.  That table would have a GUID (NewID, I think) for the identity column and a unique index on the CompanyName column.  Then you could just crank up a generator that builds 1, 2, 3, 4, . . . , N "word" names which you could then assign to the data.
To generate the 1-4 "word" names you could do something like the attached.
Note: I make the following assumptions in the Attached SQL
  • There is a table (NewCompanyNames) with at least 2 columns, one being some sort of identity column and the other (CompanyName) being an appropriately size VarChar;
  • There is a table (ExtractedWords) that has at least one VarChar column (
    WordFromName) that has teh words extracted from the names.
INSERT INTO NewCompanyNames
(
 CompanyName
)
SELECT WordFromName AS CompanyName
FROM   ExtractedWords
UNION ALL
(
	SELECT W1.WordFromName + ' ' + W2.WordFromName AS CompanyName
	FROM 
	(
   SELECT WordFromName
   FROM ExtractedWords
  ) W1,
	(
   SELECT WordFromName
   FROM ExtractedWords
  ) W2
) N2
UNION ALL
(
	SELECT W1.WordFromName + ' ' + W2.WordFromName + ' ' + W3.WordFromName AS CompanyName
	FROM 
	(
   SELECT WordFromName
   FROM ExtractedWords
  ) W1,
	(
   SELECT WordFromName
   FROM ExtractedWords
  ) W2,
	(
   SELECT WordFromName
   FROM ExtractedWords
  ) W3
) N3
UNION ALL
(
	SELECT W1.WordFromName + ' ' + W2.WordFromName + ' ' + W3.WordFromName + ' ' + W4.WordFromName AS CompanyName
	FROM 
	(
   SELECT WordFromName
   FROM ExtractedWords
  ) W1,
	(
   SELECT WordFromName
   FROM ExtractedWords
  ) W2,
	(
   SELECT WordFromName
   FROM ExtractedWords
  ) W3,
	(
   SELECT WordFromName
   FROM ExtractedWords
  ) W4
) N4;

Open in new window

Hi Barry,

How far do you need to go with making these names anonymous?  Is changing 'Bank of America' to 'Bank of Thailand' sufficient?  Will changing 'Bank of America' to 'Shoes of England' suffice?   Or do you need something like "Betty's Flower Shop"?

If either of the first two, inline subsitution should work just fine.  Identify the keywords that you want to change and let a series of simple updates do the work.  Look at the samples below.

There are some things to note.

- Each list is circular.  In the first update, 'America' is changed to 'Thailand', 'Thailand' is changed to 'Canada', 'Canada' is changed to 'Germany', etc.  In the last item, 'England' is changed back to 'America'.  Run the update once and 'Bank of America' becomes 'Bank of Thailand'.  Run it a second time and you get 'Bank of Canada'.  Run the update as many times as you want.

- Each list should strive to have a different number of keywords (cases).  This will help to better randomize the results when the number of passes exceeds the number of items in the list.  (If you run the first update 7 times, you're running it one more time than than there are elements in the list.  So the result will be that the first update will have changed 'America' all the way through the list back to the top, settling on 'Thailand'.  If you also run the second update 7 times, you're running it 3 more times than there are elements in that list.  The second update will have changed 'Shoes' all the way through the list back to the top, settling on 'Plumbing'.)

It's certainly not perfect, but if you can identify one or two keywords in each name that can be put into these lists you can easily obfuscate the names into something that looks like a real name.  No one will ever see 'Shoes of England' and understand that the real name is the 'Bank of America'.


Good Luck,
Kent

UPDATE Companies
SET Name = 
  CASE when instr (Name, 'America')  > 0 then replace (Name, 'America', 'Thailand')
       when instr (Name, 'Thailand') > 0 then replace (Name, 'Thailand', 'Canada')
       when instr (Name, 'Canada')   > 0 then replace (Name, 'Canada',  'Germany')
       when instr (Name, 'Germany')  > 0 then replace (Name, 'Germany', 'Mexico')
       when instr (Name, 'Mexico')  > 0 then replace (Name, 'Mexico',  'England')
       when instr (Name, 'England') > 0 then replace (Name, 'England', 'America)
  END;

UPDATE Companies
SET Name =
  CASE when instr (Name, 'Shoes') > 0         then replace (Name, 'Shoes', 'Leather Goods')
       when instr (Name, 'Leather Goods') > 0 then replace (Name, 'Leather Goods', 'Bakery')
       when instr (Name, 'Bakery') > 0        then replace (Name, 'Bakery', 'Plumbing')
       when instr (Name, 'Plumbing') > 0      then replace (Name, 'Plumbing', 'Shoes')
  END;

Open in new window

Hi Kdo

Thanks for taking the time to suggest an approach to my problem.

I have about 3,000 Organisation names to scramble.
Your approach would seem OK for small numbers where I could create the SQL Scripts manually.
Is this what you have in mind ?

For 3,000 organisaitons, I guess I could generate an SQL Script automatically to populate your Scripts with Organisation names taken from the existing Organisations table but it does not seem like an intuitive approach.

Barry
 
Barry,
Populating the table required to drive the scripts that Kdo suggested is pretty much what I had suggested earlier.  
It may not seem "intuitive" to you but what is your "intuitive" alternative?  Would you want to create the SQL Scripts by hand?  Do you want to populate tables to drive the SQL Scripts by hand?  
Populating a table based upon words from the existing names seems pretty intuitive to me.  I might even go so far as to use that table to create a substitution table that has each word that you want to obfuscate in the first column and then randomly assigned words in the second column (making sure, of course, that both columns don't have the same word in them ;-).  That way, you can make a pass therough the existing data and substitute for each of the words you want to obfuscate.  That seems the most intuitive approach to me . . . at least, a lot more intuiive than doing it all by hand. ;-)
Hi 8080 Diver,

Thanks very much - your suggestion was GREAT ;)

I really appreciate your taking the time to write out the SQL Script for me.

I tried it out this afternoon and I think it will solve my problem.

I am not sure how the Experts Exchange system works.

I am quite happy to send you money if that is appropriate.

If not, any time you want help with any Database problem just let me know by visitng my Database Answers Web Site - http://www.databaseanswers.org/index.htm

Regards

Barry
 
Hi Barry,

It's really not that bad a job.  :)  

You can identify keywords in the names that will trigger an automatic change.  Best is if you can identify two or more per name.

SELECT count(*) from Companies where upper (Name) like '%BANK%';
SELECT count(*) from Companies where upper (Name) like '%PLUMBER%';
SELECT count(*) from Companies where upper (Name) like '%BAKERY%';
SELECT count(*) from Companies where upper (Name) like '%PAWN SHOP%';

etc.

Then find the 'remainder' by selecting those that you haven't yet fit.

SELECT count(*) from Companies
WHERE not (upper (Name) like '%BANK%'
             OR upper (Name) like '%PLUMBER%'
             OR upper (Name) like '%BAKERY%'
             OR upper (Name) like '%PAWN SHOP%');

You'll should be able to identify a large percentage of the names by keyword search.  Undoubtedly, you'll find some that don't.  The exceptions are handled by making up a name for each exception and placing the exceptions in their own SQL update list so that repeated executions cycle them, too.


Kent


Barry,
The process at EE is to pick the answer (or answers) that best suits your need and solves your problem.  You then award points to that answer. ;-)
Glad to be of assistance.
Kent

I guess I misunderstood your suggestion.
It seems you are suggesting a mix of manual and automatically scripted methods.
If I am right, it would be very useful if you could clarify how you see this working in practice.

Thanks

Barry
Sure.

I'm thinking 100% automatic scripting.  There is some up front (manual) work to identify all of the keywords that you want to substitute.  My last post (https://www.experts-exchange.com/questions/25000767/how-can-I-anonymise-name-and-address-data.html?cid=1572&anchorAnswerId=26140843#a26140843) is the first step at identifying the keywords.

Once the scripts are built, the can be run any number of times and at any time to generate a new set of random company names.


Diver and I are thinking very similarly.  He has you put keywords into a table and generates completely fictitious names by randomly picking words from the list to create a fictitious name.

I'm envisioning a series of scripts that contain direct substitution of keywords.  The difference is subtle.  Diver's solution generating something like "Anne's Boutique", "Bob's Cafe", "Dan's Bistro", etc for all of the names regardless of the original name.  My solution substitutes keywords so that "Bank of America" could become something like "Bail Bonds of Miami".

Both solutions have the potential to create duplicates.  That may be a concern.


Kent

@KDO,
Both solutions have the potential to create duplicates.  That may be a concern.
This is why I would suggest putting the obfuscated names in a table that has a uniqueness constraint on the name.  Of course, you have to handle the duplicate row error by ignoring it but that is fairly easily done.  Once you have enough names in the Obfuscated table, you can pretty much assign them at random.
Now, with your substitution method, if you have a 1 to 1 association between, for instance "Bail Bonds" and "Bank", then you will only have duplications if there are already duplications.  If, on the other hand, you drive the the process off of the list of words (which have been inserted into a table with a uniqueness constraint on the words) using the unioned selections that I suggested, then you won't get any duplicates, by definition.
However, as you pointed out, our two solutions are variations of the same theme. ;-)

Kent

I have tinkered with your original SQL a little but haven't figured out a way to avoid Cartesion products.
I end up, of coiurse, with each word being associated with each other word'.
It seems I will have to write some PL/SQL using cursors to step through my table of original organisation names and my 'obfuscation replacements' .
Or is there a neat and simple solution that hasn't occurred to me ?

Thanks very much.

Barry


Barry,
If you extract the key words from your original names (whether by hand or via a SQL query) and add them to a table (that has an identity column as well) and you set up your obfuscation words in another table (that also has a column that you use to randomly assign a number between 1 and the number of rows in your original name words table), you can match them on the identity columns, i.e.
SELECT OrignalWord,
                 ObfuscationWord
FROM      OriginalWords OW
INNER JOIN ObfuscationWords OB
ON  OB.ID=OW.ID;
Then you need to use the results of that to UPDATE you data so that you REPLACE the OriginalWOrd in the CompanyName column with the ObfuscationWord.
Of course, this presupposes that you have as many obfuscation words as you do original words.
You might want to create a temp table with the results of the above query and then drive your update off of that temp table.
Hi 8080 Diver

Thanks very much for your continued assistance.

It's about 11.00 pm here in London, England so I'll be able to put your suggestions into practice tomorrow when I get to work.

Barry


ASKER CERTIFIED SOLUTION
Avatar of 8080_Diver
8080_Diver
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Hi Barry,

There is absolutely no way that the SQL I provided can generate a cartesian product.  To generate a cartesian, there must be a join and the sql above simply scans through 1 column of a table, changing the items in it.


Kent

Hi Barry,

There is a  variation on both themes (Diver's and mine) that can eliminate duplicates.

Using either Diver's tables or my inline SQL generate a unique, sequential, value for each Company using the row_number() function.  (If the table already has an identity column, that will work fine.)

Of course, the values stored in FakeName1 and FakeName2 can also be included in the query in part of the WITH structure without having to actually store them in a table.


Kent

create table FakeName1 (
  id    integer,
  Text  varchar (20)
);

create table FakeName2 (
  id    integer,
  Text  varchar (20)
);

INSERT into FakeName1 (Text) values ('Anne');
INSERT into FakeName1 (Text) values ('Bob');
INSERT into FakeName1 (Text) values ('Claire');
INSERT into FakeName1 (Text) values ('David');
INSERT into FakeName1 (Text) values ('Elaine');

INSERT into FakeName2 (Text) values ('Bakery');
INSERT into FakeName2 (Text) values ('Cafe');
INSERT into FakeName2 (Text) values ('Deli');
INSERT into FakeName2 (Text) values ('Engineering');
INSERT into FakeName2 (Text) values ('Fashions');
INSERT into FakeName2 (Text) values ('Garden Supplies');

WITH N1 (NameCount)
AS
(
  SELECT max (id) FROM FakeName1
),
N2 (OccupationCount)
AS
(
  SELECT max (id) FROM FakeName2
),
N3 (CompanyName, Seq)
(
  Select Name, row_number () over () FROM Companies
)
UPDATE COMPANIES
SET Name = (SELECT Text FROM FakeName1 WHERE Id = mod (Seq, NameCount) + 1) || '''s ' ||
           (SELECT Text FROM FakeName2 WHERE Id = mod (Seq, OccupationCount) + 1;

Open in new window

8080 Diver was very helpful, very professional and quick to respond and provide clarification where appropriate.