slatefamily asked:
Desperate for help! Soundex Function with a twist

There has to be an easier way to do this!!  I have a file with 123,000 records.  Several of them are duplicates, but not exact matches.  I recently had help here identifying duplicates with fuzzy logic using a Soundex function.  Now I am wondering if there is a way to create a tab that copies these "identified duplicates" to another worksheet.  For example:

I created my Soundex query in MS Access and exported it to Excel; the query created a field with a value like A125.
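(For context: Access has no built-in Soundex, so a query like that calls a user-defined VBA function in a module. Below is a minimal sketch of the standard algorithm, purely illustrative:)

Public Function Soundex(ByVal txt As String) As String
    ' Standard Soundex: keep the first letter, code the rest 0-6,
    ' skip vowels/H/W/Y (code 0) and adjacent duplicate codes,
    ' then pad/trim to 4 characters. Simplification: H/W also reset
    ' the previous code here, which strict Soundex does not do.
    Dim i As Long, ch As String, code As String
    Dim prev As String, result As String
    Const Codes As String = "01230120022455012623010202"  ' A..Z -> digit
    txt = UCase$(txt)
    For i = 1 To Len(txt)
        ch = Mid$(txt, i, 1)
        If ch >= "A" And ch <= "Z" Then
            code = Mid$(Codes, Asc(ch) - 64, 1)
            If result = "" Then
                result = ch                 ' first letter kept as-is
            ElseIf code <> "0" And code <> prev Then
                result = result & code
            End If
            prev = code
        End If
    Next i
    Soundex = Left$(result & "000", 4)      ' e.g. Soundex("Smith") -> S530
End Function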

I am working on identifying duplicates that need to be merged together.

For example, Worksheet 1 (notice that ID is unique, while Name/NPI have similar variations):

ID          Name                    NPI                    Soundex
Q-1234    Smith, Doctor      13390482            A125
Q-5585    Smith, D@                                        A125
Q-1263    Smith, Dr.            013390482          A125
Q-7825    Smith, Doc           133904820          A125
Q-6525    Smyth, Doctor N.                             A125

What I need Worksheet 2:

Name                 Merge   Merge     Merge    Merge   Merge
Smith, Doctor    Q-1234  Q-5585   Q-1263   Q-7825   Q-6525


Any help will be useful.  Right now I am manually going through the Soundex values and copying and pasting rows to another file whenever I think they are duplicates that need to be merged.
ASKER CERTIFIED SOLUTION from aikimark:
You can also do a Copy and then Paste Special > Transpose on the first-column values in each group.
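(A macro version of that tip, as a minimal sketch: it assumes the cells to merge are currently selected and that the output goes to a sheet named "Sheet2"; the names are illustrative.)

Sub TransposeSelectionToMergeRow()
    ' Write the currently selected cells across one row on Sheet2,
    ' i.e. the macro equivalent of Copy + Paste Special > Transpose.
    Dim cel As Range, dest As Range, c As Long
    Dim ws As Worksheet
    Set ws = ThisWorkbook.Worksheets("Sheet2")        ' assumed output sheet
    Set dest = ws.Cells(ws.Rows.Count, 1).End(xlUp).Offset(1, 0)
    c = 0
    For Each cel In Selection.Cells
        dest.Offset(0, c).Value = cel.Value
        c = c + 1
    Next cel
End Sub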
slatefamily (ASKER):
That might be an idea if I were able to better identify the possible duplicates... Currently I am still going line by line, validating whether each row is a duplicate or not.  If I have to go line by line anyway, I don't really need a formula for this, because the copying and pasting will not take any less time.
I think we can help make this a sane task.

Please post a workbook with two populated worksheets.  One sheet is your starting point and the second worksheet is the data in the form you need it.  The first worksheet only needs to have about 50 rows worth of data.

Question: How do you determine which of the names is to be used?
If you could help with this, you would be a lifesaver.  I would owe you more than 500 points, but that is all it would let me post!!  I was going to use the most complete name, but really it doesn't matter, as long as I can determine the possible duplicates, because these will be merged.  Also, I will be doing another match against the medical staff database, which will have the exact name it should be.  In other words, I don't really care what is listed as the name, as long as all the possible duplicates are listed together, because they will all be merged.

Let me get the file together (I need to clean it up some)
Here is an example; on the second sheet I started the template showing the layout I need.

Names will look a little funny because I did a bit of scrambling.
Duplicate-Merge-Example.xlsx
Would you prefer to select the Soundex values or the CODE values?
The Code values are all unique, so I'm not sure if that would help with selection?
If you are selecting cells to be laid out in the manner you describe, would you rather select cells in the CODE column or cells in the Soundex column?  Please answer that question.
I am identifying Code values that should be merged.  But the Soundex was only a tool to help me identify "like" fields; it did not give me enough to go on, and I still had to look through the rows.

So, for example, not all of the Soundex codes are true duplicates.  See how the rows below all have a Soundex code of A132, yet notice how I split up the true duplicates:

duplicate/merge group 1
18958      Q-93429      qbqd,Zoilo R      A132
105070      Q-488931      qbqd,Zoilo@      A132
33259      Q-67400      qbqd,Zoilo@      A132
duplicate/merge group 2
767      Q-43307      qbboud,Joseph M Md      A132
10537      Q-25795      qbboud,Joseph q      A132
1434      Q-43342      qbboud,Joseph      A132
duplicate/merge group 3
85662      Q-30421      qbboud,Soo K      A132
23607      Q-29211      qbboud,Soo      A132
85696      Q-24513      qbboud,Soo@      A132
33422      Q-65679      qbboud,Soo@      A132
96163      Q-89291      qbboud,Soo@      A132
duplicate/merge group 4
118950      Q-111061      qbboudi,Jqck      A132
10536      Q-98105      qbboudi,Jqck      A132
71427      Q-13045      qbboudi,Jqck Md@      A132
61840      Q-48391      qbboudi,Jqck Md@      A132
68104      Q-15206      qbboudi,Jqck Md@      A132

Rows with no duplicate/merges:
10539      Q-4725      qbbud,Ziqd q      A132

38114      Q-37895      qbdi,Sqlqhqdin      A132

26058      Q-85551      qbdi,Ziq      A132

10540      Q-18389      qbdi,Zqhrq J      A132

8017      Q-88346      qbbott,Kqthleen      A132

25257      Q-20818      qbboud,Chqrles F      A132

18956      Q-44182      qbqd,Jorge R      A132

20638      Q-21425      qbboud,George      A132
But if you cannot help with the "fuzzy logic" of narrowing down the Soundex matches, I guess I would select the Soundex column to put into the Worksheet 2 format.

I could delete the Soundex value, or change it, where a row is not a duplicate, but I would still have to go row by row through 125,000 records.

There has to be a better way ;(
OK, I might have a better way of identifying the duplicates myself.  Here is another file.  I took the names out altogether, but it is the same idea.

I would want to concatenate the "Code" values into columns, using "Group" as the selection key.  For example, turning this:

Group       ID            Code          Soundex
15099      17747      Q-20714      G426
15099      17745      Q-11780      G426
15099      17743      Q-26443      G426

into this:
Group     Merge    Merge   Merge
15099   Q-20714  Q-11780  Q-26443
DataQual1-example.xls
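(That layout can be automated once a Group column exists. A minimal sketch, assuming the input sheet is sorted by Group, with Group in column A and Code in column C, headers in row 1, and output on "Sheet2"; all names are illustrative:)

Sub MergeByGroup()
    ' One output row per Group, with that group's Code values laid
    ' out across the columns to the right of the Group number.
    Dim wsIn As Worksheet, wsOut As Worksheet
    Dim r As Long, outRow As Long, outCol As Long
    Dim curGroup As Variant

    Set wsIn = ThisWorkbook.Worksheets("Sheet1")      ' assumed input sheet
    Set wsOut = ThisWorkbook.Worksheets("Sheet2")     ' assumed output sheet
    wsOut.Cells.Clear
    wsOut.Range("A1").Value = "Group"

    outRow = 1
    curGroup = ""
    For r = 2 To wsIn.Cells(wsIn.Rows.Count, "A").End(xlUp).Row
        If wsIn.Cells(r, "A").Value <> curGroup Then  ' a new group starts
            curGroup = wsIn.Cells(r, "A").Value
            outRow = outRow + 1
            outCol = 1
            wsOut.Cells(outRow, 1).Value = curGroup
        End If
        outCol = outCol + 1
        wsOut.Cells(outRow, outCol).Value = wsIn.Cells(r, "C").Value
        wsOut.Cells(1, outCol).Value = "Merge"        ' extend the header row
    Next r
End Sub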
As I wrote earlier, we have to go back to the database.  You need to generate two Soundex values in your query: one for the first name and one for the last name.

Sort the query by these two columns.

There is no guarantee that the different rows you want to select will have the same two Soundex values, but it will likely get you closer than you are.  You will still need to process this by hand unless you have a more reliable NPI or name source.

I've mocked up a macro that will transform the group you select, but that is based on the single Soundex column in your sample.  However, I'd prefer you generate a new sample with the two Soundex columns.
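(A sketch of how the two values might be produced, assuming the "Last, First Middle" format seen in the samples and the Soundex() sketch from earlier; an Access query could then call these directly, e.g. SELECT SdxLast([Full Name]), SdxFirst([Full Name]) ...)

Public Function SdxLast(ByVal fullName As String) As String
    ' Soundex of the text before the first comma (the last name).
    SdxLast = Soundex(Trim$(Split(fullName & ",", ",")(0)))
End Function

Public Function SdxFirst(ByVal fullName As String) As String
    ' Soundex of the first word after the comma (the first name);
    ' returns "000" when no first name is present.
    Dim rest As String
    rest = Trim$(Split(fullName & ",", ",")(1))
    If InStr(rest, " ") > 0 Then rest = Left$(rest, InStr(rest, " ") - 1)
    SdxFirst = Soundex(rest)
End Function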
I do have some columns that may help, like NPI, but sometimes they are not filled out.  I did a trial of a data-quality package that let me weight fields and set percentages for how close a match must be.  That is what gave me the groups in the example above.  I will try the Soundex on the multiple fields in the database.
That's what I'm waiting on.
Would it need to be created in the database?  For example, what if I find that the other attachment I put on here is more accurate in how it identified duplicates?
Or do you mean create another Soundex and then run it through the other program from attachment 2?
Yes.  Drop back into the database.  Generate output with the two Soundex columns I've described.  Include the NPI column and any column that might help differentiate the docs.

I only need 50 rows.  I don't need the after/processed worksheet this time, since I think I know what you're after.
Can I do a Soundex-like formula for the NPI, which is normally a 9-digit number?
The Access query is still running; once it is done I'll run it through the data-quality program to produce groups.  Thank you for helping :)
Just ran it through the data-quality program to add the groups.
Thanks
What is the group number column?
Where is the NPI?

I only see rows that have duplicates.  Please include a batch that has non-duplicated data.  It is difficult to find a solution if you keep changing the description of the problem.  As stated at the start of this question and in your prior example data, you have a mixture of duplicated and non-duplicated rows.  That is the problem I'm trying to help you with, not a duplicates-only view.

Part of the problem is in identifying what constitutes the eligible data for the group and either automating or assisting the data transformation.
OK, sorry.  I thought you wanted me to group it again by running the data through the data-quality software I downloaded as a trial version.
Here is another sample; sorry I misunderstood what I needed to run.
Duplicate-Merging-Example3.xls
Much better.  Thank you.
First Analysis:

Please take a look at row 61 (A120, P362).  There are two different NPI numbers associated with this doc (9477589083, 9700889873).  This might be a new NPI value or may be father/son docs.

NPI 9093098969 is associated with two different Soundex tuples ([A400, M300], [M300, M300])

NPI 9093936400 is associated with two different Soundex tuples ([A135, C642], [A135, G600])

NPI 9346899953 is associated with two different Soundex tuples ([A636, E421], [M300, E421])
I started noting these and stopped around row 1674.
9356559079
A432	A420
A435	A420

9366503409
A325	K500
A325	K516

9467687004
A216	K300
A256	K300

9477588308
A200	B620
C510	B620

9487965778
A352	T000
M300	T000

9538948980
A514	A500
A514	J400

9548387857
A260	J500
A600	J500

9659349363
A535	D640
A535	M620

9770564577
A250	P360
A252	P360

9770599787
A260	M240
A260	M400

9780089908
M300	R210
M300	V500

9805857554
A260	B600
A260	B630

9879530949
A530	S000
M300	S000


I think you have a problem with your name parsing (somewhere along the way).  Some of your last-name values are "MD".  Look at rows 38-39 for an example.  Several of the NPIs with different Soundex tuples listed above are the result of this last-name parsing problem.

Are the name data in your database stored as seen in the [Full Name] column?  I think I was commenting on the parsing process in the prior related question.

It looks like you've got some clinics and hospital organizations in your list.

It looks like you've got some Resident teams in your list.

It would be best if you limit your data transformation to only docs (one rough way is sketched at the end of this comment).

Are the @ rows significant?

Are the N/a rows significant?

=======
What I want to do is to ensure we are starting with reliable and clean data.  We can then apply some bulk process to do the majority of the transformation, excluding some 'messy' groups.  Then facilitate the manual process.

BTW...I'm glad you posted a lot of rows.
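(If the name text is the only signal, one rough way to push the non-doctor rows out of scope is a keyword flag. This is a sketch with an assumed, editable keyword list; a real organization-type column would be far more reliable:)

Public Function LooksLikeOrg(ByVal txt As String) As Boolean
    ' True when the name text contains an organization/placeholder
    ' keyword. Callable from an Access query, e.g.
    '   WHERE LooksLikeOrg([Full Name]) = False
    Dim kw As Variant
    txt = UCase$(txt)
    For Each kw In Array("CLINIC", "HOSPITAL", "NURSING HOME", "GROUP", _
                         "RESIDENT", "CENTER", "N/A", "UNKNOWN", "INVALID")
        If InStr(txt, kw) > 0 Then
            LooksLikeOrg = True
            Exit Function
        End If
    Next kw
End Function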
The @ rows are doctors that are inactive.  The end user doesn't see them as an option to choose, but I still need to merge them if they have a duplicate entry that is active.

N/A rows: bad data.  I will need to extract those out for deletion, the same as any name that contains "invalid" or "unknown".  I will also want to extract anything that looks like a facility or practice name (e.g., Nursing Home).  Basically, I want to extract anything that doesn't look like a real person (e.g., A1-Resident).

I agree.  The reason I am doing this is that I am working on cleaning up the dictionary, and it is extremely messy.

I was thinking I could go back into the application to pull more fields, like address or license number, if that would help?

I think MD is in another field that gets pulled into that Full Name; going forward, we will not be putting "MD" as part of the name.  As part of the clean-up I'll remove those kinds of things (the same with middle name/initial, with and without periods).  There are two fields, one called Full Name and one called First_Middle_Last; I think that is the difference in whether you see MD or some of the other variants.  I can bring both of those fields in?

I did not know a good way to extract all of the other rows out that are not doctors.
Please take a look at row 61 (A120, P362).  There are two different NPI numbers associated with this doc (9477589083, 9700889873).  This might be a new NPI value or may be father/son docs:

This one I am pretty sure it is Father/Son, because the credentialing database that sent me the file contains two records as well.  We did not ask them to provide us with DOB, but now I am seeing that would have been helpful.  

1.  NPI 9093098969 is associated with two different Soundex tuples ([A400, M300], [M300, M300])
2.  NPI 9093936400 is associated with two different Soundex tuples ([A135, C642], [A135, G600])
3.  NPI 9346899953 is associated with two different Soundex tuples ([A636, E421], [M300, E421])

Unfortunately, those are examples of records that were entered into the database incorrectly or multiple times, and they should be merged; there was no validation in place.  We have since made the NPI field mandatory, and it acts as a unique ID, so no duplicates can be added.  For the second example, all of these IDs would be merged for the one person:
ID
23733
56152
62860
67432
72614
85903
92413
63032
68164
72791
121491
This dictionary started in 2009, so it has had a few years to accumulate bad data because of the way it was set up.  This has been quite the project!
What is the nature of the back end data repository from which you are extracting the data you've posted? (flat file, Access, Excel, RDBMS like SQL Server/Oracle/DB2/MySQL, XML, other)

Yes.  I'd like to see the First_Middle_Last column as well as the full name column.

It is time to look at the metadata.  Do you have a data dictionary?

I hope that there will be one or more fields/columns/elements that can be used to exclude organizations and other rows that should not be processed.  If this is encoded only in the full-name text field, then we'll deal with it from that standpoint; however, that isn't as reliable, or as good a performer, as a separate column.

Are you the owner of the data or is this clean-up for some other purpose?
The way it was exported is the way it is listed in the dictionary.  The application is called Midas.  I can extract the data as either a txt file or an Excel file; then I import it into an Access database so that it is somewhat easier to manipulate.  I'll get a listing of all the fields in the dictionary, and I'll also pull all the fields in.  I will have to run it out of Midas again because I did not pull all the fields out initially.
The data got into Midas from an HIS system; it was being interfaced over dynamically.  We have shut that down and are now going to do file loads, because the HIS system was being used for other things; for example, people were entering fax numbers in the provider HIS system.  I believe those are all deleted now, but I'm just giving you some background.

Several hospitals in the system all with different HIS systems and credentialing databases.
The clean-up will serve several purposes, so I am not the sole owner of the dictionary, but I am so far the only one who has been able to tackle data integrity.

There have been many steps in this project.

Once I identify all of the rows that are for a clinic or are N/A (unless they have an NPI number, in which case I need to look a little closer to see whether the row should be applied to a single doctor), the Midas vendor will delete them out of the database.

The vendor also gave me the template for putting all the Q-codes together for merging.  Once I identify the records for merging, the vendor will complete that task.
I faced a similar data clean-up challenge when I was brought into Duke to help with their recredentialing and QA work back in the mid-90s.  Fortunately, my contact has developed into a power user, and her personality is such that she won't be pushed around by the docs or the chairs.

Her work is of sufficiently high quality that her reports are placed on the top of the stack when JCAHO pops in for a visit (surprise or scheduled).
any progress today?
I thought I posted earlier, but I guess it didn't submit?  Anyway, I added a field called "Other Numbers", which is a doctor identifier; some of the other available fields have no data in them, so they would not be helpful.  The Other Numbers field created some duplicates (as far as the ID and Q-code are concerned).

Take a look
DuplicateQuery.xlsx
The names are much cleaner.  I was able to delete some "N/a" and "unknown" rows and easily remove the "@" character.

So, no luck in getting the DOB for the docs?
Although the address is helpful, there are instances of father and son docs working in the same practice.  We still face the same problem, illustrated with this set of codes:
Q-87919
Q-17178
Q-22984
Q-31220
Q-32369
Q-34369
Q-53747
Q-58830
Q-62460
Q-68812
Q-87316


It might be possible to identify these groups and process them manually.  However, you will still need some data to differentiate them.  If we can identify and include those columns, then this might be fully automated.

I'm not overly concerned with the duplicates.  I do find it curious that the duplicate entries are the only ones where one of the pair has an address.  I suspect that is the source of the duplication.

I don't see the A1- Resident column. Have you gathered the data dictionary information?

=========
Although we're playing with this data in Excel, it is quite possible that the duplicate merge list can be created in Access.  We can talk about that when our analysis is complete.
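(If it does move to Access, the vendor layout could come from a small concatenating function. A sketch using DAO, with the table and field names tblProviders, GroupID, and Code assumed:)

Public Function MergeCodes(ByVal grp As Long) As String
    ' Comma-separated list of the Code values in one duplicate group.
    ' Usage in a query:  SELECT GroupID, MergeCodes(GroupID) AS Merges
    '                    FROM tblProviders GROUP BY GroupID;
    Dim rs As DAO.Recordset, s As String
    Set rs = CurrentDb.OpenRecordset( _
        "SELECT Code FROM tblProviders WHERE GroupID = " & grp)
    Do While Not rs.EOF
        s = s & rs!Code & ","
        rs.MoveNext
    Loop
    rs.Close
    If Len(s) > 0 Then s = Left$(s, Len(s) - 1)       ' drop trailing comma
    MergeCodes = s
End Function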
For all these records there is no DOB; unfortunately, what was put into the system (Midas) to begin with is what I exported.  In the future we will have more information, but right now this is all I have.
The "other number" field and address was the only other field I was able to bring in that contained data.  A1- I did not extract it, it just wasn't in the rows I copied and pasted into excel.

If I had done a column collect for the addresses, it might not have duplicated that way.  The export was originally going to produce one line for every item listed in "Other Number", so I used a concatenate function to pull all the "Other Numbers" into one field.

I agree that it would best be done in Access.  I was also looking to see if I could do a Metaphone query (I think that is the name), so it could look for like-sounding names such as Cline and Kline.
As for father/son: if there is a different NPI, I would not merge them.  If there are like names with no unique differentiation, then I will end up merging them.
If there is any doubt, I would not merge the records.  Unless there are additional columns in some other table or data source (even flat files), this will not be a 100% conversion.  Moreover, I found 11 groups (out of 208 candidate docs) that will require manual processing.  Without other data, or external help resolving these, I would hand the list to the powers that be with the understanding that those groups can't be merged by you.

Question: What is the Provider ID field?

>>cline and kline
That is the kind of matching a phonetic code aims at, though note that Soundex keeps the first letter, so Cline and Kline would still get different codes; Metaphone would catch that pair.  In any case, we aren't really doing that kind of fuzzy matching here; we are trying to clean up the data.  I only see one doc that might have a name typo (Q-956181)

==========
On your next iteration, please include the middle initial.
I put in the middle initial, but this go-around I did not concatenate the "Other Number" field because the processing time was taking too long.
ee.xls
The Provider ID field is a primary key, like an AutoNumber, in the Midas system (it is not generated by us).

I know it will still require some manual work, and any questionable ones will be sent to the medical staff for a final decision.

Again, thanks for your help.
Clean-up:
434 unknown or N/A rows deleted
71 invalid rows
A parsing problem exists with Provider ID = 81183, Code = Q-10070.
Some parsing problems still exist, with MI values picking up the title (DO, MD, CRNP, and DPM, for instance).
What do you want to do about the GROUP and CLINIC rows?
Multiple names still give your parsing code fits (Q-10208).
I have no idea what to do with Q-119042.

You have made it difficult for me to check my work with the way you substituted characters (aes -> zqb).

Either generate the Soundex values after doing the character substitution, or don't do the substitution at all.  Since the Soundex formula gives questionable results on substituted text, I would recommend no substitution.  This isn't patient data, so you wouldn't be violating any HIPAA rules.
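(To illustrate, using the Soundex sketch from earlier; these values are hand-computed, so treat them as approximate:)

' The scramble changes the first letter, so the codes no longer line up:
Debug.Print Soundex("Abad,Zoilo")   ' -> A132 (matches the sample data)
Debug.Print Soundex("qbqd,Zoilo")   ' -> Q123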
Q-119042: it looks like this is a physician group, and those would not stay in the dictionary.

Sorry for making it harder; it was just a lot of data, and I wanted to scramble some of it, but I understand that it isn't a HIPAA issue.
DuplicateQueryEEcombined.xlsx
Can you create a tab-delimited version of this and compress it into a zip file?  It would be so much easier to import into Access.  I'm limited on this PC to Office 2003, which cuts off your xlsx file at 64k rows :-(

Do you have to have the CODE values laid out in a worksheet like you described above or is the real need some way to identify the groups of CODE values to the people who are going to actually clean up the data via consolidation?
The vendor wants us to provide the actual code values that need to be merged together.  That is the format they gave me, but I can check to see if it must be listed that way.
Can you email me?  Is this more complex than the ee.xls file?

=======
edited by aikimark to remove email address
Hi Mark,

Were you able to look at the file that was uploaded?
If you have a suggestion on the clean-up and duplicate identification, I have already created a concatenate function to put it into the layout requested by the vendor.
@Slatefamily

Yes.  I've been doing some work on the file, creating some queries/steps that you should be able to recreate for your data.  I had to put it on hold for the past several days because of billable work and preparation for a user group meeting this coming Tuesday.

I'll keep your concatenation in mind when I get to that point in the clean-up process.
Hi Mark, just checking in to see if you were able to look at this.  (I also wanted to post a message since I got an auto-message saying this question was requesting my attention.)
Hi Mark, checking in to see if you were able to look at this more.

Thanks
I don't know why you received a notification message. I'll have something to discuss tomorrow.
I've requested that this question be deleted for the following reason:

_alias99, Community Support Moderator
I'm actively working on a solution, so I am stopping the delete request.
Hi Mark, I can open a new question.  I requested to delete the last file attached, but the moderator informed me that I would need to delete the whole question.
Hi Mark,

I wanted to check in, I appreciate you helping me.

Thanks,

Carman
Hi Mark,

I wanted to check in again, I appreciate your help

Thanks,

Carman
Hi Mark,

Thanks for the help with getting all the invalid data out.  Checking to see if you were able to move further?

Carman
Hi Mark,

Checking in for an update, thank you for your help.
I've requested that this question be deleted for the following reason:

The question has either no comments or not enough useful information to be called an "answer".
I'm finishing up my work on this problem and am actively working with slatefamily on the solution.
Mark has provided help above and beyond.  I could not speak more highly of him, and I recommend his expertise.  He is exactly the kind of "expert" that makes this website worth the monthly fee!!