
Dedupe in FoxPro

Any suggestions on de-duping within address-only fields, i.e., address1-5?

EXAMPLE

ATTN: JOHN SMITH | 123 MAIN STREET | SUITE 200
123 MAIN STREET   | SUITE 200 | ATTN: JOHN SMITH

Thank you in advance
Asked by dc070994
3 Solutions
 
Kalpesh Chhatrala (Software Consultant) commented:
Hi,

Please read the article below regarding de-duplication.

http://portal.dfpug.de/dfpug/Dokumente/FoxTalk/FoxTalk1997/FT19975_1.PDF
 
aikimark commented:
@dc070994

Are you looking to de-dupe only within a single record?

I might suggest coding a Levenshtein distance calculator (or similar) to identify potential duplicate items.

Exact matches should be easy to detect. The difficulty lies in partial and fuzzy matches, and in matches that involve spelling errors.

You might want to consider some massaging of the field data to help whichever algorithm you choose (see the sketch after this list):
* Remove duplicate space characters
* Remove typos
* Upper-case the characters, allowing you to do a binary comparison, which is much faster than a text (case-insensitive) comparison.
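
As a minimal sketch of that massaging plus a simple distance check: this is not aikimark's code, the function names are illustrative, and the distance routine is the textbook dynamic-programming Levenshtein.

*-- Sketch only: massage two field values as suggested above, then
*-- score their similarity with a Levenshtein distance.
FUNCTION FieldDistance(tcA, tcB)
   RETURN Levenshtein(Massage(tcA), Massage(tcB))
ENDFUNC

FUNCTION Massage(tcValue)
   LOCAL lcValue
   lcValue = UPPER(ALLTRIM(tcValue))    && upper-case for binary comparison
   DO WHILE "  " $ lcValue              && collapse runs of spaces
      lcValue = STRTRAN(lcValue, "  ", " ")
   ENDDO
   RETURN lcValue
ENDFUNC

FUNCTION Levenshtein(tcA, tcB)
   LOCAL lnI, lnJ, lnCost, lnLenA, lnLenB
   lnLenA = LEN(tcA)
   lnLenB = LEN(tcB)
   LOCAL laD[1, 1]
   DIMENSION laD[lnLenA + 1, lnLenB + 1]
   FOR lnI = 0 TO lnLenA
      laD[lnI + 1, 1] = lnI             && cost of deleting lnI characters
   ENDFOR
   FOR lnJ = 0 TO lnLenB
      laD[1, lnJ + 1] = lnJ             && cost of inserting lnJ characters
   ENDFOR
   FOR lnI = 1 TO lnLenA
      FOR lnJ = 1 TO lnLenB
         lnCost = IIF(SUBSTR(tcA, lnI, 1) == SUBSTR(tcB, lnJ, 1), 0, 1)
         laD[lnI + 1, lnJ + 1] = MIN(laD[lnI, lnJ + 1] + 1, ;   && deletion
                                     laD[lnI + 1, lnJ] + 1, ;   && insertion
                                     laD[lnI, lnJ] + lnCost)    && substitution
      ENDFOR
   ENDFOR
   RETURN laD[lnLenA + 1, lnLenB + 1]
ENDFUNC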

Back in April, I answered a similar matching question in the Excel zone (http:Q_28406055.html ). You might employ a solution where you start comparing your fields with a regular-expression pattern as long as the shorter of the two fields, reducing the number of contiguous characters in the pattern until you reach a match or hit some 'reasonable' limit on similarity between the two fields (50%-80% of the field/string length).
Note: in that question, the author had asked for a 3-5 character match, but found that he had to increase it to 8-10 characters to reduce the number of matches found.
 
dc070994 (Author) commented:
Hi aikimark, thank you for your quick response.

This is for all parties in our database; we want to minimize over-serving notices to parties in order to cut their cost. The examples below show the issue: the records are duplicates, but the MD5 hash won't identify them as such because the data in the address fields differs in field location,
i.e., one record has address1 = A, address2 = B, address3 = C and the other has address1 = C, address2 = A, address3 = B.

Below is an example of the data in our system. I currently use the MD5 hash function to create hash keys for the name + address information; it does the massaging of the data and works well in identifying dups.

However, the example below differs: the records are duplicates, because the values in the rows are the same, but the address data resides in different address lines.

I want the rows below to be marked as duplicates:

ADDRESS1           ADDRESS2          ADDRESS3
ATTN: JOHN SMITH   123 MAIN STREET   SUITE 200
123 MAIN STREET    SUITE 200         ATTN: JOHN SMITH
123 MAIN STREET    SUITE 200         ATTN: JOHN SMITH
ATTN: JOHN SMITH   123 MAIN STREET   SUITE 200

Thank you in advance

 
aikimark commented:
That seems simple enough.  You have two formats for your addresses:
1. Name (addressee), Addr1, Addr2
2. Addr1, Addr2, Name

Calculate the hash of the concatenated (with delimiter) versions of both formats of these fields, along with the record/row key.  Then join this hash table to itself, looking for matches.
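
A minimal FoxPro sketch of that idea (not aikimark's actual code): it assumes a table named Addresses with an integer ID key and fields Address1-Address3, and it uses SYS(2007)'s CRC32 where the question uses MD5. The cursor and delimiter choices are illustrative.

*-- Sketch only: hash each record under both field orderings,
*-- then join the hash cursor to itself to surface candidates.
CREATE CURSOR curHashes (ID I, HashOf C(10))
SELECT Addresses
SCAN ALL
   *-- format 1: the record's own ordering (Name, Addr1, Addr2)
   INSERT INTO curHashes VALUES (Addresses.ID, SYS(2007, ;
      ALLTRIM(Addresses.Address1) + "^" + ALLTRIM(Addresses.Address2) + "^" + ALLTRIM(Addresses.Address3), 0, 1))
   *-- format 2: rotated ordering (Addr1, Addr2, Name)
   INSERT INTO curHashes VALUES (Addresses.ID, SYS(2007, ;
      ALLTRIM(Addresses.Address2) + "^" + ALLTRIM(Addresses.Address3) + "^" + ALLTRIM(Addresses.Address1), 0, 1))
ENDSCAN

*-- any two records sharing a hash under either format are candidate dupes
SELECT DISTINCT H1.ID AS ID1, H2.ID AS ID2 ;
   FROM curHashes H1 ;
   INNER JOIN curHashes H2 ON H1.HashOf == H2.HashOf AND H1.ID < H2.ID ;
   INTO CURSOR curDupes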
 
GEOFS commented:
It seems to me that you'll have to compare the various combinations of fields that might contain the 'same' value, e.g., ADDRESS1 with ADDRESS2, ADDRESS1 with ADDRESS3, and ADDRESS2 with ADDRESS3.
 
aikimark commented:
With each additional (possible) format, you would add a hash per record before looking for matches. For instance, if one of the formats has the name without the prepended "ATTN: " string, you would create hashes for that variant as well.
If your table contained these entries:
ID     ADDRESS1           ADDRESS2          ADDRESS3
1      ATTN: JOHN SMITH   123 MAIN STREET   SUITE 200
303    123 MAIN STREET    SUITE 200         ATTN: JOHN SMITH
442    123 MAIN STREET    SUITE 200         JOHN SMITH
2525   JOHN SMITH         123 MAIN STREET   SUITE 200


Your hash table would contain:
ID	HashOf
1	ATTN: JOHN SMITH^123 MAIN STREET^SUITE 200
1	JOHN SMITH^123 MAIN STREET^SUITE 200
1	123 MAIN STREET^SUITE 200^ATTN: JOHN SMITH
1	123 MAIN STREET^SUITE 200^JOHN SMITH
303	123 MAIN STREET^SUITE 200^ATTN: JOHN SMITH
303	123 MAIN STREET^SUITE 200^JOHN SMITH
303	ATTN: JOHN SMITH^123 MAIN STREET^SUITE 200
303	JOHN SMITH^123 MAIN STREET^SUITE 200
442	123 MAIN STREET^SUITE 200^JOHN SMITH
442	123 MAIN STREET^SUITE 200^ATTN: JOHN SMITH
442	JOHN SMITH^123 MAIN STREET^SUITE 200
442	ATTN: JOHN SMITH^123 MAIN STREET^SUITE 200
2525	JOHN SMITH^123 MAIN STREET^SUITE 200
2525	ATTN: JOHN SMITH^123 MAIN STREET^SUITE 200
2525	123 MAIN STREET^SUITE 200^JOHN SMITH
2525	123 MAIN STREET^SUITE 200^ATTN: JOHN SMITH


Note: I am using the caret character as my delimiter.

In the above example, each record results in four hashes: the two field/column orderings combined with the two name formats (with and without "ATTN: ").
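
As a sketch (not aikimark's actual code), building those four variants for one record might look like this, reusing the curHashes cursor from the earlier sketch and assuming the name happens to sit in Address1 with "ATTN: " as the only prefix to strip:

*-- Sketch only: the four hashed strings for one record, "^"-delimited.
lcName = ALLTRIM(Addresses.Address1)              && e.g. "ATTN: JOHN SMITH"
lcBare = ALLTRIM(STRTRAN(lcName, "ATTN: ", ""))   && e.g. "JOHN SMITH"
lcRest = ALLTRIM(Addresses.Address2) + "^" + ALLTRIM(Addresses.Address3)
INSERT INTO curHashes VALUES (Addresses.ID, SYS(2007, lcName + "^" + lcRest, 0, 1))
INSERT INTO curHashes VALUES (Addresses.ID, SYS(2007, lcBare + "^" + lcRest, 0, 1))
INSERT INTO curHashes VALUES (Addresses.ID, SYS(2007, lcRest + "^" + lcName, 0, 1))
INSERT INTO curHashes VALUES (Addresses.ID, SYS(2007, lcRest + "^" + lcBare, 0, 1))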
 
aikimark commented:

Soapbox time:

You are merely scratching the surface of a deduplication effort in your database. You should probably normalize the name data, splitting the strings into a possible (title, first, middle, last, suffix) tuple, as sketched below.
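
A naive illustration of that split, using VFP 9's word functions (real name parsing needs lookup tables for titles and suffixes; the sample name is made up):

*-- Sketch only: split an addressee string into first/middle/last.
lcName   = "JOHN Q SMITH"
lnWords  = GETWORDCOUNT(lcName)
lcFirst  = GETWORDNUM(lcName, 1)
lcLast   = GETWORDNUM(lcName, lnWords)
lcMiddle = IIF(lnWords > 2, GETWORDNUM(lcName, 2), "")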

Look for multiple records that point to the same address, regardless of name. If nothing else, you might be able to combine mailings to those addresses. Thirty-plus years ago, I saved my employer a lot of money on mailing by splitting multi-page notices into separate files, which allowed them to send all the two- (three-, four-, etc.) page notices to the same address in a single envelope.

You should validate your addresses. There are third-party services, as well as the USPS, that can validate an address and give you ZIP+4 data.
 
pcelba commented:
You may sort the columns and then calculate a CRC to identify the records having duplicates:
CREATE TABLE TestAddr (address1 char(50), address2 char(50), address3 char(50), address4 char(50), address5 char(50), crc char(10), hasDuplic logical)
INSERT INTO TestAddr (address1, address2, address3, address4, address5) VALUES ;
  ('ATTN: JOHN SMITH', '123 MAIN STREET', 'SUITE 200', '', '')
INSERT INTO TestAddr (address1, address2, address3, address4, address5) VALUES ;
  ('123 MAIN STREET', 'SUITE 200', 'ATTN: JOHN SMITH', '', '')
INSERT INTO TestAddr (address1, address2, address3, address4, address5) VALUES ;
  ('123 MAIN STREET', 'SUITE 200', 'ATTN: JOHN SMITH', '', '')
INSERT INTO TestAddr (address1, address2, address3, address4, address5) VALUES ;
  ('ATTN: JOHN SMITH', '123 MAIN STREET', 'SUITE 200', '', '')
INSERT INTO TestAddr (address1, address2, address3, address4, address5) VALUES ;
  ('ATTN: JOHN SMITH SR.', '123 MAIN STREET', 'SUITE 201', '', '')

LOCAL laRow[5]

*-- Calculate CRC
SELECT TestAddr
SCAN ALL
  COPY NEXT 1 FIELDS LIKE address? TO ARRAY laRow  && You may also list fields separated by comma here
  *-- Strip punctuation and spaces
  FOR lnI = 1 TO ALEN(laRow, 1)
    laRow[m.lnI] = UPPER(CHRTRAN(laRow[m.lnI], [ ,.;<>"'!@#$%^&*()+-_=/*], []))
  NEXT
  *-- Sort columns
  = ASORT(laRow)
  *-- Concatenate columns
  lcStr = ''
  FOR lnI = 1 TO ALEN(laRow, 1)
    lcStr = m.lcStr + laRow[m.lnI]
  NEXT
  *-- Store CRC
  REPLACE crc WITH SYS(2007, m.lcStr, 0, 1)
ENDSCAN

*-- And now you may identify the duplicates
SELECT crc, COUNT(*) FROM TestAddr GROUP BY 1 HAVING COUNT(*) > 1 ORDER BY 2 DESC INTO CURSOR cDupl
INDEX ON crc TAG crc

*-- ... and mark original records by flag
SELECT TestAddr
REPLACE ALL hasDuplic WITH SEEK(crc, 'cDupl')


Removing the duplicate records is then very easy...
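
For instance, a minimal sketch of that removal (mine, not part of pcelba's code above): order the table by crc, keep the first record of each group, and delete the rest.

*-- Sketch only: keep one record per crc group, remove the others.
SELECT TestAddr
INDEX ON crc TAG xcrc      && scan in crc order so groups are contiguous
lcPrevCrc = ""
SCAN ALL FOR hasDuplic
   IF crc == m.lcPrevCrc
      DELETE               && a later member of the same crc group
   ENDIF
   lcPrevCrc = crc
ENDSCAN
PACK                       && physically remove the records marked DELETED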
 
Olaf Doschke (Software Developer) commented:
The most important thing to do is to define a stricter schema for addresses and move them into it, to make them more comparable. You then still have to take care of typos, etc.

As the order of the fields can be anything, one way to force two permuted addresses to compare as equal is to sort the fields before, or even instead of, computing a hash. A hash alone doesn't make dupes easier to find; it just makes matching a shorter value easier. And hash functions have one behavior that makes them bad for matching similar but not exactly identical data: two slightly different values, e.g. with a typo or two switched letters, result in totally different hash values.
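
A quick illustration of that behavior (the exact CRC values will vary):

*-- Two near-identical strings yield unrelated CRC32 values:
? SYS(2007, "123 MAIN STREET", 0, 1)
? SYS(2007, "123 MIAN STREET", 0, 1)   && one transposed pair, completely different CRC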

Sorting Address1 through Address5 within a single record lets all permutations of the same lines end up as the same sorted address, no matter whether the result is a usable address or not; e.g., it would then mostly start with the street line instead of the name.

There is still a risk that two otherwise identical addresses differ by an additional line, e.g. the ATTN line, so you wouldn't find those dupes; but removing such info is something you can add later.

So how about this approach:
*sample data
CREATE CURSOR Addresses (ID I autoinc, Address1 C(254), Address2 C(254), Address3 C(254), Address4 C(254), Address5 C(254))
INSERT INTO Addresses (Address1, Address2, Address3, Address4, Address5) ;
VALUES ("ATTN: JOHN SMITH", "123 MAIN STREET", "", "","SUITE 200")
INSERT INTO Addresses (Address1, Address2, Address3, Address4, Address5) ;
VALUES ("","123 MAIN STREET", "SUITE 200", "", "ATTN: JOHN SMITH")

* Aggregate address lines into a single memo
Select ID, CAST(Address1+chr(13)+chr(10)+Address2+chr(13)+chr(10)+Address3+chr(13)+chr(10)+Address4+chr(13)+chr(10)+Address5 as M) as mAddress,;
 Space(11) as crc32 From Addresses Into Cursor curAddresses ReadWrite

* Sort the memo lines and calculate a checksum
Scan
   lcAddress = ""
   ALINES(laAddresslines, mAddress, 1+2)   && flags 1+2: trim each line and omit empty lines
   ASORT(laAddresslines)
   For Each lcLine in laAddresslines
       lcAddress = lcAddress+Upper(lcLine)+Chr(13)+Chr(10)
   Endfor 
   REPLACE mAddress With CHRTRAN(lcAddress,".,-!?+- AEIOU","")
   REPLACE crc32 WITH Sys(2007, mAddress,0,3)
EndScan
Index On crc32 TAG xcrc32

*match by checksum
Select A1.*, A2.* From curAddresses A1;
   Inner Join curAddresses A2 ;
   On  A1.crc32  = A2.crc32 And A1.ID < A2.ID



Indeed, you could now also compare the mAddress values themselves instead of a checksum. And the Levenshtein distance is still an idea; you might also remove unimportant characters to make the addresses more comparable, e.g. all commas, dots, spaces, etc., even vowels. That's what I do with CHRTRAN(lcAddress,".,-!?+- AEIOU",""): I remove the vowels, as they are less significant parts of words, and removing them can eliminate some typos, too. You might then also find some more dupes. So remember, this kind of address processing is there to find dupes, not to normalize addresses.

And that lets me conclude: I'm with aikimark; the best thing to do is to put the addresses into a schema that makes them more comparable. Everything else is just trying to cope with your bad data schema.

Bye, Olaf.
 
Olaf Doschke (Software Developer) commented:
Believe it or not, I didn't read Pavel's answer before writing mine, and I still took quite the same approach.

Bye, Olaf.
 
dc070994 (Author) commented:
Thank you all for your assistance; it was very helpful in resolving my issue.
