Link to home
Start Free TrialLog in
Avatar of tparvaiz
tparvaiz

asked on

string matching - advance level

We have two differnet billing systems.
At both systems our data entry clerks enter customer names with the required data.
The only way to match recods b/w these two system is by customer names (so customer name is primary key).

But due to data mismatch we are unable to link one table with other.


mismatching could be like this

1. microsoft
2. microsoft inc

1. oracle inc
2. oracle inc.

1. chrysler of america
2. chrysler of ameria

is there any way we can get the best results (matching)

one way is to do sub string matching (what could be the best string matching algorithm)


what I am trying to do here is to match records and then selct which one is a correct combination.

as an example customer name from one system : "microsoft usa"
might bring up multiple customer names:
"microsoft usa ."
"microsft is not good"
"microsoft is microsoft"

from here we can select the first choice is the one we are looking for.

if applicable, please help me with a sample code


thanks
Avatar of JMelchhammer
JMelchhammer

Hello tparvaiz,

this problem comes along to me every two or three databases I had to design or maintain, and to be honest: of course there is no simple solution.
Anyway, you can program heuristics dependend on your problem and I can give you some suggestions on that.

The first would be to use the function SOUNDEX. It requires the first letter to be correct and builds a kind of index on that.
  SELECT SOUNDEX('Microsoft') FROM DUAL; will produce M262 as does
  SELECT SOUNDEX('Microsft') FROM DUAL;
This will do as first approach. The problem is that the SOUNDEX - function only goes up to 8 letters and ignores the rest, thus
  SELECT SOUNDEX('Microsoft inc.') FROM DUAL;
will give M262 too.

So I created a function for you which will compare a little bit more sophisticated:

-- Given code the meaning of it:
-- 0:   They are completely equal
-- 1:   They are equal but have other caps
-- 2:   The first letters are equal, one has spaces (like 'Microsoft' and 'Microsoft Inc.'
-- 3:   No spaces in both names, soundex is equal
-- 4:   Spaces in both, Soundex from both words are equal
-- 5:   Spaces in both, only Soundex from the first words are equal
-- 9:   No equals found
-- Function to compare two strings to match the nearest
CREATE OR REPLACE
FUNCTION MATCH(CLIENT1 IN VARCHAR2, CLIENT2 IN VARCHAR2) RETURN NUMBER
  IS
  C1_SPACE_POS  NUMBER(38);
  C2_SPACE_POS  NUMBER(38);
BEGIN
  --    Best match first:
  IF CLIENT1 = CLIENT2 THEN
    RETURN 0;
  END IF;
  --    Match with uppercase
  IF UPPER(CLIENT1) = UPPER(CLIENT2) THEN
    RETURN 1;
  END IF;
 
  C1_SPACE_POS:=INSTR(CLIENT1,' ',1,1);
  C2_SPACE_POS:=INSTR(CLIENT2,' ',1,1);

  -- First Client has no space, second has
  IF C1_SPACE_POS = 0 AND
     C2_SPACE_POS != 0 AND
     CLIENT1 = SUBSTR(CLIENT2,1,C2_SPACE_POS-1) THEN
      RETURN 2;
  END IF;

  -- Second Client has no space, second has
  IF C2_SPACE_POS = 0 AND
     C1_SPACE_POS != 0 AND
     CLIENT2 = SUBSTR(CLIENT1,1,C1_SPACE_POS-1) THEN
      RETURN 2;
  END IF;
 
  IF C1_SPACE_POS = 0 AND
     C2_SPACE_POS = 0 THEN
    IF SOUNDEX(CLIENT1) = SOUNDEX(CLIENT2) THEN
      RETURN 3;
    END IF;
    RETURN 9;
  ELSE
    IF SOUNDEX(SUBSTR(CLIENT1,1,C1_SPACE_POS-1)) = SOUNDEX(SUBSTR(CLIENT2,1,C2_SPACE_POS-1)) AND
       SOUNDEX(SUBSTR(CLIENT1,C1_SPACE_POS+1,LENGTH(CLIENT1))) = SOUNDEX(SUBSTR(CLIENT2,C2_SPACE_POS+1,LENGTH(CLIENT2))) THEN
      RETURN 4;
    ELSE
      RETURN 5;
    END IF;
  END IF;
  RETURN 9;
END;
/

You can yet write the following:

SELECT
    C1.NAME,
    C2.NAME,
    MATCH(C1.NAME,C2.NAME)
FROM
    CLIENTSTHIS C1,
    CLIENTSOTHER C2
WHERE MATCH<=5
ORDER BY MATCH

and will get the names sorted by the StoredFunc.


Hope this'll help a little, of course you will eventually make your own sort-criterias, but this should give the idea

jMelch


ASKER CERTIFIED SOLUTION
Avatar of JMelchhammer
JMelchhammer

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of tparvaiz

ASKER

Hi JMelchhammer.

I think your suggestion is a good one. could you please give me examples of your 9 matches like

1 "microsoft" = "microsoft"
2 "Microsoft" = "microsoft"
...


thanks
Just a bit of further explanation

In one of the table (as explained above) I 'll. be storing all the customer names (from one of the system) and I want to retrieve customer name and customer number (from another system).

How to modify the given code  ---- (Sorry fo such a childish question, I am a beginner)

Thanks
Never mind for questions, tparvaiz!

You could do this in two ways: First, the data is in two databases (oracle), you can connect oracledatabases, so that the one can read from the tables of the other one.
If you got to export/import the data, create a new table where you store the data of database 2 and work with it in database 1, this is to get the data into database 1, but not to alter data in database 2.

I saw an error in my PL/SQL. Here is the correct code ("9" never was reached):

-- Given code the meaning of it:
-- 0:   They are completely equal
-- 1:   They are equal but have other caps
-- 2:   The first letters are equal, one has spaces (like 'Microsoft' and 'Microsoft Inc.'
-- 3:   No spaces in both names, soundex is equal
-- 4:   Spaces in both, Soundex from both words are equal
-- 5:   Spaces in both, only Soundex from the first words are equal
-- 9:   No equals found
-- Function to compare two strings to match the nearest
CREATE OR REPLACE
FUNCTION MATCH(CLIENT1 IN VARCHAR2, CLIENT2 IN VARCHAR2) RETURN NUMBER
  IS
  C1_SPACE_POS  NUMBER(38);
  C2_SPACE_POS  NUMBER(38);
BEGIN
  --    Best match first:
  IF CLIENT1 = CLIENT2 THEN
    RETURN 0;
  END IF;
  --    Match with uppercase
  IF UPPER(CLIENT1) = UPPER(CLIENT2) THEN
    RETURN 1;
  END IF;
 
  C1_SPACE_POS:=INSTR(CLIENT1,' ',1,1);
  C2_SPACE_POS:=INSTR(CLIENT2,' ',1,1);

  -- First Client has no space, second has
  IF C1_SPACE_POS = 0 AND
     C2_SPACE_POS != 0 AND
     CLIENT1 = SUBSTR(CLIENT2,1,C2_SPACE_POS-1) THEN
      RETURN 2;
  END IF;

  -- Second Client has no space, second has
  IF C2_SPACE_POS = 0 AND
     C1_SPACE_POS != 0 AND
     CLIENT2 = SUBSTR(CLIENT1,1,C1_SPACE_POS-1) THEN
      RETURN 2;
  END IF;
 
  IF C1_SPACE_POS = 0 AND
     C2_SPACE_POS = 0 THEN
    IF SOUNDEX(CLIENT1) = SOUNDEX(CLIENT2) THEN
      RETURN 3;
    END IF;
    RETURN 9;
  ELSE
    IF SOUNDEX(SUBSTR(CLIENT1,1,C1_SPACE_POS-1)) = SOUNDEX(SUBSTR(CLIENT2,1,C2_SPACE_POS-1)) AND
       SOUNDEX(SUBSTR(CLIENT1,C1_SPACE_POS+1,LENGTH(CLIENT1))) = SOUNDEX(SUBSTR(CLIENT2,C2_SPACE_POS+1,LENGTH(CLIENT2))) THEN
      RETURN 4;
    ELSE
      IF SOUNDEX(SUBSTR(CLIENT1,1,C1_SPACE_POS-1)) = SOUNDEX(SUBSTR(CLIENT2,1,C2_SPACE_POS-1)) AND
         SOUNDEX(SUBSTR(CLIENT1,C1_SPACE_POS+1,LENGTH(CLIENT1))) != SOUNDEX(SUBSTR(CLIENT2,C2_SPACE_POS+1,LENGTH(CLIENT2))) THEN
        RETURN 5;
      END IF;
    END IF;
  END IF;
  RETURN 9;
END;
/

OK, testing:

I created a table like:
create table cust (c1 varchar2(64), c2 varchar2(64))
/
inserted values like:
insert into cust values ('Microsoft', 'Microsoft')
/
insert into cust values ('Microsoft', 'microsoft')
/
insert into cust values ('Microsoft', 'Microsoft Inc.');
/
insert into cust values ('Microsoft', 'Microsft');
/
insert into cust values ('Microsoft Incorparated', 'Microsoft Incorporated');
/
insert into cust values ('Microsoft Inc.', 'Microsoft Dept');
/
insert into cust values ('Microsoft', 'Is not the same as microsoft');
/
selected like
select c1,c2, match(c1,c2) from cust
/
the result:
C1                       C2                          MATCH(C1,C2)
------------------------ ----------------------------
Microsoft                Microsoft                              0
Microsoft                microsoft                              1
Microsoft                Microsoft Inc.                         2
Microsoft                Microsoft Inc.                         2
Microsoft                Microsft                               3
Microsoft                Microsft                               3
Microsoft Incorparated   Microsoft Incorporated                 4
Microsoft Incorparated   Microsoft Incorporated                 4
Microsoft Inc.           Microsoft Dept                         5
Microsoft Inc.           Microsoft Dept                         5
Microsoft                Is not the same as microsoft           9
Microsoft                Is not the same as microsoft           9

I think, this helps,

good luck

jMelch
HELLO AGAIN,
I JUST FOUND OUT THAT I DONT HAVE SUFFICIENT PRIVILIGES TO CREATE FUNCTIONS.

GETTING THIS ERROR MESSAGE:

CREATE OR REPLACE PROCEDURE MATCH(CLIENT1 IN VARCHAR2, CLIENT2 IN VARCHAR2) RETURN NUMBER
*
ERROR at line 1:
ORA-01031: insufficient privileges


IS THERE ANY OTHER WAY ARROUND
Hello tparvaiz,

you got to get the rights to create functions, anyway. You wont have the chance to make such a sophisticated select-method without PL/SQL.
Never mind. Either you ask your DBA (database-administrator) to create the function for you or you can try (in a test-system) to connect to oracle with user "sys", password "change_on_install" and grant your user DBA rights ("GRANT DBA TO SCOTT" as example for user scott).

That should suffice.

Greetings

jMelch
Hi JMelchhammer,
I've got the required priviliges

but now I am getting this error message

Warning: Function created with compilation errors.
Please ignore my first comments:

----I am having problems in this part of code



                     SELECT
                        C1.CUST_NAME,
                        C2.CUST_NAME,
                        MATCH(C1.CUST_NAME,C2.CUST_NAME)
                     FROM
                       temp1 C2,
                       TEMP2 C1
                     WHERE MATCH<=5;


----------here is the error message

                     WHERE MATCH<=5
                           *
ERROR at line 8:
ORA-06553: PLS-306: wrong number or types of arguments in call to 'MATCH'
Hello tparvaiz,

try it with "WHERE MATCH(C1.CUST_NAME,C2.CUST_NAME)"
this should do. You allways have to pass the two arguments to that function (I didn't because I did not know your columnnames... :-).

Good luck!

jMelch
Sorry again: "WHERE MATCH(C1.CUST_NAME,C2.CUST_NAME)<=5", of course...;-)

jMelch
Hi jMelch,
Your suggestion is working fine for smaller set of records.

But for any larger table, It's running running and returning nothing.

Could you please modify your suggestion so that I should be able to see something.

Thanks again jMelch, if you do have solution for my problem, then I'll open another question.