Solved

Oracle SQL - Parsing & Searching Strings

Posted on 2014-01-15
14
533 Views
Last Modified: 2014-01-27
Hi,

I was wondering if there is a way to parse texts (such as business names) in Oracle SQL and search for components. For example, if my search word is '100 General Store Express', I would like to find records that contain all 4 components of my search word - '100', 'General','Store', 'Express' in any order.

So, if my records are :

1. 100 Express General Store
2. 100 General Store Express Way
3. 100 Express General

Then, the search would return only record 1 and 2.

Thank you.
0
Comment
Question by:iamnamja
  • 5
  • 5
  • 4
14 Comments
 
LVL 77

Expert Comment

by:slightwv (䄆 Netminder)
ID: 39783220
Brute force way:

select * from mytable where
instr(mycolumn,'100') > 0 and  instr(mycolumn,'General')>0 and instr(mycolumn,'Store')>0 and  instr(mycolumn,'Express')>0;


Depending on your actual data/requirements Oracle Text is a special index type that can easily do this and make searching large amounts of text pretty efficient.

If you can provide more information on your requirements we can go from there.
0
 

Author Comment

by:iamnamja
ID: 39783416
Thank you for your comment.
However, my search word list (Table A) contains about 80,000 names, and my target data (Table B) where I have to perform the serach contains about 15 million names.

Target Data (Table B) may contain a little variation of names from Table A, such as special characters, branch number, etc. I tried the following code:

SELECT  A.*, B.* FROM Table A
INNER JOIN Table B
ON REGEXP_REPLACE(UPPER(A.NAME),'[^A-Z0-9 ]','')  LIKE '%' || UPPER(B.NAME) || '%'

Open in new window


However, this returns too many false hits if B.NAME contains A.NAME in any part of the string. So I want to enhance the fuzzy matching logic. Any help would be greatly appreciated.

Thanks.
0
 
LVL 77

Expert Comment

by:slightwv (䄆 Netminder)
ID: 39783434
Is this a one-time type of search or will you be searching 80,000 names against 15 million rows on a regular basis?

How often does the 15 million rows change?

I'm leaning towards Oracle Text but using Text isn't as simple as building a normal index.  It requires some hand holding.  I don't want to over-complicate things if I don't need to.
0
Instantly Create Instructional Tutorials

Contextual Guidance at the moment of need helps your employees adopt to new software or processes instantly. Boost knowledge retention and employee engagement step-by-step with one easy solution.

 
LVL 74

Expert Comment

by:sdstuber
ID: 39783442
From your query,  I assume tableB.name looks like this ...

100 Express General Store
100 General Store Express Way
100 Express General

What does tableA.name look like?
0
 

Author Comment

by:iamnamja
ID: 39783501
TableA.name looks like

100 General Store Express
25 Ink, Inc
3566 Food Market
....

Basically, they look same as tableB.name but table A may contain small typos, special characters, store number.. etc.. small variations.

This is a one-time type of search. They are both static data.
0
 
LVL 77

Assisted Solution

by:slightwv (䄆 Netminder)
slightwv (䄆 Netminder) earned 200 total points
ID: 39783503
I'm taking a guess that tableA.name contains '100 General Store Express'

Assuming that is correct, below is a working example of using Oracle Text to return rows 1 & 2.

It uses dynamic SQL.  I couldn't think of a quick way to convert '100 General Store Express' into something Oracle Text could work with.

drop table tab1 purge;
create table tab1(
	name varchar2(100)
);

drop table tab2 purge;
create table tab2(
	id	number,
	name varchar2(100)
);

insert into tab1 values('100 General Store Express');

insert into tab2 values(1,'100 Express General Store');
insert into tab2 values(2,'100 General Store Express Way');
insert into tab2 values(3,'100 Express General ');
commit;

create index name_idx on tab2(name) indextype is ctxsys.context parameters('sync(on commit)');

declare
	mysql varchar2(4000) := 'select id from tab2 where contains(name,:myvar) > 0';
	myresult number;
	mycur sys_refcursor;
begin
	for i in (select name  from tab1) loop
		open mycur for mysql using 'near((' || replace(i.name,' ',',') || '),100,false)';
		fetch mycur into myresult;
		loop
			exit when mycur%notfound;
			dbms_output.put_line('Got: ' || myresult);
			fetch mycur into myresult;
		end loop;
		close mycur;
	end loop;
end;
/

Open in new window

0
 
LVL 77

Expert Comment

by:slightwv (䄆 Netminder)
ID: 39783536
>>but table A may contain small typos, special characters

This could cause mine to barf.  You might want to add your original regexp_replace inside of my replace to get rid of them:

replace(regexp_replace(i.name,),'[^A-Z0-9 ]'),' ',',')

If they can have mumtiple spaces, it will barf as well.  I created my own clean_string function that takes raw data and makes it as pretty as I can before I pass it off to the Text query.

If you want to use Text, I can continue helping but don't want to add a lot of stuff if it will never be used.
0
 
LVL 74

Expert Comment

by:sdstuber
ID: 39783542
80K rows of regexps against 15M rows of regexps will likely take awhile, but this should do the trick  (worked for me anyway on small scale data)


SELECT bname,aname
  FROM (SELECT b.name bname, a.name aname,
               (    SELECT SET(
                               CAST(
                                   COLLECT(
                                       REGEXP_SUBSTR(
                                           REGEXP_REPLACE(UPPER(a.name), '[^A-Z0-9 ]'),
                                           '[^ ]+',
                                           1,
                                           LEVEL
                                       )
                                   ) AS ora_mining_varchar2_nt
                               )
                           )
                      FROM DUAL
                CONNECT BY LEVEL <=
                               REGEXP_COUNT(REGEXP_REPLACE(UPPER(a.name), '[^A-Z0-9 ]'), '[^ ]+'))
                   awords,
               (    SELECT SET(
                               CAST(
                                   COLLECT(
                                       REGEXP_SUBSTR(
                                           UPPER(b.name),
                                           '[^ ]+',
                                           1,
                                           LEVEL
                                       )
                                   ) AS ora_mining_varchar2_nt
                               )
                           )
                      FROM DUAL
                CONNECT BY LEVEL <= REGEXP_COUNT(b.name, '[^ ]+'))
                   bwords
          FROM tablea a, tableb b)
 WHERE awords MULTISET INTERSECT bwords = awords
0
 

Author Comment

by:iamnamja
ID: 39783697
slightwv - I'm getting the following error:
ORA-29902: error in executing ODCIIndexStart() routine
ORA-20000: Oracle Text error:
DRG-50901: text query parser syntax error on line 1, column 14

sdstuber - I'm getting the following error:
ORA-00904: "REGEXP_COUNT": invalid identifier

Would it be because I have a lower version of Oracle?
0
 
LVL 77

Expert Comment

by:slightwv (䄆 Netminder)
ID: 39783721
>>Would it be because I have a lower version of Oracle?

What is your version of Oracle (all 4 numbers please, like 10.2.0.4)?

>>slightwv - I'm getting the following error:

When running my example test case or trying to incorporate into what you have?
0
 
LVL 74

Expert Comment

by:sdstuber
ID: 39783724
>>> Would it be because I have a lower version of Oracle?

yes

change

CONNECT BY LEVEL <=
                               REGEXP_COUNT(REGEXP_REPLACE(UPPER(a.name), '[^A-Z0-9 ]'), '[^ ]+')

to

CONNECT BY REGEXP_SUBSTR(UPPER(a.name), '[^A-Z0-9 ]', '[^ ]+',1,LEVEL) is not null



and change

 CONNECT BY LEVEL <= REGEXP_COUNT(b.name, '[^ ]+')

to

CONNECT BY  REGEXP_SUBSTR(b.name, '[^ ]+',1,LEVEL) is not null
0
 
LVL 74

Expert Comment

by:sdstuber
ID: 39783729
also note,  if you don't have a type called ora_mining_varchar2_nt

you may need to create one


create or replace type ora_mining_varchar2_nt as table of varchar2(4000);


you can rename it if you want, just use the same name in CAST
0
 

Author Comment

by:iamnamja
ID: 39783799
sdstuber - I still get this error after changing to what you suggested above:

ORA-01722: invalid number
01722. 00000 -  "invalid number"
*Cause:    
*Action:
0
 
LVL 74

Accepted Solution

by:
sdstuber earned 300 total points
ID: 39783810
oh sorry, in my cut-and-paste I dropped an entire function call


change


CONNECT BY LEVEL <=
                               REGEXP_COUNT(REGEXP_REPLACE(UPPER(a.name), '[^A-Z0-9 ]'), '[^ ]+')


to

CONNECT BY REGEXP_SUBSTR(REGEXP_REPLACE(UPPER(a.name), '[^A-Z0-9 ]'), '[^ ]+',1,LEVEL) is not null
0

Featured Post

Free Tool: SSL Checker

Scans your site and returns information about your SSL implementation and certificate. Helpful for debugging and validating your SSL configuration.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

I remember the day when someone asked me to create a user for an application developement. The user should be able to create views and materialized views and, so, I used the following syntax: (CODE) This way, I guessed, I would ensure that use…
Checking the Alert Log in AWS RDS Oracle can be a pain through their user interface.  I made a script to download the Alert Log, look for errors, and email me the trace files.  In this article I'll describe what I did and share my script.
This video explains at a high level about the four available data types in Oracle and how dates can be manipulated by the user to get data into and out of the database.
Via a live example, show how to take different types of Oracle backups using RMAN.

726 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question