• Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 563
  • Last Modified:

Oracle SQL - Parsing & Searching Strings

Hi,

I was wondering if there is a way to parse texts (such as business names) in Oracle SQL and search for components. For example, if my search word is '100 General Store Express', I would like to find records that contain all 4 components of my search word - '100', 'General','Store', 'Express' in any order.

So, if my records are :

1. 100 Express General Store
2. 100 General Store Express Way
3. 100 Express General

Then, the search would return only record 1 and 2.

Thank you.
0
iamnamja
Asked:
iamnamja
  • 5
  • 5
  • 4
2 Solutions
 
slightwv (䄆 Netminder) Commented:
Brute force way:

select * from mytable where
instr(mycolumn,'100') > 0 and  instr(mycolumn,'General')>0 and instr(mycolumn,'Store')>0 and  instr(mycolumn,'Express')>0;


Depending on your actual data/requirements Oracle Text is a special index type that can easily do this and make searching large amounts of text pretty efficient.

If you can provide more information on your requirements we can go from there.
0
 
iamnamjaAuthor Commented:
Thank you for your comment.
However, my search word list (Table A) contains about 80,000 names, and my target data (Table B) where I have to perform the serach contains about 15 million names.

Target Data (Table B) may contain a little variation of names from Table A, such as special characters, branch number, etc. I tried the following code:

SELECT  A.*, B.* FROM Table A
INNER JOIN Table B
ON REGEXP_REPLACE(UPPER(A.NAME),'[^A-Z0-9 ]','')  LIKE '%' || UPPER(B.NAME) || '%'

Open in new window


However, this returns too many false hits if B.NAME contains A.NAME in any part of the string. So I want to enhance the fuzzy matching logic. Any help would be greatly appreciated.

Thanks.
0
 
slightwv (䄆 Netminder) Commented:
Is this a one-time type of search or will you be searching 80,000 names against 15 million rows on a regular basis?

How often does the 15 million rows change?

I'm leaning towards Oracle Text but using Text isn't as simple as building a normal index.  It requires some hand holding.  I don't want to over-complicate things if I don't need to.
0
Concerto's Cloud Advisory Services

Want to avoid the missteps to gaining all the benefits of the cloud? Learn more about the different assessment options from our Cloud Advisory team.

 
sdstuberCommented:
From your query,  I assume tableB.name looks like this ...

100 Express General Store
100 General Store Express Way
100 Express General

What does tableA.name look like?
0
 
iamnamjaAuthor Commented:
TableA.name looks like

100 General Store Express
25 Ink, Inc
3566 Food Market
....

Basically, they look same as tableB.name but table A may contain small typos, special characters, store number.. etc.. small variations.

This is a one-time type of search. They are both static data.
0
 
slightwv (䄆 Netminder) Commented:
I'm taking a guess that tableA.name contains '100 General Store Express'

Assuming that is correct, below is a working example of using Oracle Text to return rows 1 & 2.

It uses dynamic SQL.  I couldn't think of a quick way to convert '100 General Store Express' into something Oracle Text could work with.

drop table tab1 purge;
create table tab1(
	name varchar2(100)
);

drop table tab2 purge;
create table tab2(
	id	number,
	name varchar2(100)
);

insert into tab1 values('100 General Store Express');

insert into tab2 values(1,'100 Express General Store');
insert into tab2 values(2,'100 General Store Express Way');
insert into tab2 values(3,'100 Express General ');
commit;

create index name_idx on tab2(name) indextype is ctxsys.context parameters('sync(on commit)');

declare
	mysql varchar2(4000) := 'select id from tab2 where contains(name,:myvar) > 0';
	myresult number;
	mycur sys_refcursor;
begin
	for i in (select name  from tab1) loop
		open mycur for mysql using 'near((' || replace(i.name,' ',',') || '),100,false)';
		fetch mycur into myresult;
		loop
			exit when mycur%notfound;
			dbms_output.put_line('Got: ' || myresult);
			fetch mycur into myresult;
		end loop;
		close mycur;
	end loop;
end;
/

Open in new window

0
 
slightwv (䄆 Netminder) Commented:
>>but table A may contain small typos, special characters

This could cause mine to barf.  You might want to add your original regexp_replace inside of my replace to get rid of them:

replace(regexp_replace(i.name,),'[^A-Z0-9 ]'),' ',',')

If they can have mumtiple spaces, it will barf as well.  I created my own clean_string function that takes raw data and makes it as pretty as I can before I pass it off to the Text query.

If you want to use Text, I can continue helping but don't want to add a lot of stuff if it will never be used.
0
 
sdstuberCommented:
80K rows of regexps against 15M rows of regexps will likely take awhile, but this should do the trick  (worked for me anyway on small scale data)


SELECT bname,aname
  FROM (SELECT b.name bname, a.name aname,
               (    SELECT SET(
                               CAST(
                                   COLLECT(
                                       REGEXP_SUBSTR(
                                           REGEXP_REPLACE(UPPER(a.name), '[^A-Z0-9 ]'),
                                           '[^ ]+',
                                           1,
                                           LEVEL
                                       )
                                   ) AS ora_mining_varchar2_nt
                               )
                           )
                      FROM DUAL
                CONNECT BY LEVEL <=
                               REGEXP_COUNT(REGEXP_REPLACE(UPPER(a.name), '[^A-Z0-9 ]'), '[^ ]+'))
                   awords,
               (    SELECT SET(
                               CAST(
                                   COLLECT(
                                       REGEXP_SUBSTR(
                                           UPPER(b.name),
                                           '[^ ]+',
                                           1,
                                           LEVEL
                                       )
                                   ) AS ora_mining_varchar2_nt
                               )
                           )
                      FROM DUAL
                CONNECT BY LEVEL <= REGEXP_COUNT(b.name, '[^ ]+'))
                   bwords
          FROM tablea a, tableb b)
 WHERE awords MULTISET INTERSECT bwords = awords
0
 
iamnamjaAuthor Commented:
slightwv - I'm getting the following error:
ORA-29902: error in executing ODCIIndexStart() routine
ORA-20000: Oracle Text error:
DRG-50901: text query parser syntax error on line 1, column 14

sdstuber - I'm getting the following error:
ORA-00904: "REGEXP_COUNT": invalid identifier

Would it be because I have a lower version of Oracle?
0
 
slightwv (䄆 Netminder) Commented:
>>Would it be because I have a lower version of Oracle?

What is your version of Oracle (all 4 numbers please, like 10.2.0.4)?

>>slightwv - I'm getting the following error:

When running my example test case or trying to incorporate into what you have?
0
 
sdstuberCommented:
>>> Would it be because I have a lower version of Oracle?

yes

change

CONNECT BY LEVEL <=
                               REGEXP_COUNT(REGEXP_REPLACE(UPPER(a.name), '[^A-Z0-9 ]'), '[^ ]+')

to

CONNECT BY REGEXP_SUBSTR(UPPER(a.name), '[^A-Z0-9 ]', '[^ ]+',1,LEVEL) is not null



and change

 CONNECT BY LEVEL <= REGEXP_COUNT(b.name, '[^ ]+')

to

CONNECT BY  REGEXP_SUBSTR(b.name, '[^ ]+',1,LEVEL) is not null
0
 
sdstuberCommented:
also note,  if you don't have a type called ora_mining_varchar2_nt

you may need to create one


create or replace type ora_mining_varchar2_nt as table of varchar2(4000);


you can rename it if you want, just use the same name in CAST
0
 
iamnamjaAuthor Commented:
sdstuber - I still get this error after changing to what you suggested above:

ORA-01722: invalid number
01722. 00000 -  "invalid number"
*Cause:    
*Action:
0
 
sdstuberCommented:
oh sorry, in my cut-and-paste I dropped an entire function call


change


CONNECT BY LEVEL <=
                               REGEXP_COUNT(REGEXP_REPLACE(UPPER(a.name), '[^A-Z0-9 ]'), '[^ ]+')


to

CONNECT BY REGEXP_SUBSTR(REGEXP_REPLACE(UPPER(a.name), '[^A-Z0-9 ]'), '[^ ]+',1,LEVEL) is not null
0

Featured Post

Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

  • 5
  • 5
  • 4
Tackle projects and never again get stuck behind a technical roadblock.
Join Now