Julie Kurpa
asked on
Finding special characters for Printjoin in Oracle (DRG-51030)
Hello,
I have recently learned about Oracle Text and have implemented it to help a developer with his wildcard queries.
It's working very well and the queries are super fast.
However, it seems there are failures for some of the queries. The developer can see the errors in the apache logs. One user was helpful enough to call and tell us exactly what they were searching for and now we can produce the error ourselves.
Here is the full error:
select parid, owner, addrfull from schema.search_rp where contains(TEXT_MULTI_COL_IDX, '%2 MYSTREET%') > 0 offset 0 rows fetch
next 25 rows only
*
ERROR at line 1:
ORA-29902: error in executing ODCIIndexStart() routine
ORA-20000: Oracle Text error:
DRG-51030: wildcard query expansion resulted in too many terms
I see that one solution is to increase the "WILDCARD_MAXTERMS" from the default which is 20000.
Per an Oracle document, I increased it to 50000 which was bad and caused all our server resources to max out. So I put it back to 20000.
So I'm thinking that perhaps I need to see whether I should modify the Printjoin to identify more special characters. Perhaps the data has some characters that are causing it to split the data too much.
Here is my printjoin which should not split on @\&.-,()#"_'
begin
ctx_ddl.create_preference('WEBDATA_LEX', 'BASIC_LEXER');
ctx_ddl.set_attribute('WEBDATA_LEX','printjoins','@\&.-,()#"_''');
end;
/
I'm not confident it's working properly and not confident that I've identified all the non-alphabetic and non-numeric values.
Perhaps this is not even my issue and it's simply the spaces in the data (which I need).
Can someone assist me in 2 ways:
1. Query all columns of the table for anything that is not a space, alphabetic, or numeric character?
2. Make sure my printjoin is correct to account for all those special characters?
I realize it's possible there may be "unprintable" characters in the data that may need to be fixed. Maybe a user pasted something into a field. If I could identify those also, it would be great!
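As an illustration of question 1, here is a minimal client-side sketch in Python over values pulled from the table (all names here are illustrative; a server-side equivalent could likely use REGEXP_LIKE with a pattern such as '[^ A-Za-z0-9]'):

```python
import string

# Characters the search data is allowed to contain: space, letters, digits.
ALLOWED = set(string.ascii_letters + string.digits + " ")

def odd_characters(value):
    """Return (special, unprintable): printable characters outside the
    allowed set, and unprintable ones (tabs, control characters)."""
    extras = {c for c in value if c not in ALLOWED}
    unprintable = {c for c in extras if not c.isprintable()}
    return extras - unprintable, unprintable

# Hypothetical sample value, as if fetched from search_rp:
special, ctrl = odd_characters("617 S MOVIE DR\t#2")
print(sorted(special))  # → ['#']
print(sorted(ctrl))     # → ['\t']
```

Running this over every column of each row would surface both the special characters to consider for printjoins and any pasted-in unprintable characters.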
To get an idea of what you asked Text to return, you can count the tokens that match:
Where tab1_text_idx is your index name.
select count(*)
from dr$tab1_text_idx$i
where token_text like '%2';
ASKER
On your question:
Can you confirm that when searching '2 mystreet', you actually want '1234567892 mystreetThatHasAReallyLongName'?
Yes, the developer wants all records that contain "2 mystreet" anywhere in the data to be in the result set.
Seems like the developer may need to adjust their thinking. ;)
How many tokens come back from the DR$ query I posted above?
You might need to max out WILDCARD_MAXTERMS and just have to deal with it.
I can tell you from personal experience that street addresses are pretty much impossible to get accurately in searches. I spent YEARS trying to get it as accurate as possible.
For example text comes in as:
1 Main Street
1 Main Str
1 Main St
And that is just "Street"; then you have Avenue, Lane, Drive, Circle, ...
and things like: North, South, East, West, NorthEast, NE, N.E. ...
ASKER
I'm trying to do that command to count the tokens and it's telling me the index doesn't exist. But it does.
This is what I used to create the index:
create index myschema.search_rp_text_idx on webdata.search_rp(TEXT_MULTI_COL_IDX) indextype is ctxsys.context
parameters('WordList WEBDATA_PREF DataStore WEBDATA_RP_REFERRAL_PREF lexer WEBDATA_LEX section group WEBDATA_RP_SECTION_GROUP sync(on commit)');
Here is the index in dba_indexes:
SQL> select OWNER,INDEX_NAME from dba_indexes where table_name='SEARCH_RP';
OWNER INDEX_NAME
----------------- ------------------------------------------
MYSCHEMA SEARCH_RP_TEXT_IDX
SQL> select count(*)
2 from myschema.search_rp_text_idx
3 where token_text like '%2 MYSTREET%';
from myschema.search_rp_text_idx
*
ERROR at line 2:
ORA-00942: table or view does not exist
The dr$ and $i are important... I said to just replace the index name.
I also didn't post it to match the entire string. The $i table holds the individual indexed tokens, not 'phrases'.
select count(*)
from dr$search_rp_text_idx$i
where token_text like '%2';
If you want to see both:
select count(*)
from dr$search_rp_text_idx$i
where token_text like '%2'
or token_text like 'mystreet%';
Just keep thinking "words" (called Tokens) not "strings" when dealing with Text.
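To make the token idea concrete, here is a toy sketch (Python, with made-up tokens) of why '%2' can expand into so many terms while '%22' does not:

```python
# Tokens are the indexed "words"; a wildcard like '%2' expands into
# every token ending in 2 before the query runs.
tokens = ["617", "22", "2", "102", "MOVIE", "MYSTREET", "HOLLY"]

def expand(pattern, tokens):
    """Mimic leading-wildcard expansion: '%2' -> all tokens ending in '2'."""
    suffix = pattern.lstrip("%")
    return [t for t in tokens if t.endswith(suffix)]

print(expand("%2", tokens))   # → ['22', '2', '102']
print(expand("%22", tokens))  # → ['22'] -- one more character, far fewer terms
```

With tens of thousands of numeric tokens in the real index, '%2' can easily blow past WILDCARD_MAXTERMS.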
ASKER
oh duh.
Here's the output:
SQL> select count(*)
2 from myschema.dr$search_rp_text_idx$i
3 where token_text like '%2';
COUNT(*)
----------
37011
I suppose this means I would need to increase the WILDCARD_MAXTERMS to at least 37011?
Yesterday I increased it to 50000 and within a few minutes, the database server was so stressed that I couldn't even RDP to it. The website couldn't connect to the database (no listener). And that wasn't even during peak usage times.
>>I suppose this means I would need to increase the WILDCARD_MAXTERMS to at least 37011 ?
Well, to get '%2', it looks that way. BUT what about other single-character wildcard possibilities, like 'a%' or '%a'?
>>, the database server was so stressed that I couldn't even RDP to it.
So, you set maxterms and attempted to recreate the index? Not sure I remember it ever crippling my database server, but it will take extra resources...
It's been a few years since I had to dig around the inner workings of Text.
You might experiment with removing SUBSTRING_INDEX from the preference. It helps speed up wildcard queries but, going from memory, generates MANY more tokens.
ASKER
It seems like the single number is my problem, because if I exclude it and only use letters, or I add another number to it, the query runs fine.
select parid, owner, addrfull from myschema.search_rp where contains(TEXT_MULTI_COL_IDX, '%2 MYSTREET%') > 0 offset 0 rows fetch next 25 rows only;
*
ERROR at line 1:
ORA-29902: error in executing ODCIIndexStart() routine
ORA-20000: Oracle Text error:
DRG-51030: wildcard query expansion resulted in too many terms
Yet these queries run fine:
select parid, owner, addrfull from myschema.search_rp where contains(TEXT_MULTI_COL_IDX, '%22 MYSTREET%') > 0 offset 0 rows fetch next 25 rows only;
select parid, owner, addrfull from myschema.search_rp where contains(TEXT_MULTI_COL_IDX, '%MYSTREET%') > 0 offset 0 rows fetch next 25 rows only;
select parid, owner, addrfull from myschema.search_rp where contains(TEXT_MULTI_COL_IDX, '%QUIET MYSTREET%') > 0 offset 0 rows fetch next 25 rows only;
ASKER
Yesterday, I set the MAXTERMS and recreated the indexes just fine (in fact, I recreated all the preferences, including the basic_lexer).
It was when the public started using the website that, within a few minutes, we were maxed out on resources. Lowering the value back to 20000 and rebooting the server returned everything to normal (well... minus the errors for some of the queries).
Remember, think "words" (tokens).
It's all about the wildcards applied to the individual tokens.
It also isn't the specific number of characters with the wildcard:
Bet this runs: contains(TEXT_MULTI_COL_IDX, '%Z')
Single character after the wildcard, but I'm betting very few words END with 'Z'.
The more and different characters you have with the wildcard, the fewer tokens that will match.
How many words in the Bible start with: 'T' vs how many that start with 'TX'? One character difference but a huge impact in the words returned.
ASKER
Yeah I'm having trouble with this concept of tokens. I appreciate your patience and time.
Based on the information below, can you help me visualize what the Text Index would look like?
Here is the table the developer created for use with his wildcard searches.
SQL> desc myschema.search_rp
Name Null? Type
----------------------------------------- -------- -----------------
PARID VARCHAR2(30)
ALTID VARCHAR2(30)
LUC VARCHAR2(4)
OWNER VARCHAR2(205)
ADDRFULL VARCHAR2(184)
ADDRNODIR VARCHAR2(151)
ADDRDIR VARCHAR2(154)
ADDRFULLDIR VARCHAR2(192)
TEXT_MULTI_COL_IDX VARCHAR2(1)
He populates this table using data from another table where there is only one version of the person's address.
For example:
617 S MOVIE DR HOLLY WOOD
But he takes variations of that address and writes it to his search table like this:
PARID ALTID LUC
------------------------------ ------------------------------ ----
OWNER
-------------------------------------------------------------------------------
ADDRFULL
-------------------------------------------------------------------------------
ADDRNODIR
-------------------------------------------------------------------------------
ADDRDIR
-------------------------------------------------------------------------------
ADDRFULLDIR
-------------------------------------------------------------------------------
TEXT_MULTI_COL_IDX
------------------
4545454 454545454545 0100
CHAN JACKIE KICK
617 S MOVIE DR HOLLY WOOD
617 MOVIE DR HOLLY WOOD
617 S MOVIE DR HOLLY WOOD
617 SOUTH MOVIE DR HOLLY WOOD
When he searches using the leading and trailing wildcard, it will search OWNER, ADDRFULL, ADDRNODIR, ADDRDIR and ADDRFULLDIR for whatever text the user enters. Regardless of whether they intend to search on a name or address.
Where my Text index is built based on all the columns above as follows:
begin
ctx_ddl.create_preference('WEBDATA_RP_REFERRAL_PREF', 'MULTI_COLUMN_DATASTORE');
ctx_ddl.set_attribute('WEBDATA_RP_REFERRAL_PREF','COLUMNS','parid, owner, addrfull, addrnodir, addrdir, addrfulldir');
end;
/
begin
ctx_ddl.create_section_group('WEBDATA_RP_SECTION_GROUP', 'BASIC_SECTION_GROUP');
ctx_ddl.add_field_section('WEBDATA_RP_SECTION_GROUP', 'owner', 'owner', TRUE);
ctx_ddl.add_field_section('WEBDATA_RP_SECTION_GROUP', 'addrfull', 'addrfull', TRUE);
ctx_ddl.add_field_section('WEBDATA_RP_SECTION_GROUP', 'addrnodir', 'addrnodir', TRUE);
ctx_ddl.add_field_section('WEBDATA_RP_SECTION_GROUP', 'addrdir', 'addrdir', TRUE);
ctx_ddl.add_field_section('WEBDATA_RP_SECTION_GROUP', 'addrfulldir', 'addrfulldir', TRUE);
end;
/
create index webdata.search_rp_text_idx on webdata.search_rp(TEXT_MULTI_COL_IDX) indextype is ctxsys.context
parameters('WordList WEBDATA_PREF DataStore WEBDATA_RP_REFERRAL_PREF lexer WEBDATA_LEX section group WEBDATA_RP_SECTION_GROUP sync(on commit)');
>>Yeah I'm having trouble with this concept of tokens.
Think words, not strings. The address is stored as a string in the column: "617 S MOVIE DR HOLLY WOOD"
Text deals mostly in the individual words.
Take the simple table example I provided, put that single address in TAB1 and go look at the token_text in the DR$...$I table to see the "words" actually indexed.
>>But he takes variations of that address and writes it to his search table like this:
You really shouldn't need to do that. Adds a LOT of extra data to index for no real benefit.
Remember in your other question, I mentioned I created a search_prep function that manipulated what the end user entered to search and what I actually searched?
Well, address scrubbing was part of that.
I also didn't use PHRASE searching. I took whatever the user entered and converted it to use NEAR:
https://docs.oracle.com/en/database/oracle/oracle-database/18/ccref/oracle-text-CONTAINS-query-operators.html#GUID-70B60F42-A268-4A35-9297-D7A351ABA47E
I had a filter_words table that had everything I wanted to remove.
Similar to this:
insert into filter_words values('street');
insert into filter_words values('st');
insert into filter_words values('str');
insert into filter_words values('boulevard');
insert into filter_words values('blvd');
insert into filter_words values('lane');
insert into filter_words values('route');
insert into filter_words values('rt');
insert into filter_words values('box');
insert into filter_words values('bx');
insert into filter_words values('rd');
insert into filter_words values('road');
insert into filter_words values('avenue');
insert into filter_words values('ave');
insert into filter_words values('drive');
insert into filter_words values('apartment');
insert into filter_words values('apt');
insert into filter_words values('north');
insert into filter_words values('south');
insert into filter_words values('west');
insert into filter_words values('east');
insert into filter_words values('p.o.');
insert into filter_words values('p.o');
I also removed single characters and words in the Text stopwords list: There is a CTX_STOPWORDS table.
So I would let Text index the single address "617 S MOVIE DR HOLLY WOOD"
Then when the user would search for "617 SOUTH MOVIE DR HOLLY WOOD", by the time it came out of my prep work, what I would pass off to contains is:
contains('near((617,MOVIE,HOLLY,WOOD),5,true)') > 0
So, no need to manually recreate all the other options the developer thinks may be necessary.
Word proximity and preserve-order were controlled by my search app, and the end user could set those.
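The prep step described above could be sketched like this (Python instead of the original PL/SQL; the filter-word set, function name, and defaults are all illustrative):

```python
import re

# Hypothetical filter words, mirroring a filter_words table.
FILTER_WORDS = {"street", "st", "str", "south", "north", "dr", "drive"}

def search_prep(user_input, proximity=5, preserve_order=True):
    """Drop filter words from the user's input, then build a NEAR
    clause suitable for CONTAINS."""
    words = [w for w in re.split(r"\s+", user_input.strip())
             if w and w.lower() not in FILTER_WORDS]
    return f"near(({','.join(words)}),{proximity},{str(preserve_order).upper()})"

print(search_prep("617 SOUTH MOVIE DR HOLLY WOOD"))
# → near((617,MOVIE,HOLLY,WOOD),5,TRUE)
```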
Here's the quick and dirty test:
drop table tab1 purge;
create table tab1(col1 varchar2(50));
insert into tab1 values('617 S MOVIE DR HOLLY WOOD');
commit;
create index tab1_text_idx on tab1(col1) indextype is ctxsys.context;
select token_text from dr$tab1_text_idx$i;
I get:
617
DR
HOLLY
MOVIE
WOOD
So Text took that "string" and broke it up into 5 "tokens". I still convert the term "tokens" to "words" in my mind... makes it easier to keep it straight.
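A rough illustration of what a lexer with printjoins does to a string (a Python simplification, not Oracle's actual BASIC_LEXER; it also ignores stopword removal, which is why the real index above dropped the 'S'):

```python
import re

# The printjoin characters from the WEBDATA_LEX preference above.
PRINTJOINS = "@\\&.-,()#\"_'"

def tokenize(text, printjoins=PRINTJOINS):
    """Split on anything that is not alphanumeric or a printjoin,
    so printjoin characters stay inside their tokens."""
    return re.findall(r"[A-Za-z0-9" + re.escape(printjoins) + r"]+",
                      text.upper())

print(tokenize("617 S MOVIE DR HOLLY WOOD"))
# → ['617', 'S', 'MOVIE', 'DR', 'HOLLY', 'WOOD']
print(tokenize("J&B HOLLY-WOOD"))
# → ['J&B', 'HOLLY-WOOD'] -- printjoins keep these as single tokens
```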
ASKER
wow. I'm going to try to set this up myself and see if I can get things to run. Will update you prolly Monday (unless I have yet another question).
If you are going to play with removing filter words: Pay attention that you just cannot do a simple replace. You need to replace "words".
For example: You have "STR" in your filter list, and you may have regular text in the search like "Stradivarius". If you did a simple replace, you would replace the "STR" and end up with "adivarius". Probably not what you want...
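A quick sketch of that difference (Python regex; the helper name is illustrative):

```python
import re

# A naive replace mangles words that merely contain the filter word:
print("Stradivarius".replace("Str", ""))  # → adivarius -- wrong

def drop_word(text, word):
    """Remove the filter word only when it stands alone, then collapse
    the leftover repeated whitespace."""
    cleaned = re.sub(r"\b" + re.escape(word) + r"\b", " ", text, flags=re.I)
    return re.sub(r"\s+", " ", cleaned).strip()

print(drop_word("1 Main Str", "str"))        # → 1 Main
print(drop_word("Stradivarius Ave", "str"))  # → Stradivarius Ave
```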
ASKER
would there be a way to filter only if the "STR" has spaces around it, so as to distinguish between an independent word and it being part of a word?
You'll also need to account for the special terms like: NT, PT, TR, NEAR, etc...
If you search for "NT" in your current search, I expect it to error...
>>would there be a way to filter only if the "STR" has spaces around it so to distinguish
Yes but not just spaces. What about a <tab> character or other whitespace character?
I used regular expressions and regexp_replace.
Before you ask, no, I really cannot post my complete code...
I used a PL/SQL cursor loop to go through my filter words and used:
new_str := regexp_replace(new_str, '((^(\s|%)*|(\s|%)+)(' || c1.filter_word|| '))+((\s|%)+|(\s|%)*$)', ' ');
That results in repeating spaces in the results, used another replace to remove repeating spaces with a single space.
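For illustration, a rough Python translation of that loop using the same pattern (the original is a PL/SQL cursor loop over the filter_words table, so treat this as a sketch):

```python
import re

def strip_filter_words(text, filter_words):
    """Remove each filter word when bounded by whitespace, '%', or the
    start/end of the string, then collapse the repeating spaces left
    behind (the second replace mentioned above)."""
    for w in filter_words:
        text = re.sub(
            r"((^(\s|%)*|(\s|%)+)(" + re.escape(w) + r"))+((\s|%)+|(\s|%)*$)",
            " ", text, flags=re.I)
    return re.sub(r" {2,}", " ", text).strip()

print(strip_filter_words("%617 south movie dr%", ["south", "dr"]))
# → %617 movie
```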
ASKER
I'm continuing to wrap my head around this. Not very easy. Thank you for your patience.
I've created a Stoplist preference and listed items like "ST", "STREET", "DR", and "DRIVE" for example. Then recreated my index using the Stoplist.
My query is structured this way:
select parid, owner, coalesce( addrfull, 'NO ADDRESS' ) addrfull from schema.kurpa_TEST where contains(TEXT_MULTI_COL_IDX, 'near((2,FOREST),2,true)') > 0;
It's not finding as many records as it should.
The result set is showing only those with "2 FOREST" rather than showing others such as "22 FOREST, 102 FOREST, etc".
I guess I'm doing something wrong or I'm not grasping the concept.
Not sure I would recommend an actual stopword list. You might have times you want to search for STR for things other than addresses.
NEAR doesn't remove the need for wildcards:
drop table tab1 purge;
create table tab1(col1 varchar2(50));
insert into tab1 values('22 forest');
insert into tab1 values('2 forest');
insert into tab1 values('22 forestlawn');
commit;
create index tab1_text_idx on tab1(col1) indextype is ctxsys.context;
select * from tab1 where contains(col1,'near((%2%,%FOREST%),2,true)') > 0;
ASKER
oh duh. :)
I did it with the wildcards and still got the "DRG-51030: wildcard query expansion resulted in too many terms".
The developer has told me that he's programmatically made it so that if it gets the error, it automatically tries the alternative query, which is to test each column with the wildcard instead of using the text index.
select * from search_rp where
parid like :search_term or
owner like :search_term or
addrfull like :search_term or
addrnodir like :search_term or
addrdir like :search_term or
addrfulldir like :search_term;
Maybe this is about the best we can do here given my skill level.
>>I did it with the wildcards and still got the "DRG-51030: wildcard query expansion resulted in too many terms"
It was sort of expected. It still has to expand the same '%2'.
What I was thinking though, is if you got rid of the staging table that stored all the variations and just indexed the main table, the number of indexed tokens would be smaller and might get you past the 51030?
I would look at maxing out the max_terms and figure out what caused the server to spike. It might just be undersized?
On the system where I used Text, I had it set to the max of 50,000 and didn't notice performance issues but I did a lot of scrubbing of the input search terms and wouldn't let a '%2%' get searched. Does returning anything with a 2 in it really make sense?
Instead of wholesale replacing one with the other (all Text or all LIKE), how about a hybrid?
Use Text to search for all the terms longer than two characters, and put the one- and two-character terms in LIKE predicates?
Something like:
where contains(col1,'%FOREST%') > 0 and (
col1 like '%2' or col2 like '%or%' ...
)
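That hybrid could be sketched as a simple term splitter (illustrative names; assembling the actual SQL is left out):

```python
def split_terms(terms):
    """Send terms longer than two characters to CONTAINS; keep the
    one- and two-character terms for plain LIKE predicates."""
    text_terms = [t for t in terms if len(t) > 2]
    like_terms = [t for t in terms if len(t) <= 2]
    return text_terms, like_terms

contains_part, like_part = split_terms(["2", "FOREST"])
print(contains_part)  # → ['FOREST']  e.g. contains(col1, '%FOREST%') > 0
print(like_part)      # → ['2']       e.g. col1 like '%2%'
```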
ASKER
>> I would look at maxing out the max_terms and figure out what caused the server to spike. It might just be undersized?
Possibly. The database server has 32 GB of memory and 4 CPUs.
I'm not sure about indexing the main table. I can make a copy of it and keep fiddling with datastores to see what I can do.
>>I'm not sure about indexing the main table.
Well, a copy to experimentation would be good. Better yet, a test machine...
If you are indexing the jumbled up table, you are indexing WAY more tokens than necessary for no real benefit.
I was thinking about this last night.
You really need to get the developer to rethink their approach and think like a user. I also think they are sticking to their LIKE experience, the old way of adding a leading and trailing wildcard to the entire search string.
If I'm a user and want to find an address: 2 forest
and I get results back like:
98765432 forestlawn
57892 forestgreen
Not sure how useful that will be to me as a user.
If the developer thinks those are useful, then why not: 234567890 greenforest or 999992999999 deforested?
I mean it has a 2 and forest in it but the old standard: '%2 forest%' wouldn't find it.
So, encourage the developer to give up the old LIKE methods and embrace the Text Token approach.
To add more information to help get the developer away from their current path:
I let the users provide their own wildcards because they knew what they wanted and it wasn't my place to "guess" for them.
I trapped the DRG-51030 and provided a nice error message: Wildcard too vague.
When they contacted me, I would ask what they searched for; when I saw things like '%2', I explained that it likely wasn't what they really wanted, given the amount of 'stuff' that would be returned.
In the VERY RARE cases where it is what they wanted, I as the app developer/DBA manually ran it for them.
Remember: You have direct access to all the indexed tokens! You can pull ALL the tokens that end with a '2', spool out the output, then manually create a REALLY REALLY large contains query (probably a few contains statements, due to the max length allowed in contains) and run the select manually.
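Both of those ideas can be sketched in SQL and PL/SQL. Assumptions: the index is named TEXT_MULTI_COL_IDX, so its token table follows the DR$<index_name>$I naming convention, and the friendly error message is made up:

```sql
-- Pull all indexed tokens that end with a '2' straight from the $I table:
select distinct token_text
  from dr$text_multi_col_idx$i
 where token_text like '%2'
 order by token_text;

-- Trap DRG-51030 and return a friendlier message (hypothetical block):
declare
  l_cnt number;
begin
  select count(*)
    into l_cnt
    from schema.search_rp
   where contains(text_multi_col_idx, '%2%') > 0;
exception
  when others then
    if instr(sqlerrm, 'DRG-51030') > 0 then
      raise_application_error(-20001,
        'Wildcard too vague. Please refine your search.');
    else
      raise;
    end if;
end;
/
```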
ASKER CERTIFIED SOLUTION
ASKER
wow! I can't even begin to tell you how much I appreciate this effort you are making to help me with this. You are wonderful!
I also agree that the result set is providing more than the user may be wanting.
If they want to find a street called "George" and list the properties on it, they will also get all the owners with "George" in the name. They may at that point decide to go to the "advanced search" where they can specify whether it's an address or name. It is not a good approach to me, it's more work for the user, and I hate it. Not just as the DBA but as a user myself.
I will pass on your suggestions after I've tested them a bit. This may take a while.
No problem.
I love Oracle Text and spent over 17 years learning it and still don't know it all.
I can feel your pain! There isn't much out on the web on how to do things. Had to learn a LOT of it myself and opening SR's with Oracle Support for the things I thought should be possible but couldn't figure out.
>>They may at that point decide to go to the "advanced search" where they can specify whether it's an address or name
Hint: Section groups and WITHIN... ;)
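To expand on that hint, here is a rough sketch of field sections (the section and group names are made up, and the datastore would need to wrap each column in matching tags for this to line up):

```sql
begin
  ctx_ddl.create_section_group('rp_sections', 'FIELD_SECTION_GROUP');
  ctx_ddl.add_field_section('rp_sections', 'addr',  'ADDR',  true);
  ctx_ddl.add_field_section('rp_sections', 'owner', 'OWNER', true);
end;
/

-- With the index rebuilt using that section group, "advanced search"
-- collapses into a WITHIN clause:
select parid, owner, addrfull
  from schema.search_rp
 where contains(text_multi_col_idx, 'george within addr') > 0;
```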
ASKER
The developer is starting to adjust his code to reduce those wildcard searches. One change: if the first character is a number, he no longer puts a wildcard in front of it. Instead of '%2 forrest%', it searches for '2 forrest%'.
We are seeing much better performance with the queries.
I would still strongly encourage you to not add wildcards behind the scenes. Let the end user add them and train them on proper wildcard usage. It takes a LONG TIME and I had users that still never understood how to properly search the data.
If I'm looking for 2 forestlawn, I probably don't find 2 forestgreen very useful when it is WAY ACROSS town.
Let the end user control their own destiny!
'forest' is probably a bad example. How about 'peach%' for street names in Georgia!
I also still encourage you to NOT use phrase searches and have the developer 'tokenize' the search terms into NEAR syntax. So you will find '2 west forest' when looking for '2 forest'.
'2 west forest' is just as accurate as '2 forestlawn' and '2 forestgreen'.
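A hedged sketch of that tokenize-into-NEAR idea (the span of 3 is arbitrary; tune it to the data):

```sql
-- Instead of the phrase '2 forest', search for the tokens near each other,
-- in any order; this also finds '2 west forest'.
select parid, owner, addrfull
  from schema.search_rp
 where contains(text_multi_col_idx, 'near((2, forest), 3)') > 0;
```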
ASKER
All great points slightwv. I completely agree with you.
This database contains only county property data (fortunately it isn't in Georgia). I can submit your suggestions to the developer but I have no control over what he chooses to do. He works for a different agency than me.
One thing that is good is that the developer (thanks to you) is aggressively poring through the Oracle Text documentation and, as he tests things out, has incorporated some changes to his queries. I imagine he'll do more, since we are still having issues with intermittent spikes in CPU on the database server and the logs report "no listener". Will probably open a case with Oracle to see if we can pinpoint the cause.
I found the Text documentation to be quite lacking. MANY times I found the one piece I needed buried in a sentence or two in the middle of the docs. Had I not known EXACTLY what I was looking for, I would never have found it!!!
They are good for reference but NOT learning!
I'm here for future questions!!!!!!! Leverage my 17 years of Text and searching for things like addresses!
I believe the issue is still thinking of the indexed rows as 'complete strings' NOT as indexed Tokens. It can be a hard concept to embrace.
Yes, when searching the original stored varchar2, you may need double ended wildcards. In Text, you normally don't.
>>'%2 MYSTREET%'
You really cannot just add wildcards willy-nilly; well, unless it has the intended consequences.
Multiple words in a CONTAINS separated only by spaces are treated as a phrase:
https://docs.oracle.com/en/database/oracle/oracle-database/18/ccapp/querying-with-oracle-text.html#GUID-F0106274-B719-4F65-BB62-65D52E6BDFA2
So what you asked Text to return is any phrase where a token ending with '2' is immediately followed by a token starting with 'MYSTREET'.
So, using a quick example: are you sure that is what you are after?
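A minimal reconstruction of that quick example (the matching row is invented):

```sql
select parid, owner, addrfull
  from schema.search_rp
 where contains(text_multi_col_idx, '%2 MYSTREET%') > 0;
-- As a phrase, this matches any token ending in 2 followed immediately by
-- a token starting with MYSTREET, e.g. '1234567892 MYSTREETLAWN', which is
-- almost certainly broader than the user intended.
```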
Can you confirm that when searching the '2 mystreet, you actually want '1234567892 mystreetThatHasAReallyLong