Julie Kurpa (United States) asked:

Limit SQL result set based on SIZE

I'm an Oracle DBA and have been asked by developers to help them limit the result-set size produced by a particular procedure within a package.

The package was coded by a vendor who is no longer in the picture and the developers have limited understanding of how things all work together. It's all very complicated and nobody wants to be the one who "touched" it and broke it.

The problem is intermittent: the result set produced by the package is sometimes over the size limit that the 3rd-party application can accept (4MB).

I suppose the package query can be modified to limit the row count, but before recommending that route I'd like to know: is there a way to do it at the database level?

This is Oracle 11g Enterprise (yes, we are working to migrate to Oracle 19c but can't cut over yet).

I've been searching and cannot find any solution other than modifying the package SQL to limit the rows.

Any suggestions?
ste5an (Germany) replied:

Well, a single row can itself exceed 4 MB: at up to 1,000 columns per row of VARCHAR2(4000) each, that's already ~4 MB, and each LOB column can push it higher.

I'm not aware of a database setting. The only option, imho, is to add a ROWNUM predicate to the outer SELECT statements.

>>The problem is intermittent where the result set of the package is over the size limit that the 3rd party application can accept (4MB).
Well, this is a pretty weird limitation of that application. I would ask for a fix for it instead of trying to fiddle in the actual packages.
Julie Kurpa (asker) replied:
>>Well, this is a pretty weird limitation of that application. I would ask for a fix for it instead of trying to fiddle in the actual packages.
Yes, I agree. It's a situation where the output gets shared with multiple agencies, and some agencies have limitations. I honestly don't know much more beyond that.

OK...so no database level throttle on the result set.  Bummer.



slightwv (䄆 Netminder) replied:
What is the output of the package?  Is it a CLOB, SYS_REFCURSOR, ???
How is it sent to the outside agency?

>>If the data generated is 20MB, how will you notify the agency to expect 4 separate pieces of information per a single request?

From the brief explanation I got, it seems that when it's too big, it fails (it's fortunate this is intermittent). The receiving agency does some sort of manual process to combine the pieces.

How it's sent?  ummm  does MSMQ mean anything to you?  That's what they use and that is where the limitation is.

In looking at the code, I see it's using SYS_REFCURSOR for the output.  Two cursors:

cur_item1_out OUT sys_refcursor,
cur_item2_out OUT sys_refcursor
>>does MSMQ mean anything to you?  

I know what it is.  Don't know anything else about it.


>>I see it's using SYS_REFCURSOR for the output.

Then the only thing I can think of is limiting the number of rows returned.  You should be able to guess how many rows can fit in 4MB.
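As a rough way to make that guess (table and column names here are hypothetical; VSIZE is the standard Oracle function returning a value's size in bytes):

```sql
-- Roughly how many rows fit in 4 MB, based on average row width
SELECT FLOOR(4 * 1024 * 1024 /
             AVG(NVL(VSIZE(col1), 0) + NVL(VSIZE(col2), 0))) AS rows_per_4mb
FROM   some_table;
```

This ignores per-row protocol overhead, so treat the result as an upper bound and leave some headroom.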

You probably cannot use ROWNUM like suggested above without some pretty big changes to the procedure because of how Oracle assigns them.

What will probably be easier is to pass in two more parameters to the procedure: start_page and page_size.

Then use ROW_NUMBER() OVER() window function to limit the rows returned per call.
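A minimal sketch of that shape (all names hypothetical; written as a package-body fragment, not the vendor's actual code):

```sql
-- Hypothetical paging sketch for a procedure returning a ref cursor.
-- start_page is 1-based; page_size is rows per call.
PROCEDURE searchquery (
   idno          IN  VARCHAR2,
   start_page    IN  PLS_INTEGER,
   page_size     IN  PLS_INTEGER,
   cur_item1_out OUT SYS_REFCURSOR
) AS
BEGIN
   OPEN cur_item1_out FOR
      SELECT col1, col2
      FROM (
         SELECT t.col1, t.col2,
                ROW_NUMBER() OVER (ORDER BY t.col1) AS rn
         FROM   some_table t
         WHERE  t.idno = searchquery.idno
      )
      WHERE rn BETWEEN (start_page - 1) * page_size + 1
                   AND  start_page * page_size;
END;
```

Each call then returns at most page_size rows; the caller asks for page 1, 2, 3, ... until fewer than page_size rows come back.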
Thanks slightwv!
You are right, adding a ROWNUM looked very daunting.  

I'm waiting for them to give me some data that I can test with.

Do you happen to know how to call a package and pass variables to it?  I have no idea.

The package is named:  SEARCHPKG
The procedure inside SEARCHPKG is "SEARCHQUERY".
SEARCHQUERY has this specified for the IN and OUT:
--
PROCEDURE SEARCHQUERY (
idno IN VARCHAR2,
sendingAgency IN VARCHAR2,
cur_item1_out OUT sys_refcursor,
cur_item2_out OUT sys_refcursor
) AS...stuff.

Here's my attempt:

BEGIN searchpkg.searchquery('555555','xxxx'); END;

      *
ERROR at line 1:
ORA-06550: line 1, column 7:
PLS-00306: wrong number or types of arguments in call to 'SEARCHQUERY'
ORA-06550: line 1, column 7:
PL/SQL: Statement ignored
>>Do you happen to know how to call a package and pass variables to it?  I have no idea.

You'll need to change the procedure to allow those two parameters THEN change the code to process them correctly.

>>BEGIN searchpkg.searchquery('555555','xxxx'); END;

You'll need to create the two cursor variables and provide them to the call.


Something like this (I created a quick procedure that uses an OUT cursor):
create or replace procedure quick_test(cur_item1_out  out sys_refcursor)
is
begin
   open cur_item1_out for select 'Hello' bob from dual;
end;
/

variable mycur refcursor

exec quick_test(:mycur);
print mycur;


Ah OK...I've got it working now.  Much appreciation slightwv.  :)

I had a long conversation with one of the developers who was able to answer some of my questions.  The main developer is off today so we'll get with him after he's had his coffee to find out more.  

Still piecing things together.

Here's another question please....

When a variable is defined in a procedure as a certain size, does it stay that size regardless of the size of the value?

For example, the package builds a query statement and places it in this variable.

namesubqry      varchar2(32000) := '';

if the value of "namesubqry" only ends up being 1,000 characters, would its true size be 1,000, or will it be padded out to 32,000?



>>does it stay that size regardless of the size of the value?

I suppose it depends on how you define "size".  Per the PL/SQL docs: for a VARCHAR2 variable declared smaller than 2,000 bytes, PL/SQL allocates enough memory for the declared maximum up front; for one declared at 2,000 bytes or larger (like your 32,000), it allocates memory at run time, only enough to hold the actual value.

As far as the "size" when accessed, I suggest you start with the docs:
https://docs.oracle.com/en/database/oracle/oracle-database/21/sqlrf/Data-Types.html#GUID-7B72E154-677A-4342-A1EA-C74C1EA928E6

Specifically:
VARCHAR2:  Variable-length character string
and
CHAR:  Fixed-length character data

or perform your own simple test:
declare
   var1 varchar2(32000) := 'A';
   var2 char(32000) := 'A';
begin
	dbms_output.put_line('var1 length: ' || length(var1));
	dbms_output.put_line('var2 length: ' || length(var2));
end;
/


If you have a table tableName, you can create a view that limits the result set to, for example, 1,000 rows while returning the same set of fields:

     CREATE VIEW viewName AS
     SELECT field1, field2, ... FROM (
        SELECT field1, field2, ...,
        ROW_NUMBER() OVER(
           PARTITION BY pk_id ORDER BY pk_id DESC
        ) rn 
        FROM tableName
      ) 
      WHERE rn <= 1000;

"ORDER BY pk_id DESC" defines which 1,000 rows will be included.

Another case:

If you have "SELECT value FROM table" and you want to limit the size of value to 4,000 bytes, you can:
SELECT SUBSTR(value,1,4000) AS fieldName FROM table

>>Create view limited result set with 1000 rows

That will only return the 'top' 1000 rows.  Using that method will make it IMPOSSIBLE to ever return the rest of the results.  So, it won't work.

>>and you want to limit the size of value with 4000 bytes you can:

That isn't even part of the question.  It is about limiting the size of the entire result set.

Please read the question in more detail before posting stuff that won't work.
>>How it's sent?  ummm  does MSMQ mean anything to you?  That's what they use and that is where the limitation is.
Where is MSMQ used? Who is responsible for it? These people need to implement a store-and-forward-style system in front of MSMQ.
Slightwv, ah, that was a good test.  I see that a VARCHAR2 adjusts its size based on the contents.  Thank you!

declare    
var1 varchar2(32000) := 'A';    
var2 char(32000) := 'A';
begin
   dbms_output.put_line('var1 length: ' || length(var1));    
   dbms_output.put_line('var2 length: ' || length(var2));
end;
/

output:
var1 length: 1
var2 length: 32000




The developer has had his coffee this morning and we've opened dialogue.  

The package actually contains 2 procedures with the same name.  It seems that the package knows which one of those procedures to execute based on the parameters passed.

What I've learned so far is that the package builds a SQL statement based on the parameters passed.  The built SQL is not a cut-and-dried query but one that does a soundex search and other generic stuff.  ugh

It then sends that built SQL statement through MSMQ (as XML) to another environment (Biztalk) which executes the SQL in another database.  The result set is then sent back to the user who reviews the data to select the correct item from the results.

Only the developer has access to the XML and error messages, and this morning he was able to give me more information to narrow down which of the 2 procedures is being executed when the error occurs.
He's working on finding out what data is actually being passed to the package when the error occurs so I can try to recreate the query.

Then I'll probably reach out to you on how to incorporate the row_number() stuff into the building of the SQL in that package.

I'm wondering if I should open another question and close this one?
 
Probably doesn't need to be a new question since it is still related to how to reduce/limit the results.

I'll provide some proof of concept code for you to incorporate into your procedures.

BUT, if you are looking to post the actual code and have us to make the code changes for you, I'm afraid you might be disappointed.
Thanks.  
I wasn't going to try to get you to do that.  
Don't do anything right now until I can gather more info.  

I say this because they've just provided me the data that was searched for the last time it failed and it looks to me like it's not a result set size problem like they think.  

Please hold on using your time for this until I can get more info.  

I GREATLY appreciate your help.   
I discovered that the original developer of this Package (from 2008!) has a little statement in the package to write the built SQL statements to a table as a CLOB!   So I was able to grab the SQL that is returning the large result set.

To try and figure out the size of the resultset, I turned on tracing and ran the query.  Do you think the "4,189" is the value showing an accurate result set size returned?

Statistics
----------------------------------------------------------
     277502  recursive calls
          1  db block gets
         61  consistent gets
          0  physical reads
        212  redo size
     692649  bytes sent via SQL*Net to client
       4189  bytes received via SQL*Net from client
        340  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
       5076  rows processed



>>is the value showing an accurate result set size returned?


Nope.  That is "bytes" sent from the client connection to the server.

The server did send 692,649 bytes back to the client.  I cannot say if that was the entire result set or not; it is just "bytes" back and forth between the client connection and the db server.
If I spooled the output to a text file, would the size of the text file be a close representation of the result set size?
Should be close.

Without knowing exactly what MSMQ is doing with it, there may be some space padding and possibly extra lines for page breaks, depending on how you set things up prior to spooling.
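For what it's worth, standard SQL*Plus settings can strip most of that padding before spooling (these are real SQL*Plus commands; the file name is just an example):

```sql
SET TRIMSPOOL ON     -- strip trailing blanks from each spooled line
SET PAGESIZE 0       -- suppress page headings and breaks
SET LINESIZE 32767   -- avoid wrapping long rows
SPOOL results.txt
-- run the query here
SPOOL OFF
```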
OK.  the text file ended up being 4,481 KB.  

For the query, I've chosen a place to put the ROWNUM <= XX statement.  

Here is a gist of the query with where I want to put the ROWNUM statement.   It runs and gives me the expected rows.  

select * from (WITH xx AS
(
--subquery here with soundex
)
, yy as
(
-- subquery here
)
--subquery
union all
--subquery
)
where rownum <= xxxx  <<< this is where I think to put the rownum limit.
order by column asc, (UTL_MATCH.JARO_WINKLER_SIMILARITY(LASTNAME, upper('DOE')) + UTL_MATCH.JARO_WINKLER_SIMILARITY(FIRSTNAME, upper('JOHN'))) DESC;






>>OK.  the text file ended up being 4,481 KB.  

Well, that is technically over the 4MB limit, but just barely.  Off the top of my head, I'm not sure how to reduce the results from a ref cursor.  The extra data might be spacing applied when being 'displayed'.  I'm not sure how to set up a test case to see if that is what's happening.

>>where rownum <= xxxx  <<<this where I think to put rownum limit.

As I mentioned above, you won't be able to use ROWNUM like that to limit data.

There's a LOT of information on the web about this pseudocolumn: ROWNUM is assigned BEFORE the order by is applied.  It also cannot be used in a BETWEEN-type predicate, so how do you plan on getting the next "batch" of rows?

Yes, you "can" go through the tricks to get it to work but as I posted using ROW_NUMBER is cleaner.
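To see the difference concretely (a generic sketch against a hypothetical tab1, not the package SQL):

```sql
-- Returns 10 arbitrary rows and THEN sorts those 10:
-- ROWNUM is assigned before the ORDER BY is applied.
SELECT * FROM tab1 WHERE ROWNUM <= 10 ORDER BY col1;

-- Always returns zero rows: the first candidate row gets ROWNUM = 1,
-- fails the predicate, is discarded, and the next candidate is tested
-- as ROWNUM = 1 again, so ROWNUM never reaches 10.
SELECT * FROM tab1 WHERE ROWNUM BETWEEN 10 AND 20;

-- The classic ROWNUM workaround needs double nesting;
-- ROW_NUMBER() is cleaner.
SELECT *
FROM (SELECT t.*, ROWNUM rn
      FROM (SELECT * FROM tab1 ORDER BY col1) t)
WHERE rn BETWEEN 10 AND 20;
```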
I am reading about ROW_NUMBER () OVER () and am having trouble incorporating it into the SQL.

The SQL is long so I'm trying to leave out stuff you don't need to see.  Hopefully I haven't made it more confusing.

THIS IS THE ORIGINAL:

 select * from (WITH nq AS
(select mytable.idno as idno
from  owner.view_person@sid mytable
WHERE  mytable.idno is not null
AND    mytable.FIRSTNAME LIKE 'J%'
AND    (mytable.C_NAME LIKE 'DOE%' or mytable.C_SOUNDEX LIKE 'D000%')
)
)
, dq as
(
-- subquery here getting person data based on "nq.idno" match
)
--subquery getting additional person data based on "nq.idno" match
union all
--subquery getting yet more person data based on "nq.idno" match
)
order by SEARCH_ORDER asc, (UTL_MATCH.JARO_WINKLER_SIMILARITY(LASTNAME, upper('DOE')) + UTL_MATCH.JARO_WINKLER_SIMILARITY(FIRSTNAME, upper('JOHN'))) DESC;



THIS IS MY ATTEMPT WITH ROW_NUMBER():

 select *, row_number() over (order by SEARCH_ORDER asc) as ROWNUM  from (WITH nq AS
(select mytable.idno as idno
from  owner.view_person@sid mytable
WHERE  mytable.idno is not null
AND    mytable.FIRSTNAME LIKE 'J%'
AND    (mytable.C_NAME LIKE 'DOE%' or mytable.C_SOUNDEX LIKE 'D000%')
)
)
, dq as
(
-- subquery here getting person data based on "nq.idno" match
)
--subquery getting additional person data based on "nq.idno" match
union all
--subquery getting yet more person data based on "nq.idno" match
)
order by SEARCH_ORDER asc, (UTL_MATCH.JARO_WINKLER_SIMILARITY(LASTNAME, upper('DOE')) + UTL_MATCH.JARO_WINKLER_SIMILARITY(FIRSTNAME, upper('JOHN'))) DESC;



Here's the error I keep getting:

select *, row_number() over (order by SEARCH_ORDER asc) as ROWNUM  from (WITH
        nq as (select mytable.ido.idno as mtc
        *
ERROR at line 1:
ORA-00923: FROM keyword not found where expected



Many things wrong and not advised with all that.

First: although it's 'legal' SQL, wrapping a 'select *' around the CTEs (the WITH clause) is bad form.

What you have is:
select * from (
with cte as (
	select 'Hello' bob from dual
)
select * from cte
);



It is typically just written as:
with cte as (
	select 'Hello' bob from dual
)
select * from cte



Second:  Get in the habit of NOT using '*' in a select.  It too is bad form and can break more code than it ever solves.

Get in the habit of specifying the individual columns.

Next:  ROWNUM is a reserved word so using that as a column alias is also bad.

So, here is what I suggest: remove the 'select * from (' wrapper from around the entire CTE.
Make a new CTE using ALL the specific column names and add row_number to that.

Then limit the rows in the final select.  No need for the last 'order by' because the row_number will do that for you.


Here's the quick test I suggested earlier.
The setup:
--set up some dummy data
create table tab1(col1 char(1), col2 char(1));
insert into tab1 select dbms_random.string('a',1), dbms_random.string('a',1) from dual connect by level <= 1000;
commit;



Here is more or less what you have based on the test case:
select * from (
with cte as (
	select col1, col2 from tab1
)
select col1, col2 from cte
)
order by col1;




Here is what I'm suggesting, pulling rows 10 through 20:
with cte as (
	select col1, col2 from tab1
),
cte_final as (
	select col1, col2, row_number() over(order by col1) rn
	from cte
)
select col1, col2
from cte_final
where rn between 10 and 20
;




For your final code, add start_page and page_size as input parameters and that should be all you need to compute the values for:  where rn between 10 and 20
ah I see it better now.  Thank you!  
I've tweaked it and it seems to be producing the correct data in the needed order.  
I submitted it to the developers to look at and test since they know the data better than I and can compare it with the previous result sets.  

Not fully understanding what you mean by start_page and page_size as input parameters.  You're not talking about something sent from the application are you?   Because nobody here wants to touch that part. 
>>Not fully understanding what you mean by start_page and page_size as input parameters.  You're not talking about something sent from the application are you?

Yes.

Using row_number you now have a way to limit the number of rows coming back.  So, whatever is receiving the results needs a way to know they don't have 'everything'.  Right?

How will they know that?

So, you magically decide that 1,000 rows is a "batch" and the query as it is now returns 3,000 rows.  The app needs to make 3 calls to the procedure to get ALL the rows.

To do that, you need to tell the procedure what 'batch' is being asked for.  To do that, I suggest two parameters:  start_page and page_size.
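The arithmetic for the predicate is simple (parameter names hypothetical):

```sql
-- start_page = 3, page_size = 1000  ->  rows 2001 .. 3000
WHERE rn BETWEEN (start_page - 1) * page_size + 1
             AND  start_page * page_size
```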
If MSMQ is the problem, then you need a layer around it to do the paging, when you don't want to touch the application. But this requires a clean application and qualified people to write/add such a layer or even middleware.

This is not trivial at all.
>>So, you magically decide that 1,000 rows is a "batch" and the query as it is now returns 3,000 rows.  The app needs to make 3 calls to the procedure to get ALL the rows.

I hadn't planned to send multiple batches.  Just limit it to 1 smaller one.  It's not required that the entire result set get sent.

To explain why the user doesn't need to see the entire result set:  

This application is related to the justice system.  When a person is entered into the system, this query does a search to see if they are already in the database.   The query will return results that most closely matches the data entered by the user.   This is using some sort of soundex column and  the UTL_MATCH.JARO_WINKLER_SIMILARITY that you saw in my sample code.  I don't know anything about how that works.

The user will review the first several rows returned. If they see a match at the top, good...they will update that record with the new information.  Otherwise, they might review the first several lines of the result set looking for another possible match.   They can't possibly go through all 5K names so limiting them to a smaller set is fine.

For example:  searching for SMITH, JOHN  with DOB of 01/01/1901  might return (in this order):

SMITH, JOHN 01/01/1901  <--- if he's in the database already
SMITH, JO   XX/XX/XXXX   <-- these others because they are similar
SMITH, JOE XX/XX/XXXX
SMITHY, JOHN XX/XX/XXXX
SMYTHE, JACK XX/XX/XXXX

Using the original search criteria (where the subject was not already in our database) with the new code (using ROW_NUMBER() and WHERE RN BETWEEN 1 AND XX),  the result set was sorted in the correct order and I saw similar names listed first.  Like below:

SMITH, JO   XX/XX/XXXX
SMITH, JOE XX/XX/XXXX
SMITHY, JOHN XX/XX/XXXX
SMYTHE, JACK XX/XX/XXXX

This was very encouraging.
 
So I changed the search name to one that I knew existed in the database expecting the result to show the matching name first followed by the other similar names.  But it did not and the result set started with names beginning with the letter "A" (actually this varied). 

Removing the   "WHERE RN BETWEEN 1 AND XX" showed me what I expected like below:

SMITH, JOHN 01/01/1901  <--- Because he's in the system already
SMITH, JO   XX/XX/XXXX   <-- these others because they are similar
SMITH, JOE XX/XX/XXXX
SMITHY, JOHN XX/XX/XXXX
SMYTHE, JACK XX/XX/XXXX

I'd want to show you my code (with extraneous columns removed) but if you feel this is doing my work for me, I understand.

Obviously I'm completely missing something about this process. 
>>So I changed the search name to one that I knew existed in the database expecting the result to show the matching name first followed by the other similar names.

That's one of the issues using soundex.  It's pretty dumb when looking for similar things.  The actual logic is posted in the docs somewhere.

I played with JARO_WINKLER_SIMILARITY and even the EDIT_DISTANCE_SIMILARITY function in the past and found them to be poor as well.

You can read about them in the docs as well:
https://docs.oracle.com/en/database/oracle/oracle-database/21/arpls/UTL_MATCH.html#GUID-D2CEE27F-3558-4607-A8FB-D5BB5C6EBFE0

If you can set up a very simplified test case with some names and your target name, then a soundex and JARO_WINKLER_SIMILARITY query showing it isn't working, I can take a look at it to see if I can offer an opinion.

Should be a simple table with a name column and a pretty simple SQL statement.  I'm not looking for a few hundred line SQL statement.
Thanks.  Here are the results from the new SQL using the same search criteria.

Here's the query attached.
Test_Query_for_EE.sql

Output results for new query(with RN BETWEEN 1 AND 5000):  
       - if name & dob match someone:  the match is at the top of the result set. The data following is random.
       - if only name matches someone: random result set...no similar names at all.

Output results for new query (removed "where RN BETWEEN 1 AND 5000"):
       - if name & dob match someone:  the match is at the top of the result set followed by other similar names.
       - if only name matches someone: similar names listed first.


That value I'm using for C_SOUNDEX in the SQL might not be correct.  I need to figure out how the package decides what to put there. 
You didn't provide the simple test case that I asked for.  I cannot walk your code and try to figure out what you need to do and how it all is supposed to work and what decisions were made and why.

I was looking for a simple table like:
create table names(first_name varchar2(50), last_name varchar2(50));
insert into names values...<whatever sample data you want to provide>



Then a simple select against that using your soundex and jaro_winkler calls to show what isn't working for you and what you think should be working.

If you cannot come up with that, then I don't know how you are expected to take the procedure you have and get it working.

All that said:
I have no idea why JARO_WINKLER_SIMILARITY is only in the ORDER BY.  Seems like a waste.  It should probably be used in one of the CTEs to help weight the match.


I don't think you put the ROW_NUMBER function in the proper place.  You didn't follow my suggestion.  Also, row_number cannot be 0 so no need to include it in the where clause.

Here is what my example above was trying to show you:
Keep the 'jk' CTE as it was without the row_number.

Take your final select without the WHERE and make that a NEW cte and add row_number there

You don't need a final ORDER BY because the ROW_NUMBER function already took care of that.

search_order seems to be somewhat important so it might need to be part of the row_number.

...
, newCTE as (
select CNTQRYTYPE,
SEARCH_ORDER,
FIRSTNAME,
LASTNAME,
PIN,
DOB,
SEX_CODE,
RACE_CODE,
		row_number() over(order by LASTNAME,FIRSTNAME,D_DATE_OF_BIRTH) as RN
FROM JK
)
select distinct CNTQRYTYPE,
SEARCH_ORDER,
FIRSTNAME,
LASTNAME,
PIN,
DOB,
SEX_CODE,
RACE_CODE
from newCTE  where rn between 1 and 500



Did you read the docs on soundex?  It tells you how it works.
https://docs.oracle.com/en/database/oracle/oracle-database/21/sqlrf/SOUNDEX.html#GUID-9C43625B-70CA-4B43-AE22-5EC2A02192F8

The reason I ask is because of this:
(em.C_NAME LIKE 'SMITH%' or em.C_SOUNDEX LIKE 'D000%')

In the docs link, "rule 1" states:
Retain the first letter of the string

So, 'John' or 'Smith' CANNOT produce a soundex result that starts with a 'D', so I have no idea what that LIKE is supposed to be doing.  I also have no idea why you stored the values from previous soundex calls.
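You can verify rule 1 directly (standard Oracle SOUNDEX; the outputs follow the rules in the doc linked above):

```sql
SELECT SOUNDEX('SMITH') FROM dual;  -- S530
SELECT SOUNDEX('JOHN')  FROM dual;  -- J500
SELECT SOUNDEX('DOE')   FROM dual;  -- D000: only a D-name can produce a D... code
```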
Thanks Slightwv.   I'm doing my best.  
I've got some test data set up for you (only 125 rows).  I have to "tweak" the names to protect the innocent.  I made sure the soundex stuff was correct according to the names.  

Running the query using your suggested SQL above seems to work fine for the 125 rows and shows me data that is at least in the ballpark at the top.

But when I change it to run against the REAL table (with the 1,000's of rows) using same SQL code, the result set begins with names starting with the letter "A".  

Attaching both test data and query: Create_Table_Insert_Data.txt and Test_Query_for_EE.sql
Thanks for the files.  I've loaded everything up and ran the query.

>>seems to work fine for the 125 rows and shows me data that is at least in the ballpark at the top.
I see things that I wouldn't expect.

You are searching for a 'J DEW' and I see this in the results:
NAME_RACE_SEX_DOB 3 THYME                                    JAMES                TIBERIUS   07-AUG-44

I understand why it's in there.  James Thyme has the same i_spn_pin  as the James Dew.

The one you are after is actually a DOB of 07-AUG-1933.  If you don't have a way to properly join nq with em in the final query, you'll get bad stuff as long as i_spn_pin  is repeated.


That said:
I think it returns extra data.
If a name and DOB matches in both NQ and JK, why return them both?

I would expect the name with DOB to be pulled and the same row with just name to be ignored.


I also see NO reason to store the c_soundex column.  I don't see where it serves any real purpose.

I realize that the SQL you provided is a small subset of much larger code, but you are also only doing right-hand wildcards, so the rest of the soundex values don't do anything.  For example, the first row has 'S530 B430 O416'.  You never look for the O416?


I'm going to take a leap of faith and assume that if a row gets BOTH a search_order of '3' and '4', you want '3' instead of '4'.

The params CTE is just a simple way to not have to keep hard-coding values everywhere if I need them in more than one place.


Your query against that data returns 16 rows and has what I believe to be 'bad' results.

Here is how I would rewrite that SQL using the logic you have.  It returns 9 rows and doesn't use the soundex column.

with params as (
	select
		'J%' p_first_name,
		'DEW%' p_last_name,
		'08071933' p_dob
	from dual
)
select
	case when em.d_date_of_birth = to_date(p_dob,'MMDDYYYY') then 'NAME_RACE_SEX_DOB' else 'NAME_RACE_SEX' end AS CNTQRYTYPE,
	case when em.d_date_of_birth = to_date(p_dob,'MMDDYYYY') then '3' else '4' end AS SEARCH_ORDER,
	replace (em.FIRSTNAME, chr(39), '') AS FIRSTNAME,
	replace (em.MIDDLENAME, chr(39), '') AS MIDDLENAME,
	replace (em.LASTNAME, chr(39), '')    AS LASTNAME,
	em.i_spn_pin                          AS PIN,
	em.D_DATE_OF_BIRTH                    AS DOB,
	em.C_GENDER_CD                        AS SEX_CODE,
	em.C_RACE_CD                          AS RACE_CODE
from  test_table em
	cross join params
WHERE  em.i_spn_pin is not null
AND    em.FIRSTNAME LIKE p_first_name
AND    (em.C_NAME LIKE p_last_name or em.C_SOUNDEX LIKE soundex(p_last_name) || '%')
;


Thanks Slightwv.

>>I understand why it's in there.  James Thyme has the same i_spn_pin as the James Dew.
I didn't notice that.  But it makes sense.  The output is correct then.

>>If a name and DOB matches in both NQ and JK, why return them both?

I know the data is confusing. It's related to a criminal justice database.   The developers who work with this data could explain it better but here's my attempt:

Due to the nature of criminal justice, not all data is available when someone is entered into the system.  

The i_spn_pin is a unique identifier that gets assigned to an individual regardless of whatever name or DOB they used when they got entered into the system.  But the i_spn_pin helps to track an individual regardless of what name (or the lack of important data)  is provided.

For example the first row has 'S530 B430 O416'.  You never look for the O416?

When a person is entered into the database,  an Oracle package uses the criteria entered by the user and builds the query we've been testing with.   The dynamic query uses the same structure but the search criteria may vary.  
The result set helps them determine if this person should be assigned a new i_spn_pin or can be linked to an existing i_spn_pin.    Middle names may not be available and may not exist in the other entries (if indeed there are any) for this person.
>>I'm going to take a leap of faith and assume that if a row gets BOTH a search_order of '3' and '4', you want '3' instead of '4'.

If there's a match for the '3' group, it's preferred that it be listed first.   But just in case the wrong DOB is entered (or provided), they'd like to see other potential matches without DOB.

It's so complicated but this is what they came up with years ago that seems to work decently.  

It's only that intermittently the result set of possible matching names is too large.
The developers gave me the date & time of the query, I found it, and am using it for a base to try and get it to limit the result set.
I suspect that the issue occurs when the exact name is not found at all and the soundex causes the result set to include too many possible combinations.

I hope that helps.

ASKER CERTIFIED SOLUTION
slightwv (䄆 Netminder)
(solution text available to members only)
Thanks.   There is NO way they will make changes to the application side.   :)

I think I've made a small breakthrough and the query is giving me what I think is correct data in the correct order. I'm not ready to post what I ended up doing just yet.

The developers are now testing with it through the application.  
I know this is not what you suggested (I'm leaving the "select *" in the code), but it's what I ended up doing, mostly because I didn't want to make any more SQL changes than absolutely necessary.  The developers have been using this query for a few years already just like it is and are happy with it.

The changes I put in have been working well and they and the users were pleased.  I GREATLY appreciate the help you gave with this process as I would never have been able to figure it out.

I added a new CTE at the beginning that included the UTL_MATCH.JARO_WINKLER stuff which gave me the result set in the proper order.  Then the 2nd CTE selected the first 2000 rows from the 1st CTE.    The rest of the code pretty much stayed the same.

select * from (
  WITH nql as (
    select em.i_spn_pin as spn
    from   myschema.view_persons@DBLINK em
    WHERE  em.i_spn_pin is not null
    AND    em.FIRSTNAME LIKE 'P%'
    AND    (em.C_NAME LIKE 'SMITH%' or em.C_SOUNDEX LIKE 'S530%')
    ORDER BY (UTL_MATCH.JARO_WINKLER_SIMILARITY(LASTNAME, upper('SMITH'))
            + UTL_MATCH.JARO_WINKLER_SIMILARITY(FIRSTNAME, upper('BLUTO'))) DESC
  ),
  nq as (
    select spn from nql where rownum <= 2000
  ),
  dq as (
    select em.i_spn_pin as spn
    from   myschema.view_persons@DBLINK em
    WHERE  em.i_spn_pin is not null
    AND    em.FIRSTNAME LIKE 'P%'
    AND    (em.C_NAME LIKE 'SMITH%' or em.C_SOUNDEX LIKE 'S530%')
    AND    em.d_date_of_birth = to_date('01201965','MMDDYYYY')
  )
  select distinct
         'NAME_RACE_SEX'                      AS CNTQRYTYPE,
         '4'                                  AS SEARCH_ORDER,
         replace(em.FIRSTNAME, chr(39), '')   AS FIRSTNAME,
         replace(em.MIDDLENAME, chr(39), '')  AS MIDDLENAME,
         replace(em.LASTNAME, chr(39), '')    AS LASTNAME,
         em.i_spn_pin                         AS PIN,
         em.D_DATE_OF_BIRTH                   AS DOB,
         em.C_GENDER_CD                       AS SEX_CODE,
         em.C_RACE_CD                         AS RACE_CODE
  FROM   myschema.view_persons@DBLINK em,
         nq
  WHERE  em.c_spn          = 'Y'
  AND    em.c_allow_search = 'Y'
  AND    em.I_SPN_PIN      = nq.spn
  UNION ALL
  select distinct
         'NAME_RACE_SEX_DOB'                  AS CNTQRYTYPE,
         '3'                                  AS SEARCH_ORDER,
         replace(em.FIRSTNAME, chr(39), '')   AS FIRSTNAME,
         replace(em.MIDDLENAME, chr(39), '')  AS MIDDLENAME,
         replace(em.LASTNAME, chr(39), '')    AS LASTNAME,
         em.i_spn_pin                         AS PIN,
         em.D_DATE_OF_BIRTH                   AS DOB,
         em.C_GENDER_CD                       AS SEX_CODE,
         em.C_RACE_CD                         AS RACE_CODE
  FROM   myschema.view_persons@DBLINK em,
         dq
  WHERE  em.c_spn          = 'Y'
  AND    em.c_allow_search = 'Y'
  AND    em.I_SPN_PIN      = dq.spn
)
ORDER BY SEARCH_ORDER ASC,
         (UTL_MATCH.JARO_WINKLER_SIMILARITY(LASTNAME, upper('SMITH'))
        + UTL_MATCH.JARO_WINKLER_SIMILARITY(FIRSTNAME, upper('BLUTO'))) DESC



I see you continue to ignore my warnings about using ROWNUM.

The optimizer can and probably will rewrite your SQL, so you may not be getting the 2000 rows you think you are getting.

You also normally don't wrap your CTEs in an outer select like that, but OK, to each their own.

I'm still convinced you don't need to hit myschema.view_persons@DBLINK  4 times.

Then you have things that make no sense at all:
...
em.FIRSTNAME LIKE 'P%'
...
UTL_MATCH.JARO_WINKLER_SIMILARITY(FIRSTNAME, upper('BLUTO'))
...
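For reference, here's a sketch of the patterns that do guarantee the limit is applied after the sort on 11g (names are borrowed from your query; the JARO_WINKLER arguments are trimmed down just for illustration):

```sql
-- 11g-safe top-N: ORDER BY in an inline view, ROWNUM applied outside it.
-- Oracle guarantees ROWNUM here is assigned to the already-sorted rows.
select spn
from (
  select em.i_spn_pin as spn
  from   myschema.view_persons@DBLINK em
  where  em.i_spn_pin is not null
  order by utl_match.jaro_winkler_similarity(em.lastname, 'SMITH') desc
)
where rownum <= 2000;

-- Equivalent with an analytic function, which also survives view merging:
select spn
from (
  select em.i_spn_pin as spn,
         row_number() over (
           order by utl_match.jaro_winkler_similarity(em.lastname, 'SMITH') desc
         ) rn
  from   myschema.view_persons@DBLINK em
  where  em.i_spn_pin is not null
)
where rn <= 2000;
```

Once you're on 12c or later (e.g. after the 19c migration), the same thing can be written as `order by ... fetch first 2000 rows only`.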
Thanks Slightwv.  

Yes I wondered about some of that code also.

BUT it's not my code and was written years (like 10?) ago by a developer who has retired.   The current developers (who are not really Oracle developers) took one look at this code and tossed it to me because I'm the Oracle DBA.   I like a challenge but I can't (won't) rewrite it because it's not my job to do so and I have to draw a line somewhere.   Plus I'd like to go back to doing my real job.  :)
 
The code is generated by a PL/SQL Package that creates a different set of SQL based on the criteria passed from the application.  I had to jump through hoops just to add that extra CTE without messing up the rest of the package code. 

You are wonderful and so smart.  Always helping me so patiently.  I've got to figure out how to close this question.  
If the 'BLUTO' is in there, it's wrong.

>>because it's not my job to do so and I have to draw a line somewhere.

I'm sorry you feel that way.  You claim this is related to criminal justice.  What if the bad guy gets away because the system missed something due to a simple SQL issue that returns wrong results?

If correct results really don't matter, then the application really doesn't matter so just do away with it.  If the system is important, it should function properly.

>>because I'm the Oracle DBA.

Oracle DBAs write SQL too.  They should also be familiar with the databases they support and how they operate.

>>I've got to figure out how to close this question.  

Accept the post or posts that helped provide the solution by selecting "This is helpful" and "This is the solution" on as many comments as apply.



All that said, I ran the SQL you posted with the same data you provided back in #a43285239

I believe it has a bug.
Using the parameters from way back then:
first name:  'J%'
last_name: 'DEW%'
and DOB: '08071933'
Your SQL returned:
CNTQRYTYPE        S FIRSTNAME  MIDDLENAME LASTNAME          PIN DOB       S R
----------------- - ---------- ---------- ---------- ---------- --------- - -
NAME_RACE_SEX_DOB 3 JAMES      TIBERIUS   THYME           57713 07-AUG-44 M W



That row matches on the pin of 57713 but does not match on the DOB.  Based on what I understand of your requirements, this is wrong.  It should probably only hit on the NAME_RACE_SEX piece, not the NAME_RACE_SEX_DOB.

Anyway, here's my latest attempt at a much more efficient rewrite.  It only accesses the table once, whereas yours queries it 4 times.

I cannot compare the results with your query because of the bug I believe yours has.
with params as (
	select
		'J' p_first_name,
		'DEW' p_last_name,
		'08071933' p_dob
	from dual
),
all_names as (
	select distinct
		max(case when
			em.FIRSTNAME LIKE p_first_name || '%' and (em.C_NAME LIKE p_last_name || '%' or em.C_SOUNDEX LIKE soundex(p_last_name) || '%')
		then 'Y'
		end) over(partition by i_spn_pin) name_match,
		case when em.d_date_of_birth = to_date(p_dob,'MMDDYYYY') then 'Y' else 'N' end AS dob_match,
		utl_match.jaro_winkler_similarity(upper(lastname), upper(p_last_name)) last_jaro_sim,
		utl_match.jaro_winkler_similarity(upper(firstname), upper(p_first_name)) first_jaro_sim,
		replace(em.FIRSTNAME, chr(39), '') AS FIRSTNAME,
		replace(em.MIDDLENAME, chr(39), '') AS MIDDLENAME,
		replace(em.LASTNAME, chr(39), '')    AS LASTNAME,
		em.i_spn_pin                          AS PIN,
		em.D_DATE_OF_BIRTH                    AS DOB,
		em.C_GENDER_CD                        AS SEX_CODE,
		em.C_RACE_CD                          AS RACE_CODE
	from  test_table em
		cross join params
	WHERE  em.i_spn_pin is not null
),
score_names as (
	select
		case when dob_match='Y' then 'NAME_RACE_SEX_DOB' else 'NAME_RACE_SEX' end AS CNTQRYTYPE,
		name_match,
		dob_match,
		firstname,
		middlename,
		lastname,
		pin,
		dob,
		sex_code,
		race_code,
		last_jaro_sim + first_jaro_sim name_score,
		row_number() over(partition by name_match order by last_jaro_sim + first_jaro_sim desc) name_rn
	from all_names
)
select *
from score_names
where
name_match='Y' and (name_rn <= 2000 or dob_match='Y')
order by name_score desc
;


The way the developer explained this to me:
this particular SQL you helped me with assists with determining if an inmate has ever been arrested in our county before.  If they've been arrested before, they have a unique number assigned to them (SPN). That SPN is reused each time the individual is arrested. If no match is found, the person gets a new SPN. Not a big deal.   No one is going to escape justice if a person gets a new SPN assigned.  

Periodically, SPN consolidations are done by a separate program to merge persons holding multiple SPNs down to one SPN.  This is a common practice, and it isn't part of locating or convicting a criminal either.

For the SQL you helped me with, I did a side by side comparison between the original code result set and the new result set. Each time, it was comparable.  

The developers also tested my changes to the SQL, reviewed the result set, and blessed the output.  I actually refused to put it into production when they approved it because I suspected they hadn't done enough to test it.  

But at some point when 5 developers and their manager say "We don't need to do anymore testing. Please just put it into production now",  unless there's a glaring mistake, I have to trust their judgment.  





May not be a "glaring" mistake but I feel there are still some decent issues with it.

The fact you have another program/team that needs to perform "SPN consolidations" shows it has issues.

Sure, there will always be a few that will slip through but fixing known issues with the code while you are updating the code would save someone a lot of time later.

I have a character flaw where I must make things as good as I can, and I will beat the proverbial dead horse until they are.

If you are happy with what I feel is flawed and bad SQL, it's your system.  If you want to continue to make it better, I'm here.
I value your opinion GREATLY, so I will look closer at this code and see if I can prove this bug.

I also admire your "character flaw" and wish I worked with more people like you.

It is complicated (well...for me anyway) because this SQL is only an example of what sort of code can get generated by the package that builds it.  It's based on the parameters passed to it by users and can result in many variations of the code you've helped me with.

The package has several sections where it reuses SQL code to generate the various SQL scripts.  Changing one part of it will break other parts.  

With no access to the application, I will need to try to run the package manually, pass various parameters to it, and see what SQL gets generated.  After figuring out whether the SQL is built efficiently (probably not), I'd change the structure and test it, then try to incorporate the changes into the package that builds it without breaking anything else.  It was AWFUL just trying to add that one CTE.

To do this would mean weeks and weeks of effort on something beyond what I was asked to do, is in someone else's backyard, and is not considered broken.   They asked me to limit the result set, which I did.  

 



>>They asked me to limit the result set, which I did.  

Did you?  Yes, it limits the rows returned, but I don't think it limits them the way you think it does.  Read up on using ROWNUM to limit rows with ordered data.

And see my comment about "NAME_RACE_SEX_DOB 3 JAMES      TIBERIUS   THYME" being returned.  I believe that shouldn't be there.

If it returns incorrect data, did you really meet the requirement?

>>The package has several sections where it reuses SQL code to generate the various SQL scripts.  Changing one part of it will break other parts.  

It sounds way overly complex.  I'm betting it can be greatly simplified.

As soon as I hear "SQL generating SQL", I cringe.  That is almost never necessary.  I say "almost" because I've done it.  It was painful to make myself do it but it performed better than lumping everything into a single SELECT and I was dealing with millions of rows.
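When dynamic SQL really is unavoidable, keeping the user criteria as bind variables at least keeps the generated variations sane and shareable. A hypothetical sketch (not the vendor package's actual code; table and column names are borrowed from the query above):

```sql
declare
  l_sql   varchar2(4000);
  l_count number;
begin
  -- Build the statement text once; optional predicates get appended,
  -- but the user-supplied values still travel as binds, not literals.
  l_sql := 'select count(*) from myschema.view_persons@DBLINK em'
        || ' where em.i_spn_pin is not null'
        || ' and em.firstname like :fn || ''%''';
  l_sql := l_sql || ' and em.c_name like :ln || ''%''';

  -- Binds are positional: :fn first, :ln second.
  execute immediate l_sql into l_count using 'P', 'SMITH';
  dbms_output.put_line(l_count);
end;
/
```

The payoff is that each distinct shape of the generated statement gets one shared cursor in the library cache instead of one hard parse per set of user inputs.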