Performance, subselect vs. Join

bahlsten
bahlsten used Ask the Experts™
on
Hi!

I want to know if there is a difference between the two following queries.
If it is, is there a performance penalty?
Does this depend on the optimizer mode (cost, rule) ?
Does Oracle cashe the result from the subselect?
In my example there are 5,6 records on average type/groupid with about 10 groups.
The product table however, is large.

-- First Query
SELECT 'x'
   FROM product_type t, product p
WHERE t.groupid = 50
AND p.typeid = t.id;


-- Second query
SELECT 'x'
FROM product p
WHERE typeid IN ( SELECT t.id FROM product_type t WHERE t.groupid = 50);


/ Ronnie
Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®

Commented:
There could be depends on the size of the tables.  Guideline to using IN (vs Exists)

If the outer query is small and the inner query is big then use EXISTS,  if the outer query is big and the inner query is small user IN, if both are small or both are big, it will depend on the indexes and should be tested (set autotrace on)

Which Optimizer you use can make a difference on performance

Commented:
The join query leaves it to the optimizer to determine the access path: it could drive from either table, or perform a hash join or whatever.  

The IN query forces the optimizer to drive from the product_type table.  If the number of product rows matching the subquery is small compared to the total number of product rows, this will perform well.  Otherwise, EXISTS would be better.

If you cannot reliably predict whether IN or EXISTS will be better, then the join query is preferable, as the optimizer can use statistics to generate the best plan.
Guy Hengel [angelIII / a3]Billing Engineer
Most Valuable Expert 2014
Top Expert 2009

Commented:
The main question is what indexes exists (on both tables), and how the values of groupid and typeid are distributed. using IN or JOIN should NOT change the (performance) results, as oracle engine first checks the costs for the sql and should find the same execution plan!

The second query is better readable for human...

CHeers
Should you be charging more for IT Services?

Do you wonder if your IT business is truly profitable or if you should raise your prices? Learn how to calculate your overhead burden using our free interactive tool and use it to determine the right price for your IT services. Start calculating Now!

Author

Commented:
Thank you for all your answers so far!
However, it is now clear to me that I didn't write it all down :)

This is the case:
The query in question must execute as fast as possible.
It's on a performance critical path, so to say.
It would obviously be faster to enumerate the ids from product_type like this:

-- Hardcoded example
select ...
from product where typeid IN(1,2,3,4,5);

The ids are not static, so I cannot hardcode the values. The only static value I
can rely on is [groupid = 50], which will never change (yeah right...).
Therefore I need the records from [product_type] whose [groupid = 50].

My main question is: If I write the query using the IN( select...), would Oracle
recognize it as a possibility to lookup the list of ids from product_type and then "rewrite"
my query as my hardcoded example above and thus querying the product_type only once.

The following index exist:
product id PK
product typeid FK[product_type]

product_type id PK
product_type groupid FK[group_type.id]

As for the data, the nr of records in each table is something like this:
product: 2,000,000
product_type: 100
group_type: 15

The names of the tables are alias to protect their anonymity ;)
Also, the example is simplified bogus :)

Another possibility that comes to mind is:

-- This is pseudocode for what the Java program might do...
sql = "select id from product_type where groupid = 50";
fetch all id into an array;
var csv = constructCSVFromArray( array );

sqlProduct = "select 'x' from product where typeid IN(" + csvId + ")";

Would this be an even better approach?
The product query might consume so much time it doesn't matter if I execute two queries instead?

Again, thanks for taking your time!

/ Ronnie

Commented:
if it is an equijoin, it's the fastest way to do the job, but the order of columns in the select clause and the order of where
clause conditions may affect performance dramatically. Make sure to have proper indices on both tables. (typeid for product
and (typeid, groupid) on product_type)

The fastest solution to such problem is to deflat the data model and store the necessary product type data
directly with product to query one table instead of two.

Commented:
> Would this be an even better approach?

No, a simple SQL statement will be faster than this procedural approach.

Make sure you have appropriate indexes, analyze the tables, then benchmark to see whether IN or join version is faster (or maybe they are the same).

Do you actually have a performance problem?

Author

Commented:
andrewst:
As I said, the example is a bit simplified...
The real data represents other things than products and the actual query joins several tables.
I thought that the number of tables in the from clause would somehow affect the performance and
therefore desided to investigate if removing a couple of "type tables" would increase the speed.

The actual query uses about 8 or 9 tables in the from clause and 3 of them would be "type tables",
that is a table that only contains an id and a name. The different types within a group all have the same meaning
whereas the different groups have different meanings.
I have allready applied all the other standard optimizations such as reordering the tables in from clause
(driving table), reordering the AND/OR conditions in the cheapest order etc

Anyway, you got the points!

/ Ronnie

Commented:
> I have allready applied all the other standard optimizations such as reordering the tables in from clause
(driving table), reordering the AND/OR conditions in the cheapest order etc

Are you using RBO then?  Because messing around with the ordering of the select statement has no influence on CBO.  CBO requires up to date statistics from analyze table or DBMS_STATS and suitable indexes to use as required.

Author

Commented:
I don't know, but i guess has to be RBO since it affects performance...

How often do one have to analyze table?
Say, a table that has 5 or 6 indexes and gets around 100 inserts, 500 updates and no deletes in one day,
but a total of 400,000 new records/ year (due to yearly bulk inserts).

Commented:
Always analyze after a bulk insert.  In between, analyze often enough that the stats are never too far out.  If the table has a million records but the stats say 950,000 then they are reasonable; if they say 100,000 they are worthless.  You would probably want to set up a scheduled job to re-analyze tables every week, month or whatever seems appropriate.

If you have no stats, the RBO is used.  Since RBO is prehistoric, switching to CBO is likely to improve performance.  There are some query optimizations that simply aren't available under RBO.

Author

Commented:
Hmm... this is news to me. I use version 8...
Maybe I've got old books about Oracle performance tuning :)
Where can I read more about the difference between the RBO and CBO?
Maybe we achive better result without tweaking the queries by hand?

I'll tell the local unixguru to set schedule a job that analyzes the tables...
Thanks!
Really thank you :-)

Commented:
Try the Oracle Performance Tuning Guide:

http://technet.oracle.com/docs/products/oracle9i/doc_library/release2/server.920/a96533/optimops.htm#721

That is for 9i, but the CBO has been around since Oracle 7 if I remember correctly - certainly, Oracle 8 has it.  If you search around that site you should be able to find the correct documentation for your version.  (Registration is free).

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial