<

[Product update] Infrastructure Analysis Tool is now available with Business Accounts.Learn More

x

How to find rows that are in one table, but not another

Published on
16,907 Points
6,107 Views
3 Endorsements
Last Modified:
Awarded

Introduction


A commonly seen question in the database zones is "How do I find data that is in table A but not in table B?".  Anytime I see a question asked repeatedly, it seems to me that it would be a good topic for an article.

Setup


For the purposes of this article, we're going to assume the following setup (based upon a system I work with).

I will run examples in both SQL Server and Oracle and we'll examine the execution plans as well.  The purpose of this article is not to compare and contrast the two engines, just to give examples of how both work and how each database system handles each of the different options.

For the SQL Server system:

There is a table called SUPPLIER that contains 3584 rows. There is a unique index (the primary key) on a column called SUPPLIER_NO. There is another table called ORDERS that contains 975059 rows. There is a unique index (the primary key) on a column called PO_NO. There is also an index on a combination of some other columns (SUPPLIER_NO, ORDER_DATE, CREATED_DATE, PO_NO).

The purpose of our demonstration here will be to find all suppliers with whom we have never placed orders; that is, the supplier number exists on SUPPLIER but not on ORDERS.

We will also assume that all database statistics are up to date. For the SQL Server system, in order to make the execution plan clearer, I'll be forcing it to single thread and remove parallelism and also performing dbcc dropcleanbuffers before the query to flush the buffer cache.

Finally, each of these queries return exactly the same data. I'm not showing that for security reasons, you'll have to take my word for it :)

Demo Time!


There are 3 ways to do this that immediately come to mind that will work for both SQL Server and Oracle. There is a 4th way that effectively does the same thing in both Oracle and SQL Server but is implemented slightly differently.

The first way is by the use of the NOT IN clause:
dbcc dropcleanbuffers

select supplier_no from supplier
where supplier_no not in (select supplier_no from orders)
option (maxdop 1)

Open in new window

This runs for about three seconds and returns the following execution plan:

Execution Plan - NOT IN
The second way is to use the NOT EXISTS clause:
dbcc dropcleanbuffers

select supplier_no from supplier a
where not exists (select 1 from orders b
where a.supplier_no = b.supplier_no )
option (maxdop 1)

Open in new window

Again, this runs for about three seconds and returns the following execution plan:

Execution Plan - NOT EXISTS
The third way is to use an outer join and check for the existence of NULLS to determine the missing rows:
dbcc dropcleanbuffers

select a.supplier_no
from supplier a
left join orders b on a.supplier_no = b.supplier_no
where b.supplier_no is null
option (maxdop 1)

Open in new window

Sticking with what we've seen previously, this also runs for about three seconds and produces the following execution plan:

Execution Plan - LEFT JOIN
The fourth way is using a SQL Server specific extension, called EXCEPT.  This runs two queries with syntax similar to that of a UNION and returns all rows from the first query that aren't in the second.
dbcc dropcleanbuffers

select supplier_no from supplier
except
select supplier_no from orders
option (maxdop 1)

Open in new window

As before, the query takes about 3 seconds to run and returns the following execution plan:

Execution Plan - EXCEPT
Examining these execution plans, what do we see ?   NOT IN and NOT EXISTS appear to produce the same execution plan.  Using the LEFT JOIN option produces a pretty short and concise plan and EXCEPT seems to do a little more work.

I don't want to make the definitive claim here that that one is more efficient than the other, especially since they all run for about the same time, but the LEFT JOIN certainly seems to produce the cleanest plan of the lot.

Now we'll try the same queries with Oracle and see what the Oracle optimizer thinks of them.

On the Oracle system is slightly different in table size than the SQL Server System, but the same tables and indexes are present. SUPPLIER has 15421 rows and ORDERS has 422009 rows. Before each query, we'll assume that alter system flush buffer_cache has been run as well to clear the buffer cache. Execution plan output is generated via SET AUTOTRACE ON from sqlplus.

The first query (NOT IN):
SQL> select supplier_no from supplier
  2  where supplier_no not in (select supplier_no from orders);

Open in new window

Generates the following plan:
Execution Plan
----------------------------------------------------------
Plan hash value: 1005948720

-------------------------------------------------------------------------------------
| Id  | Operation             | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |             | 15421 |   210K|   572   (2)| 00:00:07 |
|*  1 |  HASH JOIN ANTI       |             | 15421 |   210K|   572   (2)| 00:00:07 |
|   2 |   INDEX FAST FULL SCAN| SUPPLIER_PK | 15421 |   105K|    10   (0)| 00:00:01 |
|   3 |   INDEX FAST FULL SCAN| ORDERS_IX1  |   422K|  2884K|   559   (1)| 00:00:07 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("SUPPLIER_NO"="SUPPLIER_NO")

Open in new window


The second query (NOT EXISTS):
SQL> select supplier_no from supplier a
  2  where not exists (select 1 from orders b
  3  where a.supplier_no = b.supplier_no );

Open in new window

Produces this plan:
Execution Plan
----------------------------------------------------------
Plan hash value: 1005948720

-------------------------------------------------------------------------------------
| Id  | Operation             | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |             | 15421 |   210K|   572   (2)| 00:00:07 |
|*  1 |  HASH JOIN ANTI       |             | 15421 |   210K|   572   (2)| 00:00:07 |
|   2 |   INDEX FAST FULL SCAN| SUPPLIER_PK | 15421 |   105K|    10   (0)| 00:00:01 |
|   3 |   INDEX FAST FULL SCAN| ORDERS_IX1  |   422K|  2884K|   559   (1)| 00:00:07 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("A"."SUPPLIER_NO"="B"."SUPPLIER_NO")

Open in new window

Our third query (LEFT JOIN):
SQL> select a.supplier_no
  2  from supplier a
  3  left join orders b on a.supplier_no = b.supplier_no
  4  where b.supplier_no is null;

Open in new window

Produces this plan:
Execution Plan
----------------------------------------------------------
Plan hash value: 1005948720

-------------------------------------------------------------------------------------
| Id  | Operation             | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |             | 15421 |   210K|   572   (2)| 00:00:07 |
|*  1 |  HASH JOIN ANTI       |             | 15421 |   210K|   572   (2)| 00:00:07 |
|   2 |   INDEX FAST FULL SCAN| SUPPLIER_PK | 15421 |   105K|    10   (0)| 00:00:01 |
|   3 |   INDEX FAST FULL SCAN| ORDERS_IX1  |   422K|  2884K|   559   (1)| 00:00:07 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("A"."SUPPLIER_NO"="B"."SUPPLIER_NO")

Open in new window

FInally, the fourth one, replaces the SQL Server extension EXCEPT with the Oracle extension MINUS - otherwise they are identical.
SQL> select supplier_no from supplier
  2  minus
  3  select supplier_no from orders;

Open in new window

Finally, this plan is generated:
Execution Plan
----------------------------------------------------------
Plan hash value: 116939239

----------------------------------------------------------------------------------------------
| Id  | Operation              | Name        | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |             | 15421 |  2990K|       |  2089   (2)| 00:00:26 |
|   1 |  MINUS                 |             |       |       |       |            |          |
|   2 |   SORT UNIQUE          |             | 15421 |   105K|   256K|    66   (4)| 00:00:01 |
|   3 |    INDEX FAST FULL SCAN| SUPPLIER_PK | 15421 |   105K|       |    10   (0)| 00:00:01 |
|   4 |   SORT UNIQUE          |             |   422K|  2884K|  6632K|  2023   (2)| 00:00:25 |
|   5 |    INDEX FAST FULL SCAN| ORDERS_IX1  |   422K|  2884K|       |   559   (1)| 00:00:07 |
----------------------------------------------------------------------------------------------

Open in new window


Interestingly, Oracle produces the exact same execution plan for all three of the first three queries. The final way is a much different plan with an increased execution time.

Extra Considerations


If is possible that one, or both of the tables may not have such convenient indexes.  In the event that indexes are not available for evaluation by the optimizer, you will end up with full table scans - this will of course alter the execution plans show above and the way that the data returned from each step is treated.

For SQL Server, instead of Nested Loops to match the data returned from the 2 queries, it turns into a Hash Join.

For Oracle, all it did was to change the lookup of one table to a full table scan but the join method remained the same, as Hash Join Anti.

Each optimizer adapted to no indexes in a different way.  If you have to do this kind of query, having available indexes will improve the performance of finding the result set.  No available indexes won't stop the results being returned, it may just take longer, depending upon the volume of the data you are dealing with.

Conclusion


As with most things database related, there may be many ways to perform a task. In order to find the most efficient way for whatever it is you're attempting to do, examine the options and look at the output from the tools provided to find the best way for your situation.
3
Comment
Author:Steve Wales
1 Comment
 

Administrative Comment

by:Eric AKA Netminder
Steve,

Congratulations; your article has been published, and has been selected as EE-Approved.

ericpete
Page Editor
0

Featured Post

Determine the Perfect Price for Your IT Services

Do you wonder if your IT business is truly profitable or if you should raise your prices? Learn how to calculate your overhead burden with our free interactive tool and use it to determine the right price for your IT services. Download your free eBook now!

Join & Write a Comment

This is a high-level webinar that covers the history of enterprise open source database use. It addresses both the advantages companies see in using open source database technologies, as well as the fears and reservations they might have. In this…
With just a little bit of  SQL and VBA, many doors open to cool things like synchronize a list box to display data relevant to other information on a form.  If you have never written code or looked at an SQL statement before, no problem! ...  give i…

Keep in touch with Experts Exchange

Tech news and trends delivered to your inbox every month