Link to home
Create AccountLog in
Avatar of Starr Duskk
Starr DuskkFlag for United States of America

asked on

Converting IN to EXISTS and NOT IN to NOT EXISTS

Can someone help me convert this to an EXISTS, NOT EXISTS clause? Someone said an EXISTS performs better than an IN. (I also read this is no longer an issue with SQLServer 2008 optimization, but I want to try it both ways and see if it performs better or not):

 AND 
 ( 
TrainingCourse.TrainingCourseId IN (Select TrainingCourseId from TrainingCourseUnit WHERE UnitId IN (Select UnitId from EmployeeAccessUnitView where EmployeeId = Employee.EmployeeId))
 OR 
TrainingCourse.TrainingCourseId NOT IN (Select TrainingCourseId from TrainingCourseUnit)
 )

Open in new window


Please help me word it correctly.

(The TrainingCourseUnit table is NOT used in my parent query because it has a one to many relationship and I don't want to reproduce all those rows. I just need to include the Course if it is in the TrainingCourseUnit table or if units are assigned to that course in the TrainingCourseUnit  table.)

Thanks!
ASKER CERTIFIED SOLUTION
Avatar of edtechdba
edtechdba
Flag of United States of America image

Link to home
membership
Create a free account to see this answer
Signing up is free and takes 30 seconds. No credit card required.
See answer
SOLUTION
Avatar of Qlemo
Qlemo
Flag of Germany image

Link to home
membership
Create a free account to see this answer
Signing up is free and takes 30 seconds. No credit card required.
Avatar of Starr Duskk

ASKER

Just FYI... TCU.EmployeeID doesn't exist. That would be Employee.EmployeeId. Will that mess anything up? thanks!

AND 
(
  EXISTS (
    select *
      from TrainingCourseUnit TCU
      join EmployeeAccessUnitView EAUV
        on EAUV.UnitID     = TCU.UnitID
       and EAUV.EmployeeId = Employee.EmployeeId 
    where TrainingCourse.TrainingCourseId = TCU.TrainingCourseId 
  ) OR NOT EXISTS (
    select * from TrainingCourseUnit TCU
    where TrainingCourse.TrainingCourseId = TCU.TrainingCourseId
  )
) 

Open in new window

I tried that change and my query using IN/NOT IN ran consistently one minute, but the new one using EXIST took 1.15 seconds. So it would 15 seconds longer to run. Weird huh?
SOLUTION
Link to home
membership
Create a free account to see this answer
Signing up is free and takes 30 seconds. No credit card required.
SOLUTION
Link to home
membership
Create a free account to see this answer
Signing up is free and takes 30 seconds. No credit card required.
PortletPaul,

select * versus select 1 has been an issue in ancient releases of DBMS, but isn't anymore (since several years). MySQL and such might be an exception, though. MSSQL does not care anymore. A "*" as selector is better than using any field, as that might force less advanced optimizers to choose the wrong execution plan.
Oracle, MSSQL, and several other DBMS remove any column or expression in a subselect following (NOT) EXISTS, and the official use is EXISTS (SELECT * ...) indeed.
@Qlemo
I am an ancient of course. I don't doubt you (so re-running isn't likely to produce a difference) and it's good to know optimizers have caught up with this.

select 1 isn't a field, but a constant, and as such should avoid any IO etc.

be interesting to know if * is demonstrably better than a constant here - do you know if such evidence exists?
No, there should be no difference whether you choose a literal or *. I wasn't after pointing out your change would make it worse - it shouldn't have any effect, that's it.
great - thanks.
Sorry, yes 1 minute 15 seconds and that was changing only ONE Of the IN/NOT INs as a test. I will do it with my huge query on everything and then let you know the results.

Also, I had read to use "Select 1" instead of "select *" and didn't notice any difference in results. The query I'm testing typically runs 1.5 minutes, but I have another that runs around 4-9 minutes depending.

I did add more memory to the machine and CPU, which with all my posts yesterday and today trying to get better performance, I noticed no one suggested that. It was sitting at 3.84 Memory usage constantly, with 4 max. So now I have 8 and will likely go to 15 since that helped. That gave it some advantage but not a whole lot.

Anyway, I will change all my IN/NOT IN and put them in a method so I can change them back and forth and test both ways and see which is best and I'll do the same with the select 1 versus select *.

I also run the test about 5 times each because I know it can vary.

okay, thanks all!
Remember: You need to make the EXISTS/IN distinction for each subselect individually, and with a typical data collection.
concentrate on an current use of IN() where the contents may be large (or might grow to large)

and it is unlikely that "select 1" will make substantial (if any) difference so it seems.
I have added exists for everything. Here is one example?

 AND 
 ( 
 EXISTS ( 
 select 1 
 from TrainingCourseJobType 
  where TrainingCourse.TrainingCourseId = TrainingCourseJobType.TrainingCourseId 
  and TrainingCourseJobType.JobTypeId = Employee.JobTypeId 
 ) OR NOT EXISTS ( 
 select 1 from TrainingCourseJobType 
 where TrainingCourse.TrainingCourseId = TrainingCourseJobType.TrainingCourseId 
 ) 
 ) 

Open in new window


Here are my results in seconds or minutes an seconds for the same two queries using IN and NOT IN versus EXISTS and NOT EXISTS. The IN always beats the EXISTS.

IN                  EXISTS
51                     51
1:20                 1:36
48                    51
48                    49
1:12               1:29
47                    49
I shall go and find a crow to eat...
v.interesting results, thanks for your dedication to testing

any chance of a .sqlplan file for each approach?
>>It took 4:29 to run this query on a client with a lot of employees and fruit. Too long. It doesn't take this long on a client with less employees and fruit.
this may be a silly question, but have these comparison tests been conducted "with a lot of employees and fruit"? The differential between use of IN() -v- EXISTS() will be seen most clearly when there is significant volume in those subqueries. For low volume IN() is OK.
yes, these tests are using a lot of employees and fruit because that's where the problem lies. Not millions, but definitely thousands as opposed to hundreds. I don't see us ever hitting the millions on one client.

>>any chance of a .sqlplan file for each approach?

I will do that.
Here is the first plan. Change the name to:

Plan_AnonymizedExists.queryanalysis

This was created with SQLSentry.
Plan-AnonymizedExists.txt
Here is the second plan, using IN.

Change the extension to "queryanalysis"

It is a SQL Sentry file.
Plan-Anonymized-IN.txt
BTW, for these queries, the results return 121,221 records for total tests taken.

I ran the two queries at the same time right now and they both completed in 48 seconds each.

(But I have more detailed queries that are taking 4-6 minutes each. They are similar but return additional data from other tables that this one does not, so I don't test with them because it takes too long to test.)

thanks.
:( unable to open those files using SQL Sentry (root element missing)
What other file does it need? Or how else should I build it?

thanks.
Thanks all. closing this out. I may open a new one later with the sql plans in there. thanks!
Thanks, hope you are making progress on this. Cheers, Paul.
Naww.... I have to come back to it later. Got a web service to do now. thanks!