Go Premium for a chance to win a PS4. Enter to Win

x
?
Solved

Left Join much slower than Join

Posted on 2008-10-25
13
Medium Priority
?
1,232 Views
Last Modified: 2012-05-05
SELECT * FROM table1
JOIN table2 USING (key2)
LEFT JOIN table3 JOIN table4 USING (key4) USING (key3)
WHERE table1.x=100
AND table2.y=200

All keys are primary keys and so indexed, and table1.x and table2.y are also indexed. VACUUM ANALYZE has been run.

This query takes a very slow 17 seconds. I can see from the query plan that this is due to sequence scans on table3 and table4.

If I change it from a LEFT JOIN to a JOIN then it takes under 0.5 seconds and it correctly uses the indexes on table3.key3 and table4.key4.

It's also fast (<0.5s) if I remove the join to table4, OR the join to table2 OR the last "AND table2.y=2".

So, as its any of a number of things that drastically speed it up, it seems its the complexity of the query rather than any lack of indexing that's causing the slow behaviour.

Any ideas why this would be ? Is there any way I can force it to use the indexes on table3 and table4 ? Am I missing something ?

Regards,
Mark
0
Comment
Question by:sl311
  • 6
  • 4
  • 2
13 Comments
 
LVL 143

Accepted Solution

by:
Guy Hengel [angelIII / a3] earned 750 total points
ID: 22802245
what about thiws:
SELECT * 
FROM table1 t1
JOIN table2 ON t1.key2 = t2.key2 AND t2.y=200
LEFT JOIN table3 t3
  ON t3.key3 = t2.key3
LEFT JOIN table4 t4
  ON t4.key4 = t3.key4
WHERE table1.x=100

Open in new window

0
 

Author Comment

by:sl311
ID: 22802298
Hi,

Well I have to say when I saw your response I was a little skeptical that it would even run because moving the order of the table4 join and changing it to a left join looked to me like it was in the wrong place, but I was wrong, and it returns the right results, but more importantly in <0.5s. The moving of t2.y=200 wasn't the issue because I moved that back to where it was originally and it was still fast.

So

LEFT JOIN table3 USING (key3)
LEFT JOIN table4 ON (table4.key4=table3.key4)

Is significantly (x34) faster than:

LEFT JOIN table3 JOIN table4 USING (key4) USING (key3)

The second one looks more correct to me because it's table3 and table4 that are really being joined and the USING clause is more compact  than ON, but I can't see a problem with the way you suggest.

Any ideas why there is such a leap in performance for what is in effect the same query ? Why would one use the indexes and not the other ?

Thanks,
Mark
0
 
LVL 143

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 22802317
I have to be honest, I don't know postgre sql at all. I only suggested the "what I know in mssql works fine" :)
still, even there, I don't know why a left join sometimes gives bad results, however the reason in your case is that the database FIRST did the join between table3 and table4, and then the join with the rest.
however, joining first table3 to the results of the other joins, and then table4 is eventually more effective, as less rows will have to be handled most likely
0
Veeam Disaster Recovery in Microsoft Azure

Veeam PN for Microsoft Azure is a FREE solution designed to simplify and automate the setup of a DR site in Microsoft Azure using lightweight software-defined networking. It reduces the complexity of VPN deployments and is designed for businesses of ALL sizes.

 

Author Comment

by:sl311
ID: 22802356
Thanks, well even if it joined table3 to table4 first why wouldn't it use the index on table4.key4 for that join rather than doing a Seq Scan on table4 first which is what the query plan is showing.

I use the format of "LEFT JOIN table JOIN table" a lot in various queries so I'd rather not change them all especially as it's still a little unclear why this speeds things up, so I'll leave this question open for a few days, and hopefully someone can shed some light on why the Postgres query optimiser chooses Seq Scans when indexes are available and evidently much faster.

Thanks for your help,
Mark
0
 
LVL 3

Expert Comment

by:brycen
ID: 22860181
For deeper help, post the results of "explain analyze" with your selects.  Usually analysis of that output makes it clear what the underlying issue is.  In particular this will show the number of rows "examined" but later filtered out (which is no doubt the root of the problem here).
0
 

Author Comment

by:sl311
ID: 22860746
Hi,

We think the underlying issue is that the shared_buffers is not big enough for the query planner to plan using indexing. We have to adjust the shared memory in the Linux kernel, before we can upgrade this in Postgres (otherwise it gives us an error). This is going to take a week or so before we can do it, so once we've done that I'll post back whether that was indeed the problem.

If it isn't then I will post the explain analyze output.

Thanks for your input.
Mark
0
 

Author Comment

by:sl311
ID: 23027366
Hi,

Well we made the change but it wasn't the shared_buffers at all - so back to the drawing board !

I think this article clears up some of the problem:
http://www.postgresql.org/docs/8.2/interactive/explicit-joins.html

Although I'm still confused - under the bit:
SELECT * FROM a LEFT JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);

it says:
"Therefore the planner has no choice of join order here: it must join B to C and then join A to that result."

I'm not sure I agree with that - why couldn't it join b to a to find those records from b that are to be returned, and then join c to the records from b just discovered ? That would mean 2 index scans, rather than an index scan and a sequence scan wouldn't it ? I can't see any ambiguity that would prevent this ?

Thanks,
Mark
0
 
LVL 3

Expert Comment

by:brycen
ID: 23032278
Also give the results of the following.  Too small a stats target can throw off the planner.



# show default_statistics_target;
+---------------------------+
| default_statistics_target |
+---------------------------+
| 50                        |
+---------------------------+

Open in new window

0
 

Author Comment

by:sl311
ID: 23035577
Thanks, it's currently the default of 10 - I'll up it to 50 this evening, do an Analyze and see what happens.
0
 
LVL 3

Expert Comment

by:brycen
ID: 23039323
"explain analyze"
Up it to 100
"explain analyze" again,
and compare the planner results.

Then you can back off from 1000 as you see fit.
0
 
LVL 3

Expert Comment

by:brycen
ID: 23039325
Make that 1,000.  Start at 1000, and see what difference it makes...
0
 

Author Closing Comment

by:sl311
ID: 31509912
Gave me a workaround - didn't tell me why the workaround works though.
0

Featured Post

New feature and membership benefit!

New feature! Upgrade and increase expert visibility of your issues with Priority Questions.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Best database to use for Maps is PostgreSQL. This is an open source database. Comes as a package with most Linux OS. For more info visit the following site: http://www.postgresql.org/ (http://www.postgresql.org/) This requires some add-o…
Many developers have database experience, but are new to PostgreSQL. It has some truly inspiring capabilities. I have several years' experience with Microsoft's SQL Server. When I began working with MySQL, I wanted a quick-reference to MySQL (htt…
Steps to create a PostgreSQL RDS instance in the Amazon cloud. We will cover some of the default settings and show how to connect to the instance once it is up and running.
this video summaries big data hadoop online training demo (http://onlineitguru.com/big-data-hadoop-online-training-placement.html) , and covers basics in big data hadoop .

916 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question