Link to home
Start Free TrialLog in
Avatar of marrowyung
marrowyung

asked on

speed up query using CTE and derived tables

Dear all,

other than set based query instead of while loop and cursor to speed up query, I read that CTE and derived tables can also do it, please explain the login, CTE and derived tables only designed for MS SQL and not for Oracle ?

set based is faster or CTE and derived tables ?
Avatar of Jim Horn
Jim Horn
Flag of United States of America image

You'll have to be more specific as to what you're asking, as I'm reading this as 'Can CTE's work'?, and the answer is yes.  Here's a couple of articles I have that use CTE's

SQL Server Delete Duplicate Rows Solutions (scroll down to DELETE #1 HARD DELETE) uses a single CTE to rank order a set of data based on defined columns, and delete all rows except for the first of another defined column.

T-SQL: Identify bad dates in a time series uses a big honkin' five nested CTE to slice through a timer series and identify gaps, overlaps, and duplicate rows.
ASKER CERTIFIED SOLUTION
Avatar of Kent Olsen
Kent Olsen
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
>>"set based is faster or CTE and derived tables"

Both CTE's and derived tables use "set based" SQL, so that sentence isn't correct.

a "derived table" is like this:

select * from table1
inner join (

  select customerid, count(*) from orders group by customerid

  )  derived_table on  table1.customerid = derived_table.customerid

a CTE is like this:

with CTE as (

  select customerid, count(*) from orders group by customerid

  )
select * from table1
inner join CTE  on  table1.customerid = CTE.customerid

So what you are asking has no specific answer I'm afraid.
Avatar of marrowyung
marrowyung

ASKER

Jim Horn,

"You'll have to be more specific as to what you're asking, as I'm reading this as 'Can CTE's work'?, and the answer is yes.  Here's a couple of articles I have that use CTE's"

I just want to study if CTE can give query a better performance than cursor and while loop, I finish study the set based query and I know it is better and I am not sure if CTE and derived tables can help on improving query performance too ? is it even better than set-based query ? or it just an alternative to set-based query in case in some situation set based query can't be use.

Kdo,

good to see you again here:

"A derived table is any temporary table that SQL creates on the fly to hold temporary results.  If you join two tables and select columns from the joined results, you're actually selecting data from a derived table."

store in tempDB for example ? like # table and ##table ?

"The nature of these tables make it impossible to index them, so writing a query to minimize the size of the intermediate results that are stored in derived tables can have a huge impact on performance."

so derived table is not good from your point of view ? # and ##table can be index however. a lot of developer program using that kind of table.

"Regarding the CTE, the theory is that each item in the CTE results in a derived table.  If you reference that item more than once there can be a performance boost as SQL should save the results for use later in the query.  The data is read and generated only once but can be referenced many times. "

so it just like SP that it is in the cache and therefore SQL will look at procedure cache every time once the CTE rans and if derived table still there?

" There are cases where the data cannot be preserved between calls, but for the most part they can."

just like #table and ##table ?

Paul Maxwell,

"Both CTE's and derived tables use "set based" SQL, so that sentence isn't correct.
"

as I am not a developer but would like to tell my developer on the best way to program T-SQL, so that's why I ask and as I said, that's why I just done the RBAR study and find way to convert it to set-based.  now I need more help on this as I read some notes that set-based query is not the only one method to improve the query performance, and I heard that CTE and derived table can help, please share some business case and script sample on why that is better.

"a CTE is like this:

with CTE as (

  select customerid, count(*) from orders group by customerid

  )
select * from table1
inner join CTE  on  table1.customerid = CTE.customerid"

so when we do this the CTE will create a derived table on RAM so that any query can refer that without read from disk again ?
one thing is I am also studying Hidden RBAR: Triangular Joins, so you know what I mean now !
Your question is so broad it just does not have a single/simple answer. SQL performance tuning does have some "general approaches" but each query has its own set of issues to deal with (the tables, the indices, the data itself and the requirements it aims to satisfy - for example).

Let me illustrate.

An incredibly common - but VERY BAD - technique used by some query writers is to just add "select distinct" at the top of a massive query. HERE, in this situation, using "derived tables" can be an enormous advantage because we can use those to remove unwanted repetition of data earlier in the query. But, not every poorly performing query starting with "select distinct" can be solved by just a derived table or two, there might be missing indexes (for example).

CTE's
To my eyes, I see them used FAR TOO OFTEN. In my earlier comment above I provided a derived table and CTE that were exactly equivalent. I would (me, myself, personally) always use the derived table in preference to CTE when they exactly equivalent.

HOWEVER I definitely would use a CTE IF:
1. there is "repitition" involved in the query, and I can avoid repeating a subquery. CTEs are brilliant for this because you can re-use them (which you cannot do by traditional derived tables).
2. there is "recursion" needed to solve the problem. A "recursive CTE" can do things that are difficult to achieve by other methods.

So; both "derived table" and CTE are extremely useful, but choosing which depends on the specific needs of a query.
"HOWEVER I definitely would use a CTE IF:
1. there is "repitition" involved in the query, and I can avoid repeating a subquery. CTEs are brilliant for this because you can re-use them (which you cannot do by traditional derived tables).
2. there is "recursion" needed to solve the problem. A "recursive CTE" can do things that are difficult to achieve by other methods."

so the benefit here is CTE in memory but not in disk, so keep using it make SQL server make use the same thing in memory and therefore faster ?

" using "derived tables" can be an enormous advantage because we can use those to remove unwanted repetition of data earlier in the query"

how derived table can be better than select distinct on removing repeated data ?
I gave the reason: because the repitition is avoided EARLIER. I have an article on this topic, look for  "Select distinct is retuning duplicates".
Hi marrowyung,

The programmer has no direct control over derived tables.  That's a SQL internal construct that cannot accessed directly.

SELECT *
FROM table_1
INNER JOIN table_2
  ON table_1.id = table_2.id
LEFT JOIN table_3 ON
  table_1.id = table_3.id

Open in new window


To complete that query, the RDBMS will first join table_1 and table_2.  The results of that join have to be stored some place, and that place is a derived table.  It behaves exactly like a user-defined table in that it has columns, data types, stores row data, etc.  But the table's life is the duration of the query.  When the query ends, the RDBMS discards the derived table and its contents.  They are an integral part of SQL and in fact make complex queries possible.  In general, user queries are faster when the size of the derived tables are kept to a minimum.  

The first join in that example is an inner join of two tables.  Assuming that both tables are very large, if the result of the join is modest (perhaps 1,000 rows) and the two tables are properly indexed, the results that are stored in the derived table can be produced almost instantaneously.  The rows of table_3 are then joined to the 1,000 rows.  Total run time is less than a second.

Now let's changed the order of the join operations, joining table_1 and table_3 first.  The results of that join will be at least as many rows as are in table_1.  If it contained 10,000,000 rows, 10,000,000 rows (plus any rows generated by the Cartesian product of duplicated join keys) are written to the derived table.  Inner joining table_2 to that result will filter the final result down to the 1,000 rows.  But 10,000,000 intermediate results are generated to get there.  Because the derived table cannot be indexed, the final inner join will do a full table scan of the 10,000,000 rows to inner join them to table_2.

Derived tables are our friends.  SQL won't work without them.  But knowing how they're used can go a long way toward writing better SQL!

Now let's put that query into a CTE.
WITH someCTE (id, postdate, text)
AS
(
  SELECT t1.id, t2.postdate, t3.text
  FROM table_1
  INNER JOIN table_2
    ON table_1.id = table_2.id
  LEFT JOIN table_3 ON
    table_1.id = table_3.id
)
SELECT someCTE.*, table_4.*
FROM someCTE
LEFT JOIN table_4
  ON table_4.postdate = someCTE.postdate
WHERE table_4.text in (SELECT substr (someCTE.text, 1, 3 FROM someCTE)

Open in new window


That's a nonsensical query, but it shows that the results of the subquery in someCTE are used multiple times in the query.  The final rows are produced from the data returned by someCTE, the join uses different data than the selection, and the filter uses still different data.  The query could be written several different ways, but the key point is that the sub-query in someCTE is executed once.  The results are used multiple times.  And where are the results someCTE stored?  In a derived table!  :)

It's fairly common for a CTE to have multiple subqueries.

WITH cte1 AS
(
  SELECT * FROM ....
),
cte2 AS
(
  SELECT * FROM ....
),
cte3 AS
(
  SELECT * FROM ....
)
...

That can be a very effective way to simplify large complex queries.  


Good Luck!
Kent
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Paul Maxwell,

"I gave the reason: because the repitition is avoided EARLIER."

tks. yeah I can see it but need more understanding and by your query, CTE and derived tables looks the same and I am not sure why you prefer derived tables, they both in RAM but not in disk and also the whole set of data already load in memory so they are not going to load again and again, right? global data you mean ? but #table and ##table all in tempdb, therefore, still on disk?

"derived table" = temporary table hidden and only operate by SQL server and we can't see it  ?

""Select distinct is retuning duplicates"."
sorry can't search that out from EE here

"A "derived table" is essentially a statement-local temporary table created by means of a subquery in the FROM clause of a SQL SELECT statement. It exists only in memory and behaves like a standard view or table."

in memory.... so it is not on tempdb at all but the problem of the derived table is we just can't create index on top of it ? but this statement is from oracle and probably not for MS SQL ?

"Line 4 below is the subquery that results in a "derived table", accessed as if it was a real table at line 6."

so you mean we can index it therefore ?

KDO,

" That's a SQL internal construct that cannot accessed directly."

tks, got it. then no one should take care about it at all as SQL server will take care of it automatically and we only focus on CTE, right?

"To complete that query, the RDBMS will first join table_1 and table_2.  The results of that join have to be stored some place, and that place is a derived table. "

yeah, I am starting to think why people still use #table and ##table? ## or # runs in tempdb and is faster?

"They are an integral part of SQL and in fact make complex queries possible. "

it will store in tempdb I think, right ?

"In general, user queries are faster when the size of the derived tables are kept to a minimum.  "

then I don't understand this, the table has to be larger, right?

" But 10,000,000 intermediate results are generated to get there."

so you are assuming that table 3 or table 1 is not indexed, right? I am not sure why table 1 inner join table 2 is faster than table 1 join table 3.

"Because the derived table cannot be indexed, the final inner join will do a full table scan of the 10,000,000 rows to inner join them to table_2.
"

so derived table is not a normal table as it can't be index, so that's why developer from time to time use #table and ##table as both can be index and for large amount of data derived table is not good but #table and ##table ?

"The results are used multiple times.  And where are the results someCTE stored?  In a derived table!  :)"

yeah  ! internally !!  year it is our good friend BUT the point is, SQL server will create this for us automatically and we don't care about that, right?

please share some LINK for me to learn the real world use of CTE, so I think this is the one thing in this topic need to focus by human but not by machine/SQL server ,right?

"It's fairly common for a CTE to have multiple subqueries.

WITH cte1 AS
(
  SELECT * FROM ....
),
cte2 AS
(
  SELECT * FROM ....
),
cte3 AS
(
  SELECT * FROM ....
)
..."

is it a "triangular joins" I post about in this question: https://www.experts-exchange.com/questions/28707177/triangular-joins.html,  sub query with calculation which we need to avoid ?

so solve "triangular joins" by CTE is a good point ?


BTW, to solve the problem I post https://www.experts-exchange.com/questions/28707177/triangular-joins.html, do you think CTE (good for subquery>) and /or derived table can help ?
Select Distinct is returning duplicates ...

>>by your query, CTE and derived tables looks the same
that was deliberate and designed to stress that a major difference is in how they are referenced.
A CTE isn't much different to a derived table and I specifically identified when I prefer to use a CTE.

why do I prefer derived tables?
CTEs use resources that simple derived tables don't - this may affect Oracle more than SQL Server - but it is a practice I prefer to follow. I did stress this is a personal view.

>>"so you mean we can index it therefore ?""
no, not correct, not at all, incorrect, wrong
how on earth did you jump to that conclusion?
never at any time did I say you could index a derived table - because you cannot

they are "transitory"/ "ephemeral" things that you cannot touch using "alter"
on this I absolutely agree with KDO: the programmer has no ability to alter a derived table once it exists.

A derived table can be in memory, or it can be spooled to disk, this will depend on size of the derived table, available memory resources and/or the execution plan that is being followed for a query.

I think you should worry much less about the physical implementation (memory, tempdb etc) it is SQL Server's job to manage what resources it needs for a derived table or CTE. Once the need for them has passed those resources are released.
Hi marrowyung,

>> then no one should take care about it at all as SQL server will take care of it automatically and we only focus on CTE, right?

Paul and I are saying the same thing, albeit with different language.  You can write your SQL to manage how any derived tables are used and select data from the last derived table in the query/sub-query) but everything in between is out of your control.  You need to be aware of them and how they affect query performance.  Most explain plans will give you a good idea where derived tables are create and used in a query, and an estimate of you much work the DMBS must do to populate it.

>> I am starting to think why people still use #table and ##table? ## or # runs in tempdb and is faster?

People usually use temp tables for convenience.  If they suspect that a large number of rows will be needed in subsequent steps, they can index the temp table and often get a noticeable performance boost.  Others (often mistakenly) simplify their SQL by using multiple queries to populate the temp table when a simple UNION ALL will suffice.

>> "In general, user queries are faster when the size of the derived tables are kept to a minimum.  "
>>    then I don't understand this, the table has to be larger, right?

Like any table, its size is primarily recognized as the number of rows( and is affected by the average length of a row).  To write the best SQL, the number of rows in a derived should always be a concern.  The derived table cannot be indexed so access to the data can only be done using a full table scan.  It's fairly easy to write a query that joins two derived tables.  Joining two large derived tables could result in a very slow query.

>> so you are assuming that table 3 or table 1 is not indexed, right? I am not sure why table 1 inner join table 2 is faster than table 1 join table 3.

Even with proper indexing, join order can make a huge difference in performance.  If table_1 and table_2 have only a few common key values, and table_1 and table_3 have a large number of common key values, then join order is a significant factor in query performance.   Joining table_1 and table_2 will result in a small derived table that will be scanned once when table_3 is joined.  The INNER JOIN filters out most of the unwanted rows during the first join.  Joining table_1 and table_3 first will return at least as many rows as are in table_1.  10,000,000+ rows are in the derived table and must be scanned to join them to table_2.  The INNER JOIN when done in the second step will filter out the extra rows that the OUTER JOIN created, but 10,000,000 rows are created and processed to get there.

>> so derived table is not a normal table as it can't be index, so that's why developer from time to time use #table and ##table as both can be index and for large amount of data derived table is not good but #table and ##table ?

Mostly correct.  Depending on the data and how it's used, the indexing operation may (emphasis on may) take more time than simply accessing the data via a full table scan.

>> please share some LINK for me to learn the real world use of CTE, so I think this is the one thing in this topic need to focus by human but not by machine/SQL server ,right?

They're everywhere.  The CTE structure will also allow you to write recursive SQL.

  https://www.experts-exchange.com/articles/3618/Recursive-SQL-in-DB2-Converting-rows-to-columns.html
  https://www.experts-exchange.com/articles/3629/Recursive-SQL-in-DB2-Converting-Columns-to-Rows.html

>> is it a "triangular joins"

Not necessarily.  Each of the sub-queries is its own entity.  The relationship between them depends on the needs of the SQL.


Kent
Paul Maxwell,

"CTEs use resources that simple derived tables don't - this may affect Oracle more than SQL Server - but it is a practice I prefer to follow. I did stress this is a personal view."

yeah, tks. but what I want to find out is, even it is a personal view you must have your personal smart reason, that's why I ask.

"I think you should worry much less about the physical implementation (memory, tempdb etc) it is SQL Server's job to manage what resources it needs for a derived table or CTE. Once the need for them has passed those resources are released."

yeah, I am sorry about that as you said derived table can't be index and I am not sure why you like to use it rather than CTE, so I ignore about that. from my learn, SQL always like to see index even it will do index scan and still not good.

I respect you all on this.

KDO,

" The derived table cannot be indexed so access to the data can only be done using a full table scan.  It's fairly easy to write a query that joins two derived tables.  Joining two large derived tables could result in a very slow query."

tks and it is a good explaination, but we can't control the use of derived table anyway, right ? so why simply use # and ## table ?

but I don't understand as SQL server internally create the derived table on the fly, then how we can join 2x derived table ourselves.

"Even with proper indexing, join order can make a huge difference in performance"

tks so much, yeah you are right!


" If table_1 and table_2 have only a few common key values,"
.. " Joining table_1 and table_2 will result in a small derived table"

the data size of the join should not related by the common key values, right?

"Joining table_1 and table_2 will result in a small derived table that will be scanned once when table_3 is joined.  The INNER JOIN filters out most of the unwanted rows during the first join.  Joining table_1 and table_3 first will return at least as many rows as are in table_1."

so if there are more than 3 queries for 2 x join statement, we should put the queries can result on smaller data set on the first join statement (front) and let the large dataset by the second join statement on the last join ?

"Mostly correct.  Depending on the data and how it's used, the indexing operation may (emphasis on may) take more time than simply accessing the data via a full table scan."

Yeah, if the derived table has row number like 1000, then full table scan can be better instead of spending time to create index.

" The CTE structure will also allow you to write recursive SQL."

yeah, I am finding Video for it too

please also join the discussion over there:

https://www.experts-exchange.com/questions/28707177/triangular-joins.html
it is really very simple, without derived tables SQL would cease to work at all.

You only need to concern yourself with the indexing of tables NOT derived tables.
note:  # and ## ARE tables (not "derived tables")

I do not wish repeat my reasons for preferring derived tables over CTEs, I feel sure I cannot make myself more clear on this point. It stems from my (limited) knowledge of Oracle, scarce shared memory is consumed by CTEs that is not used by a derived table. With the exceptions of "recursion" and "re-use" there is no advantage in using a CTE, so why bother. I also disagree that they make code easier to read, which is sometimes used as a excuse for using CTEs.

So, please, this is my final statement on my preference for derived tables over CTEs.
"it is really very simple, without derived tables SQL would cease to work at all.

You only need to concern yourself with the indexing of tables NOT derived tables.
note:  # and ## ARE tables (not "derived tables")"

excellent, and I really think you can be a good teacher .. you are T-SQL developer for living ? I will read more other material on this area.

sorry ask long enough as just want to understand more optimize T-SQL developement/tuning.
tks all