T-SQL Left Join Query Optimization

I have a query (see below) that performs quite well when joined with one table, but its execution time grows dramatically as I add columns from an additional table. I'm sure I must be missing something simple in how I'm thinking about the way this query executes, but looking at the execution plan, I'm not sure exactly what it is.

Q: How best to optimize the following query to run against two tables, both containing 20 Hz data?

DECLARE @num_ms int
DECLARE @startDate datetime
DECLARE @endDate datetime

SET @startDate = '2012-03-07 18:43:00.000'
SET @endDate = '2012-03-08 18:43:00.000' --GETDATE();
SET @num_ms = 60000 

;WITH DatetimeSeq(DateVal) as
(
    SELECT @startDate AS DateVal
        UNION ALL
    SELECT DATEADD(MILLISECOND, @num_ms, DateVal)
        FROM DatetimeSeq
        WHERE DateVal < @endDate
)

SELECT	 t.DateVal
		,CAST(AVG(b.strain_1) AS Decimal(6,3)) AS s1_bar
    	,CAST(AVG(b.strain_2) AS Decimal(6,3)) AS s2_bar
    	,CAST(AVG(b.strain_3) AS Decimal(5,2)) AS s3_bar
    	,CAST(AVG(b.strain_4) AS Decimal(5,2)) AS s4_bar
    	,CAST(AVG(b.strain_5) AS Decimal(5,2)) AS s5_bar
    	,CAST(AVG(b.strain_6) AS Decimal(5,2)) AS s6_bar
    	,CAST(AVG(b.strain_7) AS Decimal(5,2)) AS s7_bar
    	,CAST(AVG(b.strain_8) AS Decimal(5,2)) AS s8_bar
    	,CAST(AVG(b.strain_9) AS Decimal(5,2)) AS s9_bar
		,CAST(AVG(a.strain_10) AS Decimal(6,3)) AS s10_bar
    	,CAST(AVG(a.strain_11) AS Decimal(6,3)) AS s11_bar
    	,CAST(AVG(a.strain_12) AS Decimal(6,3)) AS s12_bar
    	,CAST(AVG(a.strain_13) AS Decimal(5,2)) AS s13_bar
    	,CAST(AVG(a.strain_14) AS Decimal(5,2)) AS s14_bar
    	,CAST(AVG(a.strain_15) AS Decimal(5,2)) AS s15_bar
    	,CAST(AVG(a.strain_16) AS Decimal(5,2)) AS s16_bar
    	,CAST(AVG(a.strain_17) AS Decimal(5,2)) AS s17_bar
    	,CAST(AVG(a.strain_18) AS Decimal(5,2)) AS s18_bar
    	,CAST(AVG(a.strain_19) AS Decimal(5,2)) AS s19_bar
    	,CAST(AVG(b.strain_20) AS Decimal(5,2)) AS s20_bar
FROM DatetimeSeq t WITH(NOLOCK)
LEFT OUTER JOIN table1 a WITH(NOLOCK)
ON a.timestamp >= t.dateval AND a.timestamp < DATEADD(MILLISECOND,@num_ms,t.dateval)
LEFT OUTER JOIN table2 b WITH(NOLOCK)
ON b.timestamp >= t.dateval AND b.timestamp < DATEADD(MILLISECOND,@num_ms,t.dateval)
GROUP BY t.dateval
ORDER BY t.dateval
option (MaxRecursion 0)

Thanks in advance.
pigouvian asked:

kmslogic commented:
Well, a few thoughts:

1. Verify that you have indexes on the timestamp fields in table1 and table2 (see the sketch after this list).
2. It should be optimized correctly, but perhaps SQL Server is having a problem with your first table and is somehow forcing itself to re-evaluate the averages on the first SELECT statement (which is only going to return one row), so you may want to create a temp table for that query before joining the other two tables.
3. Because your join condition may match multiple records, you are potentially going to have a multiplicative effect on the number of records.
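
A minimal sketch of the indexes from point 1, assuming the table and column names from the question (the index names are hypothetical):

-- With these, the range predicates in the join can seek
-- instead of scanning the full 20 Hz tables for every interval.
CREATE NONCLUSTERED INDEX IX_table1_timestamp ON table1 ([timestamp]);
CREATE NONCLUSTERED INDEX IX_table2_timestamp ON table2 ([timestamp]);

Adding the strain columns via INCLUDE would make these indexes covering, at the cost of extra storage.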
Vitor Montalvão (MSSQL Senior Engineer) commented:
Try the same query without the CAST function, just to check how much quicker it is.
And do you really need the LEFT OUTER JOIN?
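
For illustration, an abbreviated sketch of both suggestions combined, assuming the DatetimeSeq CTE and the variables from the question are still in scope (only two of the strain columns are shown; the rest follow the same pattern):

SELECT   t.DateVal
        ,AVG(b.strain_1) AS s1_bar    -- no CAST
        ,AVG(a.strain_10) AS s10_bar  -- no CAST
        -- ...remaining strain columns as in the original query
FROM DatetimeSeq t
INNER JOIN table1 a WITH(NOLOCK)
ON a.[timestamp] >= t.DateVal AND a.[timestamp] < DATEADD(MILLISECOND, @num_ms, t.DateVal)
INNER JOIN table2 b WITH(NOLOCK)
ON b.[timestamp] >= t.DateVal AND b.[timestamp] < DATEADD(MILLISECOND, @num_ms, t.DateVal)
GROUP BY t.DateVal
ORDER BY t.DateVal
OPTION (MAXRECURSION 0)

Note that with INNER JOIN, intervals containing no readings drop out of the result entirely.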

Good luck
Olaf Doschke (Software Developer) commented:
kmslogic said: "you are potentially going to have a multiplicative effect on the number of records."

This is the main point as I see it. You don't see how many records you multiply, because you are building averages.

Even if the timestamp range is small, you don't want this to happen:

a.timestamp
1.001 data
1.002 data
1.003 data

b.timestamp
1.002 data
1.003 data
1.004 data

Here, "data" is simply a placeholder for all the other fields you join.

If you limit the joined data to the range 1.000 to 1.003, you get these join results:
a.1.001 a.data b.1.002 b.data
a.1.001 a.data b.1.003 b.data
a.1.002 a.data b.1.002 b.data
a.1.002 a.data b.1.003 b.data
a.1.003 a.data b.1.002 b.data
a.1.003 a.data b.1.003 b.data

And not what you might expect:

a.1.001 a.data NULL NULL
a.1.002 a.data b.1.002 b.data
a.1.003 a.data b.1.003 b.data

This multiplication effect doesn't ruin the averages, but it creates far too much data to aggregate.

This is what I would try: full outer join the data of a and b, and then left outer join that result with t in the wanted timestamp range:

SELECT       t.DateVal
            ,CAST(AVG(c.strain_1...
.....
.....
FROM DatetimeSeq t WITH(NOLOCK)
LEFT OUTER JOIN
(SELECT ISNULL(a.timestamp,b.timestamp) timestamp, a.*,b.*
FROM table1 a WITH(NOLOCK)
FULL OUTER JOIN table2 b WITH(NOLOCK) ON a.timestamp = b.timestamp
) c
ON c.timestamp >= t.dateval AND c.timestamp < DATEADD(MILLISECOND,@num_ms,t.dateval)

I strongly assume the last join condition on c.timestamp will limit the joined rows of the intermediate result c as it is created.

It might be better anyway to keep your original query for one table only, then do a separate query joining only the other table, because if there are few matches between a.timestamp and b.timestamp, you generate a lot of NULLs that don't influence AVG but still reduce performance.

You can later join the partial results of table1 only and table2 only via a full outer join on their DateVal column.
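
A minimal sketch of that two-query approach, assuming the variables and table/column names from the question (one strain column per table shown; the others follow the same pattern):

;WITH DatetimeSeq(DateVal) AS
(
    SELECT @startDate
        UNION ALL
    SELECT DATEADD(MILLISECOND, @num_ms, DateVal)
        FROM DatetimeSeq
        WHERE DateVal < @endDate
),
AvgA AS  -- aggregate table1 on its own: no multiplication with table2
(
    SELECT t.DateVal, AVG(a.strain_10) AS s10_bar  -- ...strain_11 to strain_19 likewise
    FROM DatetimeSeq t
    LEFT OUTER JOIN table1 a WITH(NOLOCK)
        ON a.[timestamp] >= t.DateVal AND a.[timestamp] < DATEADD(MILLISECOND, @num_ms, t.DateVal)
    GROUP BY t.DateVal
),
AvgB AS  -- aggregate table2 on its own
(
    SELECT t.DateVal, AVG(b.strain_1) AS s1_bar  -- ...strain_2 to strain_9 and strain_20 likewise
    FROM DatetimeSeq t
    LEFT OUTER JOIN table2 b WITH(NOLOCK)
        ON b.[timestamp] >= t.DateVal AND b.[timestamp] < DATEADD(MILLISECOND, @num_ms, t.DateVal)
    GROUP BY t.DateVal
)
SELECT ISNULL(x.DateVal, y.DateVal) AS DateVal, y.s1_bar, x.s10_bar
FROM AvgA x
FULL OUTER JOIN AvgB y ON x.DateVal = y.DateVal
ORDER BY ISNULL(x.DateVal, y.DateVal)
OPTION (MAXRECURSION 0)

Each partial result has exactly one row per interval, so the final join is 1:1 and the multiplication effect described above cannot occur.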

Bye, Olaf.

Bitsqueezer commented:
Hi,

maybe this variant performs faster than the one above:

DECLARE @num_ms		int;
DECLARE @startDate	datetime;
DECLARE @endDate	datetime;

SET @startDate	= '2012-03-07 18:43:00.000';
SET @endDate	= '2012-03-08 18:43:00.000' --GETDATE();
SET @num_ms		= 60000;

IF OBJECT_ID('tempdb..#tmpSeq') IS NOT NULL DROP TABLE #tmpSeq;

CREATE TABLE #tmpSeq	(	DateVal1	DateTime,
							DateVal2	DateTime,
							PRIMARY KEY (DateVal1), UNIQUE(DateVal1,DateVal2)
						);

WITH DatetimeSeq AS
(
    SELECT @startDate AS DateVal1, DATEADD(MILLISECOND, @num_ms, @startDate) AS DateVal2
        UNION ALL
    SELECT DATEADD(MILLISECOND, @num_ms, DateVal1), DATEADD(MILLISECOND, 2 * @num_ms, DateVal1)
        FROM DatetimeSeq
        WHERE DateVal1 < @endDate
)
INSERT INTO #tmpSeq (DateVal1, DateVal2)
SELECT DateVal1, DateVal2 FROM DatetimeSeq ORDER BY DateVal1
OPTION (MaxRecursion 0);

SELECT	 t.DateVal1
		,CAST(AVG(b.strain_1) AS Decimal(6,3)) AS s1_bar
    	,CAST(AVG(b.strain_2) AS Decimal(6,3)) AS s2_bar
    	,CAST(AVG(b.strain_3) AS Decimal(5,2)) AS s3_bar
    	,CAST(AVG(b.strain_4) AS Decimal(5,2)) AS s4_bar
    	,CAST(AVG(b.strain_5) AS Decimal(5,2)) AS s5_bar
    	,CAST(AVG(b.strain_6) AS Decimal(5,2)) AS s6_bar
    	,CAST(AVG(b.strain_7) AS Decimal(5,2)) AS s7_bar
    	,CAST(AVG(b.strain_8) AS Decimal(5,2)) AS s8_bar
    	,CAST(AVG(b.strain_9) AS Decimal(5,2)) AS s9_bar
		,CAST(AVG(a.strain_10) AS Decimal(6,3)) AS s10_bar
    	,CAST(AVG(a.strain_11) AS Decimal(6,3)) AS s11_bar
    	,CAST(AVG(a.strain_12) AS Decimal(6,3)) AS s12_bar
    	,CAST(AVG(a.strain_13) AS Decimal(5,2)) AS s13_bar
    	,CAST(AVG(a.strain_14) AS Decimal(5,2)) AS s14_bar
    	,CAST(AVG(a.strain_15) AS Decimal(5,2)) AS s15_bar
    	,CAST(AVG(a.strain_16) AS Decimal(5,2)) AS s16_bar
    	,CAST(AVG(a.strain_17) AS Decimal(5,2)) AS s17_bar
    	,CAST(AVG(a.strain_18) AS Decimal(5,2)) AS s18_bar
    	,CAST(AVG(a.strain_19) AS Decimal(5,2)) AS s19_bar
    	,CAST(AVG(b.strain_20) AS Decimal(5,2)) AS s20_bar
  FROM #tmpSeq t
  LEFT OUTER JOIN table1 a WITH(NOLOCK) ON a.[timestamp] >= t.DateVal1 AND a.[timestamp] < t.DateVal2
  LEFT OUTER JOIN table2 b WITH(NOLOCK) ON b.[timestamp] >= t.DateVal1 AND b.[timestamp] < t.DateVal2
GROUP BY t.DateVal1
ORDER BY t.DateVal1;

DROP TABLE #tmpSeq;

Here the recursive query is used to fill a temp table with two fields: the first and last comparison values. Both are indexed (via the PRIMARY KEY and UNIQUE constraints) so they can be used in the query later.

The query now doesn't need to calculate the "to" value with DATEADD for each row; it is contained in the temp table as DateVal2.

Of course you should have an index on the "timestamp" column in table1 and table2, and you should not use "timestamp" as a column name, as it is also a reserved word for the timestamp datatype (which is not compatible with datetime).
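
If renaming is an option, sp_rename can do it; the new column name below (reading_ts) is just a placeholder for illustration, and dependent queries and indexes must be updated afterwards:

-- Rename the reserved-word column to a non-reserved name.
EXEC sp_rename 'table1.timestamp', 'reading_ts', 'COLUMN';
EXEC sp_rename 'table2.timestamp', 'reading_ts', 'COLUMN';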

I can only test the part which creates the comparison table; that part works. The execution plan will give you a hint whether it must use a table scan (slow) or can work with the indexes of the tables.

Another thing which might speed up the output is to remove the cast to decimal. If you don't need the formatted values directly in the query output, it is better to format them where they are used, for example in the frontend.

Cheers,

Christian
pigouvian (Author) commented:
Thanks, Olaf and Christian. I've worked with a combination of the modifications that you both suggested and there were marginal improvements, though the query still takes exceptionally long to execute.

Your help is appreciated. If you think of any alternatives to further improve performance, please let me know.
Olaf Doschke (Software Developer) commented:
This subquery

(SELECT ISNULL(a.timestamp,b.timestamp) timestamp, a.*,b.*
FROM table1 a WITH(NOLOCK)
FULL OUTER JOIN table2 b WITH(NOLOCK) ON a.timestamp = b.timestamp
) c

then obviously still doesn't optimize well. As I said, it also depends on your data.

It seems the join condition ON c.timestamp >= t.dateval AND c.timestamp < DATEADD(MILLISECOND, @num_ms, t.dateval) is applied only afterwards.

You could optimize how much of table1 and table2 is joined into the subquery result c by determining MIN(t.dateval) and MAX(DATEADD(MILLISECOND, @num_ms, t.dateval)) and using these min/max datetimes to limit the data in c.
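
A sketch of that limiting; since the sequence is built from @startDate and @endDate, the overall range is already known and no extra MIN/MAX query is needed (only two strain columns are listed; the rest follow the same pattern):

(SELECT ISNULL(a.[timestamp], b.[timestamp]) AS ts,
        a.strain_10, b.strain_1  -- ...remaining strain columns likewise
 FROM table1 a WITH(NOLOCK)
 FULL OUTER JOIN table2 b WITH(NOLOCK) ON a.[timestamp] = b.[timestamp]
 WHERE (a.[timestamp] >= @startDate AND a.[timestamp] < DATEADD(MILLISECOND, @num_ms, @endDate))
    OR (b.[timestamp] >= @startDate AND b.[timestamp] < DATEADD(MILLISECOND, @num_ms, @endDate))
) c

The OR keeps unmatched rows from either side, and the range predicate inside the derived table lets the optimizer use the timestamp indexes once, instead of once per interval.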

I'd rather go the other route recommended above, though, and process table1 and table2 in two separate queries, then join only their results.

Bye, Olaf.