Solved

T-SQL Left Join Query Optimization

Posted on 2012-04-11
768 Views
Last Modified: 2012-06-21
I have a query (see below) that performs quite well when joined with one table, but whose execution time increases exponentially as I add columns from an additional table. I'm sure I must be missing something simple in how I think this query executes, but looking at the execution plan, I'm not sure exactly what it is.

Q: How best to optimize the following query to run against two tables, each containing 20 Hz data?

DECLARE @num_ms int
DECLARE @startDate datetime
DECLARE @endDate datetime

SET @startDate = '2012-03-07 18:43:00.000'
SET @endDate = '2012-03-08 18:43:00.000' --GETDATE();
SET @num_ms = 60000 

;WITH DatetimeSeq(DateVal) as
(
    SELECT @startDate AS DateVal
        UNION ALL
    SELECT DATEADD(MILLISECOND, @num_ms, DateVal)
        FROM DatetimeSeq
        WHERE DateVal < @endDate
)

SELECT	 t.DateVal
		,CAST(AVG(b.strain_1) AS Decimal(6,3)) AS s1_bar
    	,CAST(AVG(b.strain_2) AS Decimal(6,3)) AS s2_bar
    	,CAST(AVG(b.strain_3) AS Decimal(5,2)) AS s3_bar
    	,CAST(AVG(b.strain_4) AS Decimal(5,2)) AS s4_bar
    	,CAST(AVG(b.strain_5) AS Decimal(5,2)) AS s5_bar
    	,CAST(AVG(b.strain_6) AS Decimal(5,2)) AS s6_bar
    	,CAST(AVG(b.strain_7) AS Decimal(5,2)) AS s7_bar
    	,CAST(AVG(b.strain_8) AS Decimal(5,2)) AS s8_bar
    	,CAST(AVG(b.strain_9) AS Decimal(5,2)) AS s9_bar
		,CAST(AVG(a.strain_10) AS Decimal(6,3)) AS s10_bar
    	,CAST(AVG(a.strain_11) AS Decimal(6,3)) AS s11_bar
    	,CAST(AVG(a.strain_12) AS Decimal(6,3)) AS s12_bar
    	,CAST(AVG(a.strain_13) AS Decimal(5,2)) AS s13_bar
    	,CAST(AVG(a.strain_14) AS Decimal(5,2)) AS s14_bar
    	,CAST(AVG(a.strain_15) AS Decimal(5,2)) AS s15_bar
    	,CAST(AVG(a.strain_16) AS Decimal(5,2)) AS s16_bar
    	,CAST(AVG(a.strain_17) AS Decimal(5,2)) AS s17_bar
    	,CAST(AVG(a.strain_18) AS Decimal(5,2)) AS s18_bar
    	,CAST(AVG(a.strain_19) AS Decimal(5,2)) AS s19_bar
    	,CAST(AVG(b.strain_20) AS Decimal(5,2)) AS s20_bar
FROM DatetimeSeq t WITH(NOLOCK)
LEFT OUTER JOIN table1 a WITH(NOLOCK)
ON a.timestamp >= t.dateval AND a.timestamp < DATEADD(MILLISECOND,@num_ms,t.dateval)
LEFT OUTER JOIN table2 b WITH(NOLOCK)
ON b.timestamp >= t.dateval AND b.timestamp < DATEADD(MILLISECOND,@num_ms,t.dateval)
GROUP BY t.dateval
ORDER BY t.dateval
option (MaxRecursion 0)



Thanks in advance.
Question by:pigouvian
6 Comments
 
LVL 16

Expert Comment

by:kmslogic
ID: 37836699
A few thoughts:

Verify that you have indexes on the timestamp fields in table1 and table2.
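For example, something along these lines (a sketch only; the index names are illustrative, and whether to INCLUDE the strain columns depends on your workload):

-- Illustrative sketch: a plain index on [timestamp] enables range seeks
-- for the interval join predicate; INCLUDE the strain columns to make it covering.
CREATE NONCLUSTERED INDEX IX_table1_timestamp ON table1 ([timestamp]);
CREATE NONCLUSTERED INDEX IX_table2_timestamp ON table2 ([timestamp]);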
It should optimize correctly, but perhaps SQL Server is having a problem with your recursive sequence and is somehow forcing itself to re-evaluate it while computing the averages (the anchor SELECT is only going to return one row), so you may want to materialize that query into a temp table before joining the other two tables.
Because your join condition can match multiple records, you are potentially going to get a multiplicative effect on the number of records.
 
LVL 45

Expert Comment

by:Vitor Montalvão
ID: 37836752
Try the same query without the CAST functions, just to check how much quicker it is.
And do you really need a LEFT OUTER JOIN?
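For instance, a minimal sketch of both ideas combined (assuming every interval has rows in both tables; an INNER JOIN drops intervals that don't):

-- Sketch: no CAST (format in the front end instead) and INNER JOINs;
-- everything else stays as in the original query.
SELECT   t.DateVal
        ,AVG(b.strain_1) AS s1_bar
        -- ...remaining AVG columns unchanged, without the CASTs...
FROM DatetimeSeq t
INNER JOIN table1 a
  ON a.[timestamp] >= t.DateVal AND a.[timestamp] < DATEADD(MILLISECOND, @num_ms, t.DateVal)
INNER JOIN table2 b
  ON b.[timestamp] >= t.DateVal AND b.[timestamp] < DATEADD(MILLISECOND, @num_ms, t.DateVal)
GROUP BY t.DateVal
ORDER BY t.DateVal
OPTION (MAXRECURSION 0)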

Good luck
 
LVL 29

Accepted Solution

by:
Olaf Doschke earned 350 total points
ID: 37836821
kmslogic said> you are potentially going to have a multiplicative effect on the number of records.

This is the main point as I see it. You don't see how many records you multiply, because you are building averages.

Even if the timestamp range is small, you won't want this to happen:

a.timestamp
1.001 data
1.002 data
1.003 data

b.timestamp
1.002 data
1.003 data
1.004 data

Here, data is simply a placeholder for all the other fields you join.

If you limit the joined data to the range 1.000 to 1.003, you get these joins:
a.1.001 a.data b.1.002 b.data
a.1.001 a.data b.1.003 b.data
a.1.002 a.data b.1.002 b.data
a.1.002 a.data b.1.003 b.data
a.1.003 a.data b.1.002 b.data
a.1.003 a.data b.1.003 b.data

And not what you might expect:

a.1.001 a.data NULL    NULL
a.1.002 a.data b.1.002 b.data
a.1.003 a.data b.1.003 b.data

This multiplication effect doesn't ruin the averages, but it creates far too much data to aggregate.

This is what I would try: full outer join the data of a and b, and then left outer join that result with t in the wanted timestamp range:

SELECT       t.DateVal
            ,CAST(AVG(c.strain_1...
.....
.....
FROM DatetimeSeq t WITH(NOLOCK)
LEFT OUTER JOIN
(SELECT ISNULL(a.timestamp,b.timestamp) timestamp, a.*,b.*
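-- note: a.* and b.* also each contain a column named timestamp; in practice, list
-- the needed strain columns explicitly so the derived table c has unique column names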
FROM table1 a WITH(NOLOCK)
FULL OUTER JOIN table2 b WITH(NOLOCK) ON a.timestamp = b.timestamp
) c
ON c.timestamp >= t.dateval AND c.timestamp < DATEADD(MILLISECOND,@num_ms,t.dateval)

I strongly assume the last join condition on c.timestamp will limit the joined rows of the intermediate result c as it is created.

It might be better anyway to keep your original query for one table only, and then run a separate query joining only the other table. If there are few matches between a.timestamp and b.timestamp, you generate a lot of NULLs that don't influence AVG but still reduce performance.

You can later join the partial results of the table1-only and table2-only queries via a full outer join on their DateVal column.
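A rough sketch of that two-query approach (untested; it reuses the variables and the recursive sequence from the original query, and shows only one strain column per side):

-- Sketch only: aggregate each source table separately, then stitch the
-- partial results together on DateVal via a FULL OUTER JOIN.
;WITH DatetimeSeq(DateVal) AS
(
    SELECT @startDate
    UNION ALL
    SELECT DATEADD(MILLISECOND, @num_ms, DateVal)
    FROM DatetimeSeq
    WHERE DateVal < @endDate
),
AvgA AS
(   -- averages from table1 only
    SELECT t.DateVal, AVG(a.strain_10) AS s10_bar -- ...strain_11 to strain_19 likewise...
    FROM DatetimeSeq t
    LEFT OUTER JOIN table1 a WITH(NOLOCK)
      ON a.[timestamp] >= t.DateVal AND a.[timestamp] < DATEADD(MILLISECOND, @num_ms, t.DateVal)
    GROUP BY t.DateVal
),
AvgB AS
(   -- averages from table2 only
    SELECT t.DateVal, AVG(b.strain_1) AS s1_bar -- ...strain_2 to strain_9 and strain_20 likewise...
    FROM DatetimeSeq t
    LEFT OUTER JOIN table2 b WITH(NOLOCK)
      ON b.[timestamp] >= t.DateVal AND b.[timestamp] < DATEADD(MILLISECOND, @num_ms, t.DateVal)
    GROUP BY t.DateVal
)
SELECT ISNULL(x.DateVal, y.DateVal) AS DateVal, x.s10_bar, y.s1_bar -- ...remaining averages...
FROM AvgA x
FULL OUTER JOIN AvgB y ON x.DateVal = y.DateVal
ORDER BY DateVal
OPTION (MAXRECURSION 0);

Note that both CTEs reference DatetimeSeq, so the sequence is evaluated twice; materializing it into a temp table first (as suggested above) avoids that.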

Bye, Olaf.

 
LVL 24

Assisted Solution

by:Bitsqueezer
Bitsqueezer earned 150 total points
ID: 37836828
Hi,

maybe this variant performs faster than the one above:

DECLARE @num_ms		int;
DECLARE @startDate	datetime;
DECLARE @endDate	datetime;

SET @startDate	= '2012-03-07 18:43:00.000';
SET @endDate	= '2012-03-08 18:43:00.000' --GETDATE();
SET @num_ms		= 60000;

IF OBJECT_ID('tempdb..#tmpSeq') IS NOT NULL DROP TABLE #tmpSeq;

CREATE TABLE #tmpSeq	(	DateVal1	DateTime,
							DateVal2	DateTime,
							PRIMARY KEY (DateVal1), UNIQUE(DateVal1,DateVal2)
						);

WITH DatetimeSeq AS
(
    SELECT @startDate AS DateVal1, DATEADD(MILLISECOND, @num_ms, @startDate) AS DateVal2
        UNION ALL
    SELECT DATEADD(MILLISECOND, @num_ms, DateVal1), DATEADD(MILLISECOND, 2 * @num_ms, DateVal1)
        FROM DatetimeSeq
        WHERE DateVal1 < @endDate
)
INSERT INTO #tmpSeq (DateVal1, DateVal2)
SELECT DateVal1, DateVal2 FROM DatetimeSeq ORDER BY DateVal1
OPTION (MaxRecursion 0);

SELECT	 t.DateVal1
		,CAST(AVG(b.strain_1) AS Decimal(6,3)) AS s1_bar
    	,CAST(AVG(b.strain_2) AS Decimal(6,3)) AS s2_bar
    	,CAST(AVG(b.strain_3) AS Decimal(5,2)) AS s3_bar
    	,CAST(AVG(b.strain_4) AS Decimal(5,2)) AS s4_bar
    	,CAST(AVG(b.strain_5) AS Decimal(5,2)) AS s5_bar
    	,CAST(AVG(b.strain_6) AS Decimal(5,2)) AS s6_bar
    	,CAST(AVG(b.strain_7) AS Decimal(5,2)) AS s7_bar
    	,CAST(AVG(b.strain_8) AS Decimal(5,2)) AS s8_bar
    	,CAST(AVG(b.strain_9) AS Decimal(5,2)) AS s9_bar
		,CAST(AVG(a.strain_10) AS Decimal(6,3)) AS s10_bar
    	,CAST(AVG(a.strain_11) AS Decimal(6,3)) AS s11_bar
    	,CAST(AVG(a.strain_12) AS Decimal(6,3)) AS s12_bar
    	,CAST(AVG(a.strain_13) AS Decimal(5,2)) AS s13_bar
    	,CAST(AVG(a.strain_14) AS Decimal(5,2)) AS s14_bar
    	,CAST(AVG(a.strain_15) AS Decimal(5,2)) AS s15_bar
    	,CAST(AVG(a.strain_16) AS Decimal(5,2)) AS s16_bar
    	,CAST(AVG(a.strain_17) AS Decimal(5,2)) AS s17_bar
    	,CAST(AVG(a.strain_18) AS Decimal(5,2)) AS s18_bar
    	,CAST(AVG(a.strain_19) AS Decimal(5,2)) AS s19_bar
    	,CAST(AVG(b.strain_20) AS Decimal(5,2)) AS s20_bar
  FROM #tmpSeq t
  LEFT OUTER JOIN table1 a WITH(NOLOCK) ON a.[timestamp] >= t.DateVal1 AND a.[timestamp] < t.DateVal2
  LEFT OUTER JOIN table2 b WITH(NOLOCK) ON b.[timestamp] >= t.DateVal1 AND b.[timestamp] < t.DateVal2
GROUP BY t.DateVal1
ORDER BY t.DateVal1;

DROP TABLE #tmpSeq;



Here the recursive query is used to fill a temp table with two fields, the lower and the upper compare value for each interval. Both are indexed so they can be used by the query later.

The query now doesn't need to calculate the "to" value with DATEADD for each row; it is already contained in the temp table as DateVal2.

Of course you should have an index defined on the "timestamp" column in the tables "table1" and "table2", and you should not use "timestamp" as a column name, as it is also a reserved word for the timestamp datatype (which is not compatible with DateTime).
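If renaming is an option, a minimal sketch (the new name reading_ts is only an example):

-- Rename the column so it no longer collides with the timestamp datatype name.
EXEC sp_rename 'table1.timestamp', 'reading_ts', 'COLUMN';
EXEC sp_rename 'table2.timestamp', 'reading_ts', 'COLUMN';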

I could only test the part which creates the compare table; that part works. The execution plan will give you a hint whether the query must use a table scan (slow) or can work from the indexes on the tables.
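If you prefer numbers over a graphical plan, a simple sketch using standard session options:

-- Report logical reads and CPU/elapsed time per statement in the batch.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- ...run the query here...
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;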

Another thing which might speed up the output is to remove the cast to decimal. If you don't need the formatted output directly from the query, you could better format the output where it is used, for example in the front end.

Cheers,

Christian
 

Author Closing Comment

by:pigouvian
ID: 37853887
Thanks, Olaf and Christian. I've worked with a combination of the modifications you both suggested, and there were marginal improvements, though the query still takes exceptionally long to execute.

Your help is appreciated. If you think of any alternatives to further improve performance, please let me know.
 
LVL 29

Expert Comment

by:Olaf Doschke
ID: 37869879
This subquery

(SELECT ISNULL(a.timestamp,b.timestamp) timestamp, a.*,b.*
FROM table1 a WITH(NOLOCK)
FULL OUTER JOIN table2 b WITH(NOLOCK) ON a.timestamp = b.timestamp
) c

then obviously is still not easy to optimize. As I said, it also depends on your data.

It seems the join condition ON c.timestamp >= t.dateval AND c.timestamp < DATEADD(MILLISECOND,@num_ms,t.dateval) is only applied afterwards.

You could optimise how much of table1 and table2 is joined into the subquery result c by determining MIN(t.dateval) and MAX(DATEADD(MILLISECOND,@num_ms,t.dateval)) and using these min/max datetimes to limit the data going into c.
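For example, a rough sketch (here the bounds are simply @startDate and DATEADD(MILLISECOND, @num_ms, @endDate), since the sequence is built from those variables):

-- Pre-filter both tables to the overall window before the full outer join,
-- so c never contains rows outside the range of the sequence.
(SELECT ISNULL(a.[timestamp], b.[timestamp]) AS ts, ... -- strain columns listed explicitly
 FROM (SELECT * FROM table1 WITH(NOLOCK)
       WHERE [timestamp] >= @startDate
         AND [timestamp] <  DATEADD(MILLISECOND, @num_ms, @endDate)) a
 FULL OUTER JOIN
      (SELECT * FROM table2 WITH(NOLOCK)
       WHERE [timestamp] >= @startDate
         AND [timestamp] <  DATEADD(MILLISECOND, @num_ms, @endDate)) b
   ON a.[timestamp] = b.[timestamp]
) c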

I'd rather go the other route recommended then, and process table1 and table2 in two separate queries, only joining their results.

Bye, Olaf.
