Solved

The purge of data

Posted on 2012-03-28
272 Views
Last Modified: 2012-04-26
Dear all,

Right now we are trying to purge data, but we find the process very slow.

Our query is:
===================================================================
DECLARE @NoOfMonthRetention INT, @BusinessDate DATETIME

DELETE tk
FROM task tk
WHERE user_activity_log_identifier IN
      (SELECT user_activity_log_identifier
       FROM user_activity_log
       WHERE action_time < DATEADD(MM, -@NoOfMonthRetention, @BusinessDate))

=====================================================================

The execution plan of this query is attached.

What I see is that it deletes rows WHERE xxxx IN a result set from a SELECT statement, which should be an expensive operation. How can we solve this?

Should we use more temp tables so that we use fewer joins? But more temp tables will put more load on tempdb, right? That is another problem.

DBA100.
execution-plan-before-change-1.jpg
execution-plan-before-change-2.jpg
execution-plan-before-change-3.jpg
execution-plan-before-change-4.jpg
Question by: marrowyung
11 Comments
 
LVL 15

Accepted Solution

by: Anuj (earned 167 total points)
ID: 37776228
Create a clustered index on user_activity_log; the index key should be your primary key, to avoid heap lookups. This improves performance for all queries referring to the user_activity_log table.
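A minimal sketch of that suggestion, assuming user_activity_log_identifier is the primary key column and the table is currently a heap with no primary key constraint (the constraint name is illustrative; if a nonclustered primary key already exists, it would have to be dropped and re-created as clustered):

-- Make the primary key the clustered index key, so lookups on
-- user_activity_log_identifier use the B-tree instead of a heap (RID) lookup.
ALTER TABLE dbo.user_activity_log
ADD CONSTRAINT PK_user_activity_log
PRIMARY KEY CLUSTERED (user_activity_log_identifier)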

Another option is to create a covering index; make sure that the index covers both the filter column and the selected column, so the subquery can be answered entirely from the index:
CREATE INDEX IX_NCL_UserActicityLog_ActionTime
ON user_activity_log (action_time)
INCLUDE (user_activity_log_identifier)

OR

CREATE INDEX IX_NCL_UserActicityLog_ActionTime
ON user_activity_log (action_time,user_activity_log_identifier)
 
LVL 15

Assisted Solution

by: deepakChauhan (earned 167 total points)
ID: 37776518
Hi,

You don't need to create an index if your table has minimal SELECT activity.
You can use a temp table like this:

declare
@NoOfMonthRetention      INT,@BusinessDate DATETIME

SELECT  user_activity_log_identifier into ##t1 FROM  user_activity_log
WHERE  action_time      < DATEADD(MM, -@NoOfMonthRetention, @BusinessDate
go
DELETE    FROM  task
WHERE user_activity_log_identifier in ( select user_activity_log_identifier from ##t1)

or
you can use a CTE, as sketched below.
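For reference, a minimal sketch of the CTE variant (a CTE is a named result set defined with WITH that exists only for the single statement that follows it; the CTE name old_logs is illustrative):

DECLARE @NoOfMonthRetention INT, @BusinessDate DATETIME
-- assign @NoOfMonthRetention and @BusinessDate before running

;WITH old_logs AS
(
    -- identifiers of log rows older than the retention window
    SELECT user_activity_log_identifier
    FROM user_activity_log
    WHERE action_time < DATEADD(MM, -@NoOfMonthRetention, @BusinessDate)
)
DELETE tk
FROM task tk
INNER JOIN old_logs ol
    ON tk.user_activity_log_identifier = ol.user_activity_log_identifier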
 
LVL 9

Assisted Solution

by: rajeevnandanmishra (earned 166 total points)
ID: 37776535
Hi,
If an index doesn't make you happy, then you can use the code below. It will definitely improve the performance.
---------
DECLARE @NoOfMonthRetention INT, @BusinessDate DATETIME

-- Stage the identifiers to purge in a local temp table
SELECT user_activity_log_identifier
INTO #myTemp
FROM user_activity_log
WHERE action_time < DATEADD(MM, -@NoOfMonthRetention, @BusinessDate)

-- Limit each DELETE to 1000 rows so every transaction stays small
SET ROWCOUNT 1000

WHILE 1 = 1
BEGIN
      DELETE tk
      FROM task tk
      INNER JOIN #myTemp tp
            ON tk.user_activity_log_identifier = tp.user_activity_log_identifier

      IF @@ROWCOUNT = 0
            BREAK
END

-- Reset ROWCOUNT so later statements in the session are unaffected
SET ROWCOUNT 0

--------
 
LVL 1

Author Comment

by: marrowyung
ID: 37780358
anujnb,

like this:

CREATE CLUSTERED INDEX [Newindex_user_activity_log] ON [dbo].[user_activity_log]
(
      [user_activity_log_identifier] ASC
)WITH (SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]

exec sp_updatestats

Why should the index key be the primary key to avoid a heap lookup? What is a heap lookup? Is the clustered index the heap, and if we define the clustered index on the primary key, will it do a clustered index scan/seek instead?

deepakChauhan,

you mean load the result set into a temp table, which is in tempdb, and then run the delete against it?

What is a CTE?

rajeevnandanmishra,

This example uses more code than what deepakChauhan provided, so it will run slower, right? You both insert the values into a temp table.

So you both prefer the IN statement against a temp table in tempdb, right?

But this will create contention in tempdb itself, right? And the size of tempdb will grow if this operation is a daily operation.

DBA100.
 
LVL 9

Expert Comment

by: rajeevnandanmishra
ID: 37780485
Hi,
My suggestion may look like a little more code, but it is actually going to run faster.
A few points that I would like to mention:

1. The filtered data from user_activity_log has to be created either way (via an index in the current DB, via the IN query in tempdb, via a CTE, or via a TEMP table in tempdb). So it will not make too much of a difference (unless you have a very small amount of data that needs to be purged).

2. It must be noted that each DML statement runs in an individual TRANSACTION, so the size of the transaction log keeps increasing until the statement completes. By putting a limit on the affected rows, we reduce both the transaction log usage and the locking period, which will reduce the overall time of your operation. (See the sketch below.)
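For illustration, the same row-limiting idea can be written with DELETE TOP instead of SET ROWCOUNT (SET ROWCOUNT is deprecated for DML statements in later SQL Server versions); a minimal sketch, reusing the #myTemp table from the earlier comment:

-- Delete matching rows in batches of 1000; each batch commits as its own
-- small transaction, keeping log growth and lock duration bounded.
WHILE 1 = 1
BEGIN
    DELETE TOP (1000) tk
    FROM task tk
    INNER JOIN #myTemp tp
        ON tk.user_activity_log_identifier = tp.user_activity_log_identifier

    IF @@ROWCOUNT = 0
        BREAK
END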
 
LVL 1

Author Comment

by: marrowyung
ID: 37794846
anujnb,

After creating the clustered index:

CREATE CLUSTERED INDEX [_dta_index_user_activity_log_c_5_773577794__K1] ON [dbo].[user_activity_log]
(
      [user_activity_log_identifier] ASC
)WITH (SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]

exec sp_updatestats

The result is attached.

Question: why are there so many operators called Parallelism just after adding one more index? Their cost is 0, though, which is good.

The Hash Match is a powerful operator, and even though it is costing 33%, it should be OK, right?

But the clustered index delete increased from 5% to 10%; that is not good, is it?

I have updated the client statistics picture from SSMS so that you all can see the before and after statistics.
execution-plan-after-change-1.jpg
execution-plan-after-change-2.jpg
execution-plan-after-change-3.jpg
execution-plan-after-change-4.jpg
client-statisitcs.jpg
 
LVL 1

Author Comment

by: marrowyung
ID: 37884936
rajeevnandanmishra,

"It must be noticed that, each DML run in an individual TRANSACTION. So, the size of transaction log will keep on increasing till it doesn't completes it. So, putting a limit on the affected rows, we are actually going to reduce the size of transaction log and locking period also. This will reduce the overall timing of your operation. "

that's why the row count in the loop is 1000, right?


I am still wondering why inserting into a temp table first will be faster.

deepakChauhan,

your script returns an error:

declare
@NoOfMonthRetention      INT,@BusinessDate DATETIME

SELECT  user_activity_log_identifier into ##t1 FROM  user_activity_log
WHERE  action_time      < DATEADD(MM, -@NoOfMonthRetention, @BusinessDate
go
DELETE    FROM  task
WHERE user_activity_log_identifier in ( select user_activity_log_identifier from ##t1)

It said:

Msg 102, Level 15, State 1, Line 5
Incorrect syntax near '@BusinessDate'.
Msg 208, Level 16, State 0, Line 1
Invalid object name '##t1'.
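The first message points at the missing closing parenthesis after @BusinessDate in the SELECT ... INTO statement; because that statement fails, ##t1 is never created, which explains the second message. A corrected sketch of the same script:

DECLARE @NoOfMonthRetention INT, @BusinessDate DATETIME
-- assign @NoOfMonthRetention and @BusinessDate before running

SELECT user_activity_log_identifier
INTO ##t1
FROM user_activity_log
WHERE action_time < DATEADD(MM, -@NoOfMonthRetention, @BusinessDate)  -- closing parenthesis added

DELETE FROM task
WHERE user_activity_log_identifier IN (SELECT user_activity_log_identifier FROM ##t1)

DROP TABLE ##t1  -- clean up the global temp table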
 
LVL 1

Author Comment

by: marrowyung
ID: 37884946
From my point of view, first inserting into one more temp table and then doing the same WHERE xxxx IN (SELECT ... FROM the temp table) adds one more step, so it should be slower.
 
LVL 9

Expert Comment

by: rajeevnandanmishra
ID: 37891328
@marrowyung:
Yes, by setting the rowcount to 1000, we are minimizing the total transaction size. This minimizes the space required in tempdb and the transaction log, so from a performance point of view this is definitely going to be faster.

Sorry, but I am missing the original question. What do you want to know in this part?
 
LVL 1

Author Comment

by: marrowyung
ID: 37895120
rajeevnandanmishra,

Does it just delete 1000 rows at a time, and no more? Does it make sure that everything in the temp table gets deleted from the task table?

DBA100.
 
LVL 9

Expert Comment

by: rajeevnandanmishra
ID: 37895615
@marrowyung:

I assumed that you would put the given code in a stored procedure. In that case the WHILE loop will continue deleting the data as long as matching rows exist. Also, the temp table will be removed automatically by SQL Server after the procedure finishes, though you can always drop the temp table by putting
drop table #myTemp
at the end.
The number 1000 is just an example; depending on the row size, this value can be changed. When I use this, I try to keep each deletion batch to no more than about 1 MB (which normally comes to around 1000 rows). A sketch of the full procedure is below.
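A minimal sketch of the whole routine wrapped in a stored procedure with the batch size as a parameter (the procedure name and the @BatchSize parameter are illustrative, not from this thread):

CREATE PROCEDURE dbo.usp_PurgeTaskData
    @NoOfMonthRetention INT,
    @BusinessDate       DATETIME,
    @BatchSize          INT = 1000  -- tune so one batch stays around 1 MB
AS
BEGIN
    -- stage the identifiers to purge
    SELECT user_activity_log_identifier
    INTO #myTemp
    FROM user_activity_log
    WHERE action_time < DATEADD(MM, -@NoOfMonthRetention, @BusinessDate)

    -- delete in small batches so each transaction stays short
    WHILE 1 = 1
    BEGIN
        DELETE TOP (@BatchSize) tk
        FROM task tk
        INNER JOIN #myTemp tp
            ON tk.user_activity_log_identifier = tp.user_activity_log_identifier

        IF @@ROWCOUNT = 0
            BREAK
    END

    DROP TABLE #myTemp  -- optional; dropped automatically when the procedure ends
END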
