Solved

conditional delete on huge number of rows

Posted on 2013-06-14
23
267 Views
Last Modified: 2013-06-18
hi guys

I have two tables
Table1 and Table2

Table1
ID
1
2
3
4
..and so on

Table2
ID
2
3
4
5
...and so on

I am comparing both the tables in following way:

If Id of Table2 does not exists in Id of Table1

delte that Id from Table2.

My logic is
I am getting a collection of Table1 Ids and Table2 Ids
Looping through each Table2 id
Check if each Table2 id exists in Table1 Id collection
If not exists delete the id using this sql:


DELETE FROM table2 WHERE id = :id

There are 1 million Id in Table1 and more than 1 million Id in Table2

My overall process takes a long time (more than an hour)
what could be the reason for this?

Right now ID column is not indexed. Will indexing help?


Any sugessions will be great.
thanks
0
Comment
Question by:royjayd
  • 8
  • 5
  • 4
  • +3
23 Comments
 
LVL 16

Accepted Solution

by:
Surendra Nath earned 300 total points
ID: 39249210
Right now ID column is not indexed. Will indexing help?
Yes it will help a lot

and use the below code for better performance

delete T2 
from Table2 T2
where not exists ( select 1 from Table1 T1 where T1.id = T2.id)

Open in new window


if the above code is also taking more time then that might be because of the huge transaction log that is being generated in the back end.
This can be avoided by employing batch deletes as below

declare @minID int,@maxID int,@batchCount INT,@CurrentID INT
select @minID = min(id) ,@maxid = max(id) from Table2
set @batchCount = 5000
SET @CurrentID = @minId
while  @CurrentID <= @MaxID
BEGIN
delete T2 
from Table2 T2
where not exists ( select 1 from Table1 T1 where T1.id = T2.id)
and T2.ID BETWEEN  @CurrentID AND @CurrentID+@batchCount
SET @CurrentID = @CurrentID + @batchCount
END

Open in new window

0
 

Author Comment

by:royjayd
ID: 39249211
actually i clicked on 'Display Estimated Execution plan' (in SQL Server Management Studio)
It does say
Missing Index : Create nonclustered index on table2.id

do i always follow what the execution plan says ?
0
 
LVL 16

Expert Comment

by:Surendra Nath
ID: 39249231
Yes and No

The answer to your is it depends.
Creating an index might increase the performance of your query that is running right now...
But it might decrease the performance of an insert or update some where else.
And also remember indexes do take disk space, and it is unwise to have a lot indexes on a table.

Hope this helps out...
0
 
LVL 13

Expert Comment

by:Jesus Rodriguez
ID: 39249271
Why don't you do this

DELETE FROM T2
WHERE ID NOT IN (SELECT ID FROM T1)

If ID is a key in both tables will work pretty fast. I got a table with more than 4 millions of records and works fine.
0
 
LVL 48

Expert Comment

by:PortletPaul
ID: 39249340
regarding: IN()
use 'not exists' instead - as suggested by Neo_jarvis - it will perform better than IN()
0
 
LVL 35

Assisted Solution

by:David Todd
David Todd earned 140 total points
ID: 39249343
Hi,

Something to bear in mind is that you are dealing with a few million rows here. This might start to be more than the server can handle at once and so it may pay to batch this.

Assuming that you are referring to a production system, and that that system is in full recovery mode, and even if in simple recovery mode, that delete could be one huge transaction that is logged in total  and bloats your transaction log.

An easy way for SQL 2008R2 is

delete top 1000
from t2
where
    id not in (
        select ID from t1
        )

Obviously that will need to be put in a loop. Exit condition for the loop could be as simple as checking the @@RowsAffected system variable, or doing a count of rows still to delete.

And don't forget about putting a checkpoint command in the loop.

Sure this will slow the delete down, and maybe you could increase the top 1000 some, but do you want a batch size of a million rows to choke at 99% because growth of the log file has used all available disk, and its now rolling back. And by the way, has an exclusive table lock preventing other users doing anything ...

HTH
  David
0
 
LVL 31

Expert Comment

by:awking00
ID: 39253501
delete from table2 where id in
(select id from table2 except select id from table1)
And definitely create an index for id in both tables.
0
 

Author Comment

by:royjayd
ID: 39253801
i was googling and i found this

Instead of DELETE do this

select * into new_table from original_table where ...
go
drop table original_table
go
sp_rename 'new_table', 'original_table'

is that a better solution ?

thanks
0
 
LVL 31

Expert Comment

by:awking00
ID: 39253830
>>is that a better solution ?<<
Probably not. Doing it that way will cause a loss of any indexing, referential constraints, etc.
If you must,
create the newtable as select * from originaltable where ...
truncate originaltable;
insert into originaltable select * from newtable;
drop newtable;
But I don't see how that would improve performance over the proposed solutions.
0
 

Author Comment

by:royjayd
ID: 39254091
i am not sure also  but the stats are pretty intresting

<<For a 24 million row table, this operation can run in less than two minutes on a standard modern server. Be aware you'd need to recreate indexes and you could break foreign keys>>>

Here is the discussion

http://social.msdn.microsoft.com/Forums/en-US/transactsql/thread/ecd6c649-3613-460d-9fec-2aa1483256e8/
0
 
LVL 35

Expert Comment

by:David Todd
ID: 39254156
Hi,

>> select into newtable
If what you want left is a small fraction of the original then this might be worth while - give the hassle of re-creating indexes and referential integrity.

Otherwise I suggest the delete top 10000 I mentioned before ...

Regards
  David
0
How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

 
LVL 31

Expert Comment

by:awking00
ID: 39254179
<<For a 24 million row table, this operation can run in less than two minutes on a standard modern server.>>
How do you know? Are indexes being used? Have statistics been gathered? How many processors? How much memory? Has logging been disabled?

<< Be aware you'd need to recreate indexes and you could break foreign keys>>>
Just as I stated.

How long does it take to run the following query?
select count(*) from
(select id from table2 except select id from table1)
0
 

Author Comment

by:royjayd
ID: 39254721
>>select count(*) from
(select id from table2 except select id from table1)

Is the sql correct ?
I get an error .. Incorrect syntax near ')'.

however this is some sample data i am running on:

select count(*) from table1
>1112653

select count(*) from table2
>1112663


select COUNT(*)
from table2 where
    ID not in (
        select ID from table1  )

>10

thx
0
 
LVL 48

Assisted Solution

by:PortletPaul
PortletPaul earned 30 total points
ID: 39254960
if the difference is just 10 then creating a new table of 1.1m rows is not warranted

& NOT EXISTS will generally outperform use of IN()

I'm still recommending the first reply: ID: 39249210, did you try any of it?

-- e.g. try this instead of IN()
select *
from Table2 T2
where NOT EXISTS ( select 1 from Table1 T1 where T1.id = T2.id)
0
 

Author Comment

by:royjayd
ID: 39255015
PortletPaul,
thanks.
few clarifications

>>if the difference is just 10 then creating a new table of 1.1m rows is not warranted
the difference is 10 now but it could be anything, it could go up to 1000 or 100,000 or a million or more. We are anticipating huge difference as our application goes live into production.

I have also created index on Id.
we are using a batch process to delete 50000 rows in one batch. The batch process is controlled by a framework called (spring-batch from spring framework)

This is my sql right now

DELETE TOP (50000)
from table2 where
    ID not in (
        select ID from table1  )

the performance still not good , takes around an hour
but I will try your suggestion with not exist
DELETE TOP (50000)
from Table2 T2
where NOT EXISTS ( select 1 from Table1 T1 where T1.id = T2.id)

thanks
0
 
LVL 48

Expert Comment

by:PortletPaul
ID: 39255032
>>We are anticipating huge difference as our application goes live into production.
mmmmmm errrrrggghhhhhhhh

have you considered "not appending" logic so as to avoid a massive delete?
(just a thought)

is this a one-time thing or something that may/will occur often?
0
 

Author Comment

by:royjayd
ID: 39255057
This delete batch operation will run once an hour so possibly 24 times a day.

>>have you considered "not appending" logic so as to avoid a massive delete?
can you explain what you mean?

thanks.
0
 
LVL 35

Assisted Solution

by:David Todd
David Todd earned 140 total points
ID: 39255073
Hi,

Does your transaction log file share the same physical disks as the data file?

I suggest that you look into providing another mirrored pair of disks, and move the transaction log file to those disks, and keep those disks for just the log files - don't partition them, nothing else on that disks.

All that to say that I'm really surprised that deleting 50000 records takes an hour!

Are both id columns indexed? Did the transaction log need to grow in that time to accommodate a delete that big?

Yes size has its own problems, and anything small works pretty quickly most of the time, but it occurs to me that you have a real performance issue with this server, not just the problem with the delete although is your current pain point.

HTH
  David
0
 
LVL 48

Expert Comment

by:PortletPaul
ID: 39255103
"not appending"....

not sure where either of these tables come from or how they are generated, but you are treating one of these tables as a control, and the other as a source of 'new' (I think)

so instead of adding all rows to tables2, then comparing to table1, then deleting stuff...

only add the rows to the tables that are actually new

this is easy to write here - it may be dramatically inappropriate to your overall needs and/or I may be making way too many assumptions - just exploring alternates.
0
 

Author Comment

by:royjayd
ID: 39255122
Yes that is correct, one is a control table(table1) and the other is new(table2)
Table2 is the new table which is constructed from table1. First batch inserts are done in table2, then batch deletes. Table2 always needs to be in sync with table1.
Ok , so what you are saying is only add the rows to the table2 that are actually new...is it some thing like a conditional insert you are referring to?
Any SQL would help..
Thx.
0
 
LVL 31

Assisted Solution

by:awking00
awking00 earned 30 total points
ID: 39256652
>>select count(*) from
(select id from table2 except select id from table1) <<
I'm sorry. I always forget that sql server requires an alias for subqueries.
select count(*) from
(select id from table2 except select id from table1) as x
0
 
LVL 31

Expert Comment

by:awking00
ID: 39256679
You might also compare the performance of the except subquery with PortletPaul's suggested use of not exists.
select count(*)
from Table2 T2
where NOT EXISTS ( select 1 from Table1 T1 where T1.id = T2.id)
0
 

Author Comment

by:royjayd
ID: 39257011
i have used the NOT EXISTS sql as suggested and it works great.
thanks.
0

Featured Post

What Is Threat Intelligence?

Threat intelligence is often discussed, but rarely understood. Starting with a precise definition, along with clear business goals, is essential.

Join & Write a Comment

This is basically a blog post I wrote recently. I've found that SARGability is poorly understood, and since many people don't read blogs, I figured I'd post it here as an article. SARGable is an adjective in SQL that means that an item can be fou…
There have been several questions about Large Transaction Log Files in SQL Server 2008, and how to get rid of them when disk space has become critical. This article will explain how to disable full recovery and implement simple recovery that carries…
In this seventh video of the Xpdf series, we discuss and demonstrate the PDFfonts utility, which lists all the fonts used in a PDF file. It does this via a command line interface, making it suitable for use in programs, scripts, batch files — any pl…
This video gives you a great overview about bandwidth monitoring with SNMP and WMI with our network monitoring solution PRTG Network Monitor (https://www.paessler.com/prtg). If you're looking for how to monitor bandwidth using netflow or packet s…

708 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

15 Experts available now in Live!

Get 1:1 Help Now