conditional delete on huge number of rows

hi guys

I have two tables
Table1 and Table2

Table1
ID
1
2
3
4
..and so on

Table2
ID
2
3
4
5
...and so on

I am comparing both the tables in following way:

If Id of Table2 does not exists in Id of Table1

delte that Id from Table2.

My logic is
I am getting a collection of Table1 Ids and Table2 Ids
Looping through each Table2 id
Check if each Table2 id exists in Table1 Id collection
If not exists delete the id using this sql:


DELETE FROM table2 WHERE id = :id

There are 1 million Id in Table1 and more than 1 million Id in Table2

My overall process takes a long time (more than an hour)
what could be the reason for this?

Right now ID column is not indexed. Will indexing help?


Any sugessions will be great.
thanks
royjaydAsked:
Who is Participating?
 
Surendra NathTechnology LeadCommented:
Right now ID column is not indexed. Will indexing help?
Yes it will help a lot

and use the below code for better performance

delete T2 
from Table2 T2
where not exists ( select 1 from Table1 T1 where T1.id = T2.id)

Open in new window


if the above code is also taking more time then that might be because of the huge transaction log that is being generated in the back end.
This can be avoided by employing batch deletes as below

declare @minID int,@maxID int,@batchCount INT,@CurrentID INT
select @minID = min(id) ,@maxid = max(id) from Table2
set @batchCount = 5000
SET @CurrentID = @minId
while  @CurrentID <= @MaxID
BEGIN
delete T2 
from Table2 T2
where not exists ( select 1 from Table1 T1 where T1.id = T2.id)
and T2.ID BETWEEN  @CurrentID AND @CurrentID+@batchCount
SET @CurrentID = @CurrentID + @batchCount
END

Open in new window

0
 
royjaydAuthor Commented:
actually i clicked on 'Display Estimated Execution plan' (in SQL Server Management Studio)
It does say
Missing Index : Create nonclustered index on table2.id

do i always follow what the execution plan says ?
0
 
Surendra NathTechnology LeadCommented:
Yes and No

The answer to your is it depends.
Creating an index might increase the performance of your query that is running right now...
But it might decrease the performance of an insert or update some where else.
And also remember indexes do take disk space, and it is unwise to have a lot indexes on a table.

Hope this helps out...
0
Cloud Class® Course: CompTIA Healthcare IT Tech

This course will help prep you to earn the CompTIA Healthcare IT Technician certification showing that you have the knowledge and skills needed to succeed in installing, managing, and troubleshooting IT systems in medical and clinical settings.

 
Jesus RodriguezIT ManagerCommented:
Why don't you do this

DELETE FROM T2
WHERE ID NOT IN (SELECT ID FROM T1)

If ID is a key in both tables will work pretty fast. I got a table with more than 4 millions of records and works fine.
0
 
PortletPaulfreelancerCommented:
regarding: IN()
use 'not exists' instead - as suggested by Neo_jarvis - it will perform better than IN()
0
 
David ToddSenior DBACommented:
Hi,

Something to bear in mind is that you are dealing with a few million rows here. This might start to be more than the server can handle at once and so it may pay to batch this.

Assuming that you are referring to a production system, and that that system is in full recovery mode, and even if in simple recovery mode, that delete could be one huge transaction that is logged in total  and bloats your transaction log.

An easy way for SQL 2008R2 is

delete top 1000
from t2
where
    id not in (
        select ID from t1
        )

Obviously that will need to be put in a loop. Exit condition for the loop could be as simple as checking the @@RowsAffected system variable, or doing a count of rows still to delete.

And don't forget about putting a checkpoint command in the loop.

Sure this will slow the delete down, and maybe you could increase the top 1000 some, but do you want a batch size of a million rows to choke at 99% because growth of the log file has used all available disk, and its now rolling back. And by the way, has an exclusive table lock preventing other users doing anything ...

HTH
  David
0
 
awking00Commented:
delete from table2 where id in
(select id from table2 except select id from table1)
And definitely create an index for id in both tables.
0
 
royjaydAuthor Commented:
i was googling and i found this

Instead of DELETE do this

select * into new_table from original_table where ...
go
drop table original_table
go
sp_rename 'new_table', 'original_table'

is that a better solution ?

thanks
0
 
awking00Commented:
>>is that a better solution ?<<
Probably not. Doing it that way will cause a loss of any indexing, referential constraints, etc.
If you must,
create the newtable as select * from originaltable where ...
truncate originaltable;
insert into originaltable select * from newtable;
drop newtable;
But I don't see how that would improve performance over the proposed solutions.
0
 
royjaydAuthor Commented:
i am not sure also  but the stats are pretty intresting

<<For a 24 million row table, this operation can run in less than two minutes on a standard modern server. Be aware you'd need to recreate indexes and you could break foreign keys>>>

Here is the discussion

http://social.msdn.microsoft.com/Forums/en-US/transactsql/thread/ecd6c649-3613-460d-9fec-2aa1483256e8/
0
 
David ToddSenior DBACommented:
Hi,

>> select into newtable
If what you want left is a small fraction of the original then this might be worth while - give the hassle of re-creating indexes and referential integrity.

Otherwise I suggest the delete top 10000 I mentioned before ...

Regards
  David
0
 
awking00Commented:
<<For a 24 million row table, this operation can run in less than two minutes on a standard modern server.>>
How do you know? Are indexes being used? Have statistics been gathered? How many processors? How much memory? Has logging been disabled?

<< Be aware you'd need to recreate indexes and you could break foreign keys>>>
Just as I stated.

How long does it take to run the following query?
select count(*) from
(select id from table2 except select id from table1)
0
 
royjaydAuthor Commented:
>>select count(*) from
(select id from table2 except select id from table1)

Is the sql correct ?
I get an error .. Incorrect syntax near ')'.

however this is some sample data i am running on:

select count(*) from table1
>1112653

select count(*) from table2
>1112663


select COUNT(*)
from table2 where
    ID not in (
        select ID from table1  )

>10

thx
0
 
PortletPaulfreelancerCommented:
if the difference is just 10 then creating a new table of 1.1m rows is not warranted

& NOT EXISTS will generally outperform use of IN()

I'm still recommending the first reply: ID: 39249210, did you try any of it?

-- e.g. try this instead of IN()
select *
from Table2 T2
where NOT EXISTS ( select 1 from Table1 T1 where T1.id = T2.id)
0
 
royjaydAuthor Commented:
PortletPaul,
thanks.
few clarifications

>>if the difference is just 10 then creating a new table of 1.1m rows is not warranted
the difference is 10 now but it could be anything, it could go up to 1000 or 100,000 or a million or more. We are anticipating huge difference as our application goes live into production.

I have also created index on Id.
we are using a batch process to delete 50000 rows in one batch. The batch process is controlled by a framework called (spring-batch from spring framework)

This is my sql right now

DELETE TOP (50000)
from table2 where
    ID not in (
        select ID from table1  )

the performance still not good , takes around an hour
but I will try your suggestion with not exist
DELETE TOP (50000)
from Table2 T2
where NOT EXISTS ( select 1 from Table1 T1 where T1.id = T2.id)

thanks
0
 
PortletPaulfreelancerCommented:
>>We are anticipating huge difference as our application goes live into production.
mmmmmm errrrrggghhhhhhhh

have you considered "not appending" logic so as to avoid a massive delete?
(just a thought)

is this a one-time thing or something that may/will occur often?
0
 
royjaydAuthor Commented:
This delete batch operation will run once an hour so possibly 24 times a day.

>>have you considered "not appending" logic so as to avoid a massive delete?
can you explain what you mean?

thanks.
0
 
David ToddSenior DBACommented:
Hi,

Does your transaction log file share the same physical disks as the data file?

I suggest that you look into providing another mirrored pair of disks, and move the transaction log file to those disks, and keep those disks for just the log files - don't partition them, nothing else on that disks.

All that to say that I'm really surprised that deleting 50000 records takes an hour!

Are both id columns indexed? Did the transaction log need to grow in that time to accommodate a delete that big?

Yes size has its own problems, and anything small works pretty quickly most of the time, but it occurs to me that you have a real performance issue with this server, not just the problem with the delete although is your current pain point.

HTH
  David
0
 
PortletPaulfreelancerCommented:
"not appending"....

not sure where either of these tables come from or how they are generated, but you are treating one of these tables as a control, and the other as a source of 'new' (I think)

so instead of adding all rows to tables2, then comparing to table1, then deleting stuff...

only add the rows to the tables that are actually new

this is easy to write here - it may be dramatically inappropriate to your overall needs and/or I may be making way too many assumptions - just exploring alternates.
0
 
royjaydAuthor Commented:
Yes that is correct, one is a control table(table1) and the other is new(table2)
Table2 is the new table which is constructed from table1. First batch inserts are done in table2, then batch deletes. Table2 always needs to be in sync with table1.
Ok , so what you are saying is only add the rows to the table2 that are actually new...is it some thing like a conditional insert you are referring to?
Any SQL would help..
Thx.
0
 
awking00Commented:
>>select count(*) from
(select id from table2 except select id from table1) <<
I'm sorry. I always forget that sql server requires an alias for subqueries.
select count(*) from
(select id from table2 except select id from table1) as x
0
 
awking00Commented:
You might also compare the performance of the except subquery with PortletPaul's suggested use of not exists.
select count(*)
from Table2 T2
where NOT EXISTS ( select 1 from Table1 T1 where T1.id = T2.id)
0
 
royjaydAuthor Commented:
i have used the NOT EXISTS sql as suggested and it works great.
thanks.
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.