Solved

Deleting Orphaned Records Taking Forever

Posted on 2011-09-05
26
335 Views
Last Modified: 2012-06-21
I have two tables that contain essentially the same information.  One is the staging table where data is transferred from the production database into the data warehouse.  The second table is the Fact table in the data warehouse.  The first step is to delete out any orphaned records in the Fact table - these being records that were deleted from the production table.  Each table has about 24,000,000 rows.  The ID column is indexed on both tables.  The query I'm using to delete is:

Delete t1
From FactTransaction t1 Left Join CleanTransaction t2 On t1.TransactionId = t2.DONATION_ID Where t2.DONATION_ID is null

Any help is appreciated.
0
Comment
Question by:JLEmlet
  • 12
  • 8
  • 3
  • +2
26 Comments
 
LVL 13

Expert Comment

by:LIONKING
ID: 36484789
Have you tried something like:

Delete FactTransaction
WHERE TransactionId  NOT IN(SELECT DISTINCT DONATION_ID FROM CleanTransaction)
0
 

Author Comment

by:JLEmlet
ID: 36484792
Yes - and that took just as long.
0
 
LVL 13

Expert Comment

by:LIONKING
ID: 36484832
How long is "long"?
You have to consider that your tables are somewhat kind of big, so it will take "some" time.
0
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 36484840
NOT EXISTS is another option, but with 24M rows, the bigger thing to note is the huge amount of transaction logging you are going to incur which may not be desired. With that said, you log has to grow to accommodate the need of this transaction.

You can probably try deleting in batches using TOP x PERCENT.

DELETE TOP (10) PERCENT
FROM FactTransaction
WHERE NOT EXISTS (
   SELECT 1
   FROM CleanTransaction
   WHERE CleanTransaction.DONATION_ID = FactTransaction.TransactionId
)
;

Hope that helps!
0
 
LVL 50

Expert Comment

by:Lowfatspread
ID: 36484849
why are you deleting fact data?

normally the data warehouse contains "more" historical data that the current production system...

abd archive/deletion strategies would revolve around data expired over a certain period...

Can you not use a CDC (change data capture) approach to processing and only process the changes from
the production system ...?

what is the source system?

Not exists can perform better than Outer Join sometimes have you tried that?

delete from fact
where not exists (select id from staging where staging.id=fact.id)

a
delete Top (5000) from fact
where not exists (select id from staging where staging.id=fact.id)

strategy maybe the best approach ... looping around until no more deletes occur...
do you have a unit of work strategy for you ETL processing?



0
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 36484995
LFS, you make some nice points about the WHY this is being done in the first place. I am glad to see my suggestions on HOW match yours. :) But I think your initial questions on the why should be considered. I see now this is a "Fact Table." I was focused on the technical aspects of the deletion. The business and data retention needs of the data warehouse is likely more important, though.
0
 
LVL 142

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 36485037
I see 2 things: indexes possibly missing for the join condition

> On t1.TransactionId = t2.DONATION_ID
are you 200% sure that is the correct join condition ? :)
0
 

Author Comment

by:JLEmlet
ID: 36485144
Thanks to everybody for trying to help.

I'm deleting records because they are no longer valid and have been deleted from the production database.  I guess an option would be to mark them as inactive somehow, but would that take less time?  I'm positive that I have indexes on both the key fields in both tables.

Suggestions?
0
 

Author Comment

by:JLEmlet
ID: 36485154
Oh and "long" is on the order of 15 hours.
0
 
LVL 50

Expert Comment

by:Lowfatspread
ID: 36485418
yes updating the records to mark them as inactive/removed will probably be quicker...
and may let you implement a more asynchronous approach to data archival/removal.

I'm deleting records because they are no longer valid and have been deleted from the production database.

may just be semantics , but unless you are the "business" owner , its up to the "business" to determine what the data/retention requirements are... Are you certain all "legal" requirements are being met?

just because the production system is pruned to a minimal working set doesn't always imply that all historical data can be removed... Is the warehouse a true warehouse or more of a departmental datamart?

you haven't clarified if you are doing the deletes in batches (as we suggest) or in just 1 huge transaction which is going to take forever.

another aspect is is this a one-off (large volume of deletions) or on-going?
what percentage of the rows will be deleted?
if its more than 5-10 percent then a re-org will be required...
and for those volumes it maybe better to consider other strategies for the data removal (ie unload,remove, load or dropping indexes , delete, re-create indexes)

0
 

Author Comment

by:JLEmlet
ID: 36485546
Thanks for responding.  I checked and there are only 108 records that need to be deleted.  I put those IDs into a temp table and created an index on the key.  Then I used the temp table to do the delete.  That is still taking a long time.  Thoughts?
0
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 36485569
Are you trying it using an EXISTS clause?

Try as we said:
DELETE
FROM FactTransaction
WHERE EXISTS (
   SELECT 1
   FROM TableWith108IDS lkup
   WHERE lkup.TransactionId = FactTransaction.TransactionId
)
;
0
 

Author Comment

by:JLEmlet
ID: 36485966
I tried that and it is still running over 2 hours.  When I just delete 1 record it takes a minute.  Why would it take so long.  There are 8 indexes and 5 foreign keys on the table.
0
Top 6 Sources for Identifying Threat Actor TTPs

Understanding your enemy is essential. These six sources will help you identify the most popular threat actor tactics, techniques, and procedures (TTPs).

 
LVL 50

Expert Comment

by:Lowfatspread
ID: 36486815
have you looked at the execution plan that is being used?

can we see it?

are there any indexed views on the table?
are there any triggers on the table?

are other processes running against the database at the same time?
where physically is the data stored?
any other active processes on the server?
what are the server specs? (and physical or virtual?)
0
 

Author Comment

by:JLEmlet
ID: 36487845
I'm attaching the execution plan for deleting one record.  Looks like the index could be causing the problem?  There are indexed views on the table, but no triggers.  no other processes against the database at the same time.  All the data is on the physical server on a SAS array.  No other processes were active when I was testing yesterday.  the server is physical Dell R10, 24GB RAM, 5TB SCSI drives.
0
 

Author Comment

by:JLEmlet
ID: 36487849
Forgot to send the attachment.  Had to change the extension from sqlplan to txt.
Execution-plan.txt
0
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 36488443
That is a pretty complicated looking execution plan for the simple statement:
Delete FactTransaction Where TransactionId = 27237445

Open in new window


*laughing* There is probably more to look at here, but one thing is it is doing a nonclustered index scan on the transactionid and it shows an implicit conversion in the predicate. What data type is transactionid? Additionally are you using a @variable?
0
 

Author Comment

by:JLEmlet
ID: 36488472
Totally agree with you, mwvisa1.  TransactionId is a varchar(50) and Donation_Id is a varchar(255).  I can certainly make Donation_Id a varchar(50) if that would speed it up.
0
 

Author Comment

by:JLEmlet
ID: 36488473
And no, not using @variable.
0
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 36488651
See what the query plan and execution time looks like on this:
Delete 
From FactTransaction 
Where TransactionId = Convert(VarChar(50), 27237445);

Open in new window

0
 

Author Comment

by:JLEmlet
ID: 36489257
Here it is.
Exec-Plan-2.txt
0
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 36489415
Did that improve the performance from your perspective? I should be in the office shortly and can look at the plan itself...
0
 

Author Comment

by:JLEmlet
ID: 36489426
I didn't give that a shot yet.
0
 
LVL 59

Expert Comment

by:Kevin Cross
ID: 36490207
I just looked at the query plan and I am seeing a warning about missing index. It appears to be on the FactTransactionDetail table. According to the report it says "The Query Processor estimates that implementing the following index could improve the query cost by 59.6374%."
0
 
LVL 59

Accepted Solution

by:
Kevin Cross earned 500 total points
ID: 36499855
Okay as an update on above, I looked at the execution plan and it has a number of Nested Loops that are trying to iterate over all the tables that have foreign keys back to the FactTransaction table you are trying to delete. Most of the tables have a Clustered Index Seek. The only one that doesn't is the FactTransactionDetail table which is currently doing a Clustered Index Scan. That is why the missing index on the TransactionId is coming up and with such a big improvement to the query performance.

As stated earlier, if you have other data associated to these rows, that would be more of a reason to set an archive bit or deactivation date versus deleted.

Hope that clarifies everything and helps!
0
 

Author Closing Comment

by:JLEmlet
ID: 36516488
Thanks for all your help.  In the end, I modified the load program.
0

Featured Post

Top 6 Sources for Identifying Threat Actor TTPs

Understanding your enemy is essential. These six sources will help you identify the most popular threat actor tactics, techniques, and procedures (TTPs).

Join & Write a Comment

In this article I will describe the Detach & Attach method as one possible migration process and I will add the extra tasks needed for an upgrade when and where is applied so it will cover all.
In this article we will get to know that how can we recover deleted data if it happens accidently. We really can recover deleted rows if we know the time when data is deleted by using the transaction log.
Polish reports in Access so they look terrific. Take yourself to another level. Equations, Back Color, Alternate Back Color. Write easy VBA Code. Tighten space to use less pages. Launch report from a menu, considering criteria only when it is filled…
This video shows how to remove a single email address from the Outlook 2010 Auto Suggestion memory. NOTE: For Outlook 2016 and 2013 perform the exact same steps. Open a new email: Click the New email button in Outlook. Start typing the address: …

757 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

20 Experts available now in Live!

Get 1:1 Help Now