Deleting Orphaned Records Taking Forever

I have two tables that contain essentially the same information.  One is the staging table where data is transferred from the production database into the data warehouse.  The second table is the Fact table in the data warehouse.  The first step is to delete out any orphaned records in the Fact table - these being records that were deleted from the production table.  Each table has about 24,000,000 rows.  The ID column is indexed on both tables.  The query I'm using to delete is:

Delete t1
From FactTransaction t1 Left Join CleanTransaction t2 On t1.TransactionId = t2.DONATION_ID Where t2.DONATION_ID is null

Any help is appreciated.
JLEmletAsked:
Who is Participating?
 
Kevin CrossConnect With a Mentor Chief Technology OfficerCommented:
Okay as an update on above, I looked at the execution plan and it has a number of Nested Loops that are trying to iterate over all the tables that have foreign keys back to the FactTransaction table you are trying to delete. Most of the tables have a Clustered Index Seek. The only one that doesn't is the FactTransactionDetail table which is currently doing a Clustered Index Scan. That is why the missing index on the TransactionId is coming up and with such a big improvement to the query performance.

As stated earlier, if you have other data associated to these rows, that would be more of a reason to set an archive bit or deactivation date versus deleted.

Hope that clarifies everything and helps!
0
 
LIONKINGCommented:
Have you tried something like:

Delete FactTransaction
WHERE TransactionId  NOT IN(SELECT DISTINCT DONATION_ID FROM CleanTransaction)
0
 
JLEmletAuthor Commented:
Yes - and that took just as long.
0
Keep up with what's happening at Experts Exchange!

Sign up to receive Decoded, a new monthly digest with product updates, feature release info, continuing education opportunities, and more.

 
LIONKINGCommented:
How long is "long"?
You have to consider that your tables are somewhat kind of big, so it will take "some" time.
0
 
Kevin CrossChief Technology OfficerCommented:
NOT EXISTS is another option, but with 24M rows, the bigger thing to note is the huge amount of transaction logging you are going to incur which may not be desired. With that said, you log has to grow to accommodate the need of this transaction.

You can probably try deleting in batches using TOP x PERCENT.

DELETE TOP (10) PERCENT
FROM FactTransaction
WHERE NOT EXISTS (
   SELECT 1
   FROM CleanTransaction
   WHERE CleanTransaction.DONATION_ID = FactTransaction.TransactionId
)
;

Hope that helps!
0
 
LowfatspreadCommented:
why are you deleting fact data?

normally the data warehouse contains "more" historical data that the current production system...

abd archive/deletion strategies would revolve around data expired over a certain period...

Can you not use a CDC (change data capture) approach to processing and only process the changes from
the production system ...?

what is the source system?

Not exists can perform better than Outer Join sometimes have you tried that?

delete from fact
where not exists (select id from staging where staging.id=fact.id)

a
delete Top (5000) from fact
where not exists (select id from staging where staging.id=fact.id)

strategy maybe the best approach ... looping around until no more deletes occur...
do you have a unit of work strategy for you ETL processing?



0
 
Kevin CrossChief Technology OfficerCommented:
LFS, you make some nice points about the WHY this is being done in the first place. I am glad to see my suggestions on HOW match yours. :) But I think your initial questions on the why should be considered. I see now this is a "Fact Table." I was focused on the technical aspects of the deletion. The business and data retention needs of the data warehouse is likely more important, though.
0
 
Guy Hengel [angelIII / a3]Billing EngineerCommented:
I see 2 things: indexes possibly missing for the join condition

> On t1.TransactionId = t2.DONATION_ID
are you 200% sure that is the correct join condition ? :)
0
 
JLEmletAuthor Commented:
Thanks to everybody for trying to help.

I'm deleting records because they are no longer valid and have been deleted from the production database.  I guess an option would be to mark them as inactive somehow, but would that take less time?  I'm positive that I have indexes on both the key fields in both tables.

Suggestions?
0
 
JLEmletAuthor Commented:
Oh and "long" is on the order of 15 hours.
0
 
LowfatspreadCommented:
yes updating the records to mark them as inactive/removed will probably be quicker...
and may let you implement a more asynchronous approach to data archival/removal.

I'm deleting records because they are no longer valid and have been deleted from the production database.

may just be semantics , but unless you are the "business" owner , its up to the "business" to determine what the data/retention requirements are... Are you certain all "legal" requirements are being met?

just because the production system is pruned to a minimal working set doesn't always imply that all historical data can be removed... Is the warehouse a true warehouse or more of a departmental datamart?

you haven't clarified if you are doing the deletes in batches (as we suggest) or in just 1 huge transaction which is going to take forever.

another aspect is is this a one-off (large volume of deletions) or on-going?
what percentage of the rows will be deleted?
if its more than 5-10 percent then a re-org will be required...
and for those volumes it maybe better to consider other strategies for the data removal (ie unload,remove, load or dropping indexes , delete, re-create indexes)

0
 
JLEmletAuthor Commented:
Thanks for responding.  I checked and there are only 108 records that need to be deleted.  I put those IDs into a temp table and created an index on the key.  Then I used the temp table to do the delete.  That is still taking a long time.  Thoughts?
0
 
Kevin CrossChief Technology OfficerCommented:
Are you trying it using an EXISTS clause?

Try as we said:
DELETE
FROM FactTransaction
WHERE EXISTS (
   SELECT 1
   FROM TableWith108IDS lkup
   WHERE lkup.TransactionId = FactTransaction.TransactionId
)
;
0
 
JLEmletAuthor Commented:
I tried that and it is still running over 2 hours.  When I just delete 1 record it takes a minute.  Why would it take so long.  There are 8 indexes and 5 foreign keys on the table.
0
 
LowfatspreadCommented:
have you looked at the execution plan that is being used?

can we see it?

are there any indexed views on the table?
are there any triggers on the table?

are other processes running against the database at the same time?
where physically is the data stored?
any other active processes on the server?
what are the server specs? (and physical or virtual?)
0
 
JLEmletAuthor Commented:
I'm attaching the execution plan for deleting one record.  Looks like the index could be causing the problem?  There are indexed views on the table, but no triggers.  no other processes against the database at the same time.  All the data is on the physical server on a SAS array.  No other processes were active when I was testing yesterday.  the server is physical Dell R10, 24GB RAM, 5TB SCSI drives.
0
 
JLEmletAuthor Commented:
Forgot to send the attachment.  Had to change the extension from sqlplan to txt.
Execution-plan.txt
0
 
Kevin CrossChief Technology OfficerCommented:
That is a pretty complicated looking execution plan for the simple statement:
Delete FactTransaction Where TransactionId = 27237445

Open in new window


*laughing* There is probably more to look at here, but one thing is it is doing a nonclustered index scan on the transactionid and it shows an implicit conversion in the predicate. What data type is transactionid? Additionally are you using a @variable?
0
 
JLEmletAuthor Commented:
Totally agree with you, mwvisa1.  TransactionId is a varchar(50) and Donation_Id is a varchar(255).  I can certainly make Donation_Id a varchar(50) if that would speed it up.
0
 
JLEmletAuthor Commented:
And no, not using @variable.
0
 
Kevin CrossChief Technology OfficerCommented:
See what the query plan and execution time looks like on this:
Delete 
From FactTransaction 
Where TransactionId = Convert(VarChar(50), 27237445);

Open in new window

0
 
JLEmletAuthor Commented:
Here it is.
Exec-Plan-2.txt
0
 
Kevin CrossChief Technology OfficerCommented:
Did that improve the performance from your perspective? I should be in the office shortly and can look at the plan itself...
0
 
JLEmletAuthor Commented:
I didn't give that a shot yet.
0
 
Kevin CrossChief Technology OfficerCommented:
I just looked at the query plan and I am seeing a warning about missing index. It appears to be on the FactTransactionDetail table. According to the report it says "The Query Processor estimates that implementing the following index could improve the query cost by 59.6374%."
0
 
JLEmletAuthor Commented:
Thanks for all your help.  In the end, I modified the load program.
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.