Avatar of JLEmlet
JLEmlet

asked on

Deleting Orphaned Records Taking Forever

I have two tables that contain essentially the same information. One is the staging table where data is transferred from the production database into the data warehouse. The second is the Fact table in the data warehouse. The first step is to delete any orphaned records from the Fact table - that is, records that were deleted from the production table. Each table has about 24,000,000 rows. The ID column is indexed on both tables. The query I'm using to delete is:

Delete t1
From FactTransaction t1
Left Join CleanTransaction t2 On t1.TransactionId = t2.DONATION_ID
Where t2.DONATION_ID is null

Any help is appreciated.
Avatar of Member_2_861731
Member_2_861731

Have you tried something like:

Delete FactTransaction
WHERE TransactionId  NOT IN(SELECT DISTINCT DONATION_ID FROM CleanTransaction)
Avatar of JLEmlet
JLEmlet

ASKER

Yes - and that took just as long.
How long is "long"?
You have to consider that your tables are fairly large, so it will take "some" time.
NOT EXISTS is another option, but with 24M rows the bigger thing to note is the huge amount of transaction logging you are going to incur, which may not be desired. With that said, your log has to grow to accommodate this transaction.

You can probably try deleting in batches using TOP x PERCENT.

DELETE TOP (10) PERCENT
FROM FactTransaction
WHERE NOT EXISTS (
   SELECT 1
   FROM CleanTransaction
   WHERE CleanTransaction.DONATION_ID = FactTransaction.TransactionId
)
;

Hope that helps!
why are you deleting fact data?

normally the data warehouse contains "more" historical data than the current production system...

and archive/deletion strategies would revolve around data expired over a certain period...

Can you not use a CDC (change data capture) approach to processing and only process the changes from
the production system ...?

what is the source system?

NOT EXISTS can sometimes perform better than an outer join - have you tried that?

delete from fact
where not exists (select id from staging where staging.id=fact.id)

a
delete Top (5000) from fact
where not exists (select id from staging where staging.id=fact.id)

strategy may be the best approach ... looping around until no more rows are deleted (a sketch of such a loop follows)
do you have a unit of work strategy for your ETL processing?
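A minimal sketch of such a batched loop, reusing the table/column names from your original query (the batch size is arbitrary and this is untested here; it assumes SQL Server 2005 or later for DELETE TOP):

-- Sketch: delete orphans in small batches to keep each transaction (and log growth) manageable
DECLARE @rows INT;
SET @rows = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (5000)
    FROM FactTransaction
    WHERE NOT EXISTS (
        SELECT 1
        FROM CleanTransaction
        WHERE CleanTransaction.DONATION_ID = FactTransaction.TransactionId
    );
    SET @rows = @@ROWCOUNT;  -- loop ends once no more orphaned rows are found
END;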



LFS, you make some nice points about the WHY this is being done in the first place. I am glad to see my suggestions on HOW match yours. :) But I think your initial questions on the why should be considered. I see now this is a "Fact Table." I was focused on the technical aspects of the deletion. The business and data retention needs of the data warehouse are likely more important, though.
I see 2 things: indexes possibly missing for the join condition

> On t1.TransactionId = t2.DONATION_ID
are you 200% sure that is the correct join condition ? :)
Avatar of JLEmlet

ASKER

Thanks to everybody for trying to help.

I'm deleting records because they are no longer valid and have been deleted from the production database. I guess an option would be to mark them as inactive somehow, but would that take less time? I'm positive that I have indexes on the key fields in both tables.

Suggestions?
Avatar of JLEmlet

ASKER

Oh and "long" is on the order of 15 hours.
yes, updating the records to mark them as inactive/removed will probably be quicker...
and may let you implement a more asynchronous approach to data archival/removal (rough sketch below).
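Something along these lines - a sketch only, since it assumes you add an IsDeleted flag column to the fact table, which is not in your current schema:

-- Sketch: soft-delete orphans instead of physically removing them
-- Assumes a new column, e.g.: ALTER TABLE FactTransaction ADD IsDeleted BIT NOT NULL DEFAULT 0
UPDATE f
SET f.IsDeleted = 1
FROM FactTransaction f
WHERE f.IsDeleted = 0
  AND NOT EXISTS (
      SELECT 1
      FROM CleanTransaction c
      WHERE c.DONATION_ID = f.TransactionId
  );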

I'm deleting records because they are no longer valid and have been deleted from the production database.

may just be semantics, but unless you are the "business" owner, it's up to the "business" to determine what the data/retention requirements are... Are you certain all "legal" requirements are being met?

just because the production system is pruned to a minimal working set doesn't always imply that all historical data can be removed... Is the warehouse a true warehouse or more of a departmental datamart?

you haven't clarified if you are doing the deletes in batches (as we suggested) or in just one huge transaction, which is going to take forever.

another aspect: is this a one-off (large volume of deletions) or on-going?
what percentage of the rows will be deleted?
if it's more than 5-10 percent then a re-org will be required...
and for those volumes it may be better to consider other strategies for the data removal (i.e. unload, remove, load; or drop indexes, delete, re-create indexes - rough sketch below)
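A rough illustration of the drop/delete/re-create route - the index name below is a placeholder, so script out your real nonclustered index definitions before dropping anything:

-- Sketch only: drop nonclustered indexes, do the bulk delete, then rebuild
-- IX_FactTransaction_TransactionId is a hypothetical index name
DROP INDEX IX_FactTransaction_TransactionId ON FactTransaction;

DELETE FROM FactTransaction
WHERE NOT EXISTS (
    SELECT 1
    FROM CleanTransaction
    WHERE CleanTransaction.DONATION_ID = FactTransaction.TransactionId
);

CREATE NONCLUSTERED INDEX IX_FactTransaction_TransactionId
    ON FactTransaction (TransactionId);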

Avatar of JLEmlet

ASKER

Thanks for responding.  I checked and there are only 108 records that need to be deleted.  I put those IDs into a temp table and created an index on the key.  Then I used the temp table to do the delete.  That is still taking a long time.  Thoughts?
Are you trying it using an EXISTS clause?

Try as we said:
DELETE
FROM FactTransaction
WHERE EXISTS (
   SELECT 1
   FROM TableWith108IDS lkup
   WHERE lkup.TransactionId = FactTransaction.TransactionId
)
;
Avatar of JLEmlet

ASKER

I tried that and it has been running for over 2 hours. When I just delete 1 record it takes a minute. Why would it take so long? There are 8 indexes and 5 foreign keys on the table.
have you looked at the execution plan that is being used?

can we see it?

are there any indexed views on the table?
are there any triggers on the table? (a quick catalog check for these, and for cascading foreign keys, is sketched below)

are other processes running against the database at the same time?
where physically is the data stored?
any other active processes on the server?
what are the server specs? (and physical or virtual?)
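For the indexed view / trigger questions (and to rule out cascading foreign keys), a quick catalog check - this assumes SQL Server 2008 or later and that the table lives in the dbo schema:

-- Triggers defined on the fact table
SELECT name
FROM sys.triggers
WHERE parent_id = OBJECT_ID('dbo.FactTransaction');

-- Foreign keys referencing the fact table, and whether they cascade deletes
SELECT name, delete_referential_action_desc
FROM sys.foreign_keys
WHERE referenced_object_id = OBJECT_ID('dbo.FactTransaction');

-- Indexed views that depend on the fact table
SELECT v.name
FROM sys.views v
JOIN sys.sql_expression_dependencies d ON d.referencing_id = v.object_id
WHERE d.referenced_id = OBJECT_ID('dbo.FactTransaction')
  AND OBJECTPROPERTY(v.object_id, 'IsIndexed') = 1;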
Avatar of JLEmlet

ASKER

I'm attaching the execution plan for deleting one record. Looks like the index could be causing the problem? There are indexed views on the table, but no triggers. No other processes run against the database at the same time. All the data is on the physical server on a SAS array. No other processes were active when I was testing yesterday. The server is a physical Dell R10 with 24GB RAM and 5TB SCSI drives.
Avatar of JLEmlet

ASKER

Forgot to send the attachment.  Had to change the extension from sqlplan to txt.
Execution-plan.txt
That is a pretty complicated looking execution plan for the simple statement:
Delete FactTransaction Where TransactionId = 27237445



*laughing* There is probably more to look at here, but one thing is that it is doing a nonclustered index scan on TransactionId and it shows an implicit conversion in the predicate. What data type is TransactionId? Additionally, are you using a @variable?
Avatar of JLEmlet

ASKER

Totally agree with you, mwvisa1.  TransactionId is a varchar(50) and Donation_Id is a varchar(255).  I can certainly make Donation_Id a varchar(50) if that would speed it up.
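If you do shrink it, the change itself is simple - but confirm first that no existing DONATION_ID value is longer than 50 characters, and note that any index on the column would have to be dropped and re-created around the change (NOT NULL below is an assumption; keep the column's current nullability):

-- Sketch: only safe if MAX(LEN(DONATION_ID)) <= 50 and any index on the column is handled separately
ALTER TABLE CleanTransaction ALTER COLUMN DONATION_ID VARCHAR(50) NOT NULL;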
Avatar of JLEmlet

ASKER

And no, not using @variable.
See what the query plan and execution time look like on this:
Delete 
From FactTransaction 
Where TransactionId = Convert(VarChar(50), 27237445);


Avatar of JLEmlet

ASKER

Here it is.
Exec-Plan-2.txt
Did that improve the performance from your perspective? I should be in the office shortly and can look at the plan itself...
Avatar of JLEmlet

ASKER

I didn't give that a shot yet.
I just looked at the query plan and I am seeing a warning about a missing index. It appears to be on the FactTransactionDetail table. According to the report, "The Query Processor estimates that implementing the following index could improve the query cost by 59.6374%."
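The missing-index details in the plan XML list the exact key and include columns; purely as a hypothetical illustration of the shape (the column here is a guess - take the real definition from the plan):

-- Hypothetical: create the index the optimizer is asking for on FactTransactionDetail,
-- replacing the key/include columns with the ones listed in the missing-index hint
CREATE NONCLUSTERED INDEX IX_FactTransactionDetail_TransactionId
    ON FactTransactionDetail (TransactionId);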
ASKER CERTIFIED SOLUTION
Avatar of Kevin Cross
Kevin Cross
Avatar of JLEmlet

ASKER

Thanks for all your help.  In the end, I modified the load program.