mooriginal

asked on

Archive data from 2 tables

I have 2 tables in a 3rd-party product, and the 3rd party has no scripts available to help archive off the data. The database is SQL Server 2005.

The database in question is about 1 TB in size, and a request has come in to archive about 6 months of data.

Can someone help with a script to archive the data by date, going back 6 months from the run date?

The field that stores the date is:
notification_date_time
The archive process should run in chunks of one month at a time to minimise the impact on the server.


Please see the relationship map attachment.

Thanks
Archive-process-tableDiag.jpg
Jared_S

Why not just schedule a full backup during a slow time of day, TEST RESTORING YOUR BACKUP, and then talk someone else into being the person who deletes data from the production database?

If you wanted to pull off 6 months of data from these two tables, you could do it with something like this:

SELECT * INTO archive_notification_entries
FROM notification_entries
WHERE notification_date_time BETWEEN DATEADD(MONTH, -6, GETDATE()) AND GETDATE()

SELECT a.* INTO archive_attachments
FROM attachments a
INNER JOIN archive_notification_entries n ON a.id = n.id



The new archive tables could then be backed up.
You could run a delete against your production tables (if you dare), using INTERSECT, so that if someone had updated a field after your 6-month snapshot, you wouldn't delete that row from production. Once backed up, the archive tables could be dropped.

BUT you would have to be very careful that the backup was validated properly. You can use row counts and checksums to help validate the data, but the best way I know of is to actually restore the backup.
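If you want a quick sanity check short of a full restore, something like this will at least verify the backup media and page checksums (the database name and path below are placeholders):

BACKUP DATABASE YourThirdPartyDb
TO DISK = 'E:\Backups\YourThirdPartyDb_pre_archive.bak'
WITH CHECKSUM, INIT

RESTORE VERIFYONLY
FROM DISK = 'E:\Backups\YourThirdPartyDb_pre_archive.bak'
WITH CHECKSUM

A full test restore to another server is still the stronger proof.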

One more thing: just removing the records won't free up the allocated space.
You'll have to shrink the database to do that.
Your indexes will probably be fragmented after removing 6 months of data and shrinking the database, so you should rebuild them as well.
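Roughly like this, once the deletes are done (the logical data file name and target size are placeholders):

DBCC SHRINKFILE (N'YourThirdPartyDb_Data', 900000)   -- target size in MB

ALTER INDEX ALL ON notification_entries REBUILD
ALTER INDEX ALL ON attachments REBUILD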
mooriginal (ASKER)

OK, that sounds like a good plan. My thoughts:

I was debating using bcp to export the tables to flat files and store them in an archive folder.
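Something along these lines is what I had in mind (the database, server, and path names are just placeholders; -n keeps bcp's native format so the files can be reloaded faithfully later):

bcp "YourThirdPartyDb.dbo.archive_notification_entries" out "D:\Archive\archive_notification_entries.dat" -S YOURSERVER -T -n
bcp "YourThirdPartyDb.dbo.archive_attachments" out "D:\Archive\archive_attachments.dat" -S YOURSERVER -T -n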

The delete can be done out of hours, or when users are off the system during a maintenance window, so I wouldn't need to do anything fancy like intersects.

So what would the syntax be for the DELETE, then?

I can run a DBCC SHRINKFILE as part of the process and then rebuild the indexes during the maintenance window.

This process is going to be very infrequent and would only be run by my team.
Unfortunately the 3rd party has been less than helpful.

The intersect will let you compare the rows of the production table to the rows in the archive table for each column. The chances of someone changing a record between your export and your delete are probably low, but allowing for it isn't a bad idea.

This will delete all the rows in production that match perfectly in the archive table:

DELETE FROM notification_entries
WHERE ID IN
(SELECT a.ID FROM
(SELECT * FROM notification_entries
INTERSECT
SELECT * FROM archive_notification_entries) a)

DELETE FROM attachments
WHERE ID IN
(SELECT a.ID FROM
(SELECT * FROM attachments
INTERSECT
SELECT * FROM archive_attachments) a)
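If you want to limit the impact on the server, you could also run the deletes in chunks rather than one big statement. A rough sketch (the 5000 batch size is arbitrary, and this simpler version drops the INTERSECT check):

DECLARE @rows int
SET @rows = 1

WHILE @rows > 0
BEGIN
    DELETE TOP (5000)
    FROM notification_entries
    WHERE ID IN (SELECT ID FROM archive_notification_entries)

    SET @rows = @@ROWCOUNT
END

The same loop pattern works for the attachments table.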
Scott Pletcher
I wouldn't try to join tables with that many rows if it's possible to avoid it, because it will take a LONG time.

Determine your specific archive datetime, get the corresponding ID [since the table has the "standard", but often dopey, ID key rather than a datetime key], then put triggers on the tables to prevent any activity against rows on or before that ID.
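A rough sketch of that kind of protective trigger is below (the trigger name and cutoff value are placeholders); you would drop or disable it before running the archive delete itself:

CREATE TRIGGER trg_notification_entries_block_archive
ON notification_entries
AFTER UPDATE, DELETE
AS
BEGIN
    DECLARE @archive_cutoff_id int
    SET @archive_cutoff_id = 1000000  -- the ID that corresponds to your archive datetime

    IF EXISTS (SELECT 1 FROM deleted WHERE ID <= @archive_cutoff_id)
    BEGIN
        RAISERROR('Rows at or before the archive cutoff ID are frozen for archiving.', 16, 1)
        ROLLBACK TRANSACTION
    END
END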

Is there an index on notification_date_time?


Is the data itself 1TB, or does that include the log size?

What % of the total table size is 6 mos worth of data?


How big is the log?

What are the log file parameters: total size, free space, filegrowth amount?

The biggest time gain for the DELETEs is to make sure the log space is preallocated, and thus preformatted, before you start.
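For example, something along these lines, run before the deletes (the database/file names and the size are placeholders -- size it to what you expect the deletes to log):

ALTER DATABASE YourThirdPartyDb
MODIFY FILE (NAME = N'YourThirdPartyDb_log', SIZE = 50GB)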
Jeez, the server I've just got access to is a mess.

1 data drive 5 TB
DB is 1.3 TB
Log - 333 MB
All files sit on the same drive.

System DBs on C drive

Attachments table is 900 GB in size, row count 3,615,437
Notification_entries is 452 GB in size, row count 3,685,889

No index on notification_date_time

I couldn't even run a SELECT to show the number of entries going back 6 months, as it just didn't return.

The DB is set to autogrow by 1 MB!!
Same for the log file.
[I have since changed this, as a minor change, to 1 GB for data growth.]

PK_Notification_entries on ID - clustered index
PK_Attachments on ID - clustered index
I've been running:

SELECT *
FROM notification_entries
WHERE notification_date_time between dateadd(m,-1,getdate()) and getdate()

To see if I can return a month's worth of data - so far the query has been running for 2 hrs and hasn't finished.

Can anyone suggest a quicker way of doing this query?
Yes.  It's a bit of a pain, but you need to do a binary search with the ids and a date check to determine the starting and ending ids for the date range desired, then use those id values to do the query itself.

If you need some sample code for that, just let me know.
The way the DB is now, the query with a WHERE on dates requires the entire table -- all 452 GB of it -- to be scanned. The WHERE has to be on ID, since that is what is indexed.
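In rough outline, the binary-search idea looks like this (only a sketch, not the final script; the variable names, the int ID type, and the one-month window are assumptions, and it relies on notification_date_time rising along with ID):

DECLARE @start_date     datetime,
        @lo             int,
        @hi             int,
        @mid            int,
        @mid_date       datetime,
        @search_low_ID  int

SET @start_date = DATEADD(MONTH, -1, GETDATE())

SELECT @lo = MIN(ID) - 1, @hi = MAX(ID)
FROM notification_entries

WHILE @lo < @hi
BEGIN
    SET @mid = @lo + (@hi - @lo + 1) / 2
    SET @mid_date = NULL

    -- Single-row backward seek on the clustered PK: cheap even on a 452 GB table.
    SELECT TOP (1) @mid_date = notification_date_time
    FROM notification_entries
    WHERE ID <= @mid
    ORDER BY ID DESC

    IF @mid_date IS NULL OR @mid_date < @start_date
        SET @lo = @mid        -- everything up to @mid is still before the start date
    ELSE
        SET @hi = @mid - 1    -- the start date falls at or before @mid
END

SET @search_low_ID = @lo + 1  -- first ID that can hold data on/after @start_date

You can then query on ID >= @search_low_ID (plus the date filter) instead of scanning the whole table, and repeat the search for the end of the range.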
Yes, please send over a better way of doing this, as it's still running and, as you say, not running effectively.
ASKER CERTIFIED SOLUTION
Scott Pletcher
Thanks Scott

Excellent bit of code - I've set this to run.

The other code didn't return results; after 6 hrs it failed because there were no more locks available.

As I said the server is a mess.

I'm closing the call, because the department has now decided to get the 3rd party in to clear all current data and then start afresh,
so at least this process has allowed them to address the underlying non-management of the server.
A fair bit of work went into the script, so good effort.
You should have gotten a result set back from the code above in a reasonable time, yes?

If you need to, you can add NOLOCK to the final SELECT, to avoid locking issues:


SELECT *
FROM notification_entries WITH (NOLOCK) --<<--
WHERE
    ID >= @search_low_ID AND
    notification_date_time >= @notification_date_time_start AND
    notification_date_time < DATEADD(SECOND, 1, @notification_date_time_end)
ORDER BY
    ID
You could likely leave the data in place if you just had the proper clustered key -- likely notification_date_time -- but I can't be sure without seeing all the SQL metadata about the table (and it depends on how long SQL has been running continuously to accumulate that metadata).
Thanks Scott - I was getting results back and waiting on the script to finish, but there was no need to see the result set after the decision was made.