Data warehouse archiving


We are planning to make a Datawarehouse. For this the RAW data can imported from Flat Files, OLTP, XML data files etc etc.....and processed using ETL and then inserted onto DW.
Now suppose I need to archive datawarewouse every 2 months. that means on archiving, data for the last two months will be MOVED from the original location to some other location.

Then how can we determine while ETL process that the particular data has been archived and should not be imported again.?

I think this is a general issue and should have a standard solution.

Please guide...!

LVL 10
Who is Participating?

[Product update] Infrastructure Analysis Tool is now available with Business Accounts.Learn More

I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Mark BradyPrincipal Data EngineerCommented:
If the data is being archived and saved as files, you could add to the filename with something like

$filename = "yourfilename.txt"; // whatever you name your file(s)
$archive_name = "a_".$filename;

now you can archive that file(s) and when you do your imports, don't import any file that begins with an "a_"

something along those lines would work.
Raja Jegan RSQL Server DBA & Architect, EE Solution GuideCommented:
Ok.. for your datawarehousing you would be importing or exporting records from a RDBMS say in your case SQL Server.
If this is the case, then you can have one table created in your database to store this information in that database itself so that the next time when you start your ETL process, just check for records in that table and then start the process excluding those date or files..

If it is not with respect to date, then apply some other logic to filter out those set of data..
Tim HumphriesDirectorCommented:
Hopefully you're planning to import data first to a staging area and then move this to the (restructured) database against which you will be reporting.

For Fact data, identify a unique transaction identifier for every row. When importing data, ignore rows that already exist in the destination database. It may be advantageous to maintain a 'transaction id lookup' table in the Staging area that holds just  the transaction ids imported, rather than referencing the final destination db for this purpose.

To assist this, when importing from a relational database, and when the transaction(s) being imported have a creation timestamp, store the highest timestamp imported in a control table that has a row for each table being imported. Your import process can then use this information to filter the data retrieved as having a timestamp greater than the last one imported.

For dimension data, there may be updates anyway, so re-importing may be entirely valid - simply update attributes that have changed (depending on the approach you have adopted towards managing change history for your dimension data).

Finally, the problem is partly one of managing data lineage. When importing, record a 'Batch no' which can be associated with the date and time of the import run. In your destination database store the batch id against each row inserted or updated. This doesn't help as regards re-importing data, but will help you identify where data reported on has originated from.

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
10 Tips to Protect Your Business from Ransomware

Did you know that ransomware is the most widespread, destructive malware in the world today? It accounts for 39% of all security breaches, with ransomware gangsters projected to make $11.5B in profits from online extortion by 2019.

k_swapnilAuthor Commented:
Now suppose I archive 2008 data to one file (db) and 2009 to another and 2010 is in current data.

now if i want to generate a report in which I need data for last 3 years, how can i do this if I have only 2010 year data in current warehouse.?

Typically you would keep an aggregated version of the data online for a longer period of time so that you could produce reports against it even after the base data had been archived.
Raja Jegan RSQL Server DBA & Architect, EE Solution GuideCommented:
>> Now suppose I archive 2008 data to one file (db) and 2009 to another and 2010 is in current data.
now if i want to generate a report in which I need data for last 3 years, how can i do this if I have only 2010 year data in current warehouse.?

Kindly tell me the volume of data present in your database for a year say 2008, 2009 and 2010.
If it would be very less, then its advised to insert all records into a single database with tables partitioned yearwise.
Partitioning tables would help in that aspect.

If it is very large, then have all those records in different database yearwise.
Once stored yearwise, you can then create a view to merge records from multiple databases as required.
k_swapnilAuthor Commented:
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today

From novice to tech pro — start learning today.