Data warehouse archiving

Hi,

We are planning to make a Datawarehouse. For this the RAW data can imported from Flat Files, OLTP, XML data files etc etc.....and processed using ETL and then inserted onto DW.
Now suppose I need to archive datawarewouse every 2 months. that means on archiving, data for the last two months will be MOVED from the original location to some other location.

Then how can we determine while ETL process that the particular data has been archived and should not be imported again.?

I think this is a general issue and should have a standard solution.

Please guide...!

Thx!
Swaps...
LVL 10
k_swapnilAsked:
Who is Participating?
 
Tim HumphriesConnect With a Mentor DirectorCommented:
Hopefully you're planning to import data first to a staging area and then move this to the (restructured) database against which you will be reporting.

For Fact data, identify a unique transaction identifier for every row. When importing data, ignore rows that already exist in the destination database. It may be advantageous to maintain a 'transaction id lookup' table in the Staging area that holds just  the transaction ids imported, rather than referencing the final destination db for this purpose.

To assist this, when importing from a relational database, and when the transaction(s) being imported have a creation timestamp, store the highest timestamp imported in a control table that has a row for each table being imported. Your import process can then use this information to filter the data retrieved as having a timestamp greater than the last one imported.

For dimension data, there may be updates anyway, so re-importing may be entirely valid - simply update attributes that have changed (depending on the approach you have adopted towards managing change history for your dimension data).

Finally, the problem is partly one of managing data lineage. When importing, record a 'Batch no' which can be associated with the date and time of the import run. In your destination database store the batch id against each row inserted or updated. This doesn't help as regards re-importing data, but will help you identify where data reported on has originated from.
0
 
Mark BradyPrincipal Data EngineerCommented:
If the data is being archived and saved as files, you could add to the filename with something like

$filename = "yourfilename.txt"; // whatever you name your file(s)
$archive_name = "a_".$filename;

now you can archive that file(s) and when you do your imports, don't import any file that begins with an "a_"

something along those lines would work.
0
 
Raja Jegan RConnect With a Mentor SQL Server DBA & ArchitectCommented:
Ok.. for your datawarehousing you would be importing or exporting records from a RDBMS say in your case SQL Server.
If this is the case, then you can have one table created in your database to store this information in that database itself so that the next time when you start your ETL process, just check for records in that table and then start the process excluding those date or files..

If it is not with respect to date, then apply some other logic to filter out those set of data..
0
Improve Your Query Performance Tuning

In this FREE six-day email course, you'll learn from Janis Griffin, Database Performance Evangelist. She'll teach 12 steps that you can use to optimize your queries as much as possible and see measurable results in your work. Get started today!

 
k_swapnilAuthor Commented:
ok...
Now suppose I archive 2008 data to one file (db) and 2009 to another and 2010 is in current data.

now if i want to generate a report in which I need data for last 3 years, how can i do this if I have only 2010 year data in current warehouse.?

Thx!
Swaps....
0
 
dportasCommented:
Typically you would keep an aggregated version of the data online for a longer period of time so that you could produce reports against it even after the base data had been archived.
0
 
Raja Jegan RConnect With a Mentor SQL Server DBA & ArchitectCommented:
>> Now suppose I archive 2008 data to one file (db) and 2009 to another and 2010 is in current data.
now if i want to generate a report in which I need data for last 3 years, how can i do this if I have only 2010 year data in current warehouse.?

Kindly tell me the volume of data present in your database for a year say 2008, 2009 and 2010.
If it would be very less, then its advised to insert all records into a single database with tables partitioned yearwise.
Partitioning tables would help in that aspect.

If it is very large, then have all those records in different database yearwise.
Once stored yearwise, you can then create a view to merge records from multiple databases as required.
0
 
k_swapnilAuthor Commented:
Thx!
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.