What are the Reasons for NOT joining TWO fact tables? Need to educate others...
Posted on 2013-01-29
Our ETL guy, who is not familiar with Dimensional Modeling, insists on putting the data into a new fact table even though we already have a fact table with the same grain and the same associated dimensions. His reasoning is that the current table is getting too large and he wants to scale it down. The current table has about 1 million rows; the new data will arrive at a rate of about 10-20 million rows a day. It's clickstream data.
I would like to educate others and explain why joining two fact tables is not good design. I have written the following, but I'm looking for additional feedback. We do use SSAS, and some tricks could be done there, but I don't believe this is a good design. I'd like to hear some of your answers to help me educate our management and our ETL person...
Here is my current email...
"The whole point of a data warehouse is to analyze the data quickly and report on it. It is de-normalized, contains redundant data, and is optimized for querying. An OLTP (transactional) database, by contrast, is normalized (e.g., third normal form) so that we eliminate data redundancy, maintain data integrity, and optimize for transactional inserts and updates.
In Dimensional Modeling we do not want to join fact tables. The reporting we need compares different events: downloads to views, opens, firstteasers, etc. If we did it Sam's way, the data would sit in two different tables, so every such comparison would require a join between fact tables, which is not good design. Automation will be difficult, queries will have to be written by hand, and querying will be slow. It misses the whole point of having a data warehouse.
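To make the point concrete, here is a minimal sketch (using Python's built-in sqlite3, with made-up table, column, and event names, not our actual schema) of how a single fact table at one grain answers a cross-event question with one conditional aggregation. In the two-table design, the same question would require a full outer join, or a drill-across merge of two separate aggregations, on every shared dimension key:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# One fact table at a single grain: one row per (date, message, event type).
# An event-type attribute (or dimension) distinguishes views from downloads.
cur.execute("""CREATE TABLE FactEvent
               (date_key INT, message_key INT, event_type TEXT, event_count INT)""")
cur.executemany("INSERT INTO FactEvent VALUES (?,?,?,?)", [
    (20130129, 1, "view",     100),
    (20130129, 1, "download",  25),
    (20130129, 2, "view",      80),   # message 2 has no downloads at all
])

# "Compare downloads to views" is a single GROUP BY with conditional sums;
# messages with zero downloads fall out naturally, with no outer join needed.
rows = cur.execute("""
    SELECT message_key,
           SUM(CASE WHEN event_type = 'view'     THEN event_count ELSE 0 END) AS views,
           SUM(CASE WHEN event_type = 'download' THEN event_count ELSE 0 END) AS downloads
    FROM FactEvent
    GROUP BY message_key
    ORDER BY message_key
""").fetchall()
print(rows)  # [(1, 100, 25), (2, 80, 0)]
```

Split the rows into a FactView and a FactDownload table and the same report needs a full outer join on the date and message keys (an inner or left join silently drops or skews messages like message 2 that appear in only one table), which is exactly the automation and correctness headache described above.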
As far as scalability goes, the total size of the database stays the same; the only difference is that one table will not grow as large. Since we are not querying FactMessageEvent, I'm not sure why the ETL would be slower. If there are issues with the ETL, we can identify where the bottlenecks are and come up with solutions to address them, such as reviewing the indexes, dropping indexes for loads, optimizing queries, examining the optimizer plans, partitioning the table, or archiving some of the data. Is the optimizer doing full table scans? I suggest we do this instead of adopting a design that is not in accordance with Dimensional Modeling."
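As one example of the load-side tuning mentioned above, here is a minimal sketch (again Python's sqlite3, with hypothetical table and index names) of the drop-index / bulk-load / rebuild-index pattern. The engine then builds the index once over the full table instead of maintaining it row by row during the load; the same idea carries over to SQL Server loads:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE FactEvent (date_key INT, message_key INT, event_count INT)")
cur.execute("CREATE INDEX ix_fact_date ON FactEvent (date_key)")

# A large hypothetical daily batch.
batch = [(20130129, m, 1) for m in range(100_000)]

# Drop the index, bulk-load inside a single transaction, then rebuild it.
cur.execute("DROP INDEX ix_fact_date")
with con:  # one commit for the whole batch
    cur.executemany("INSERT INTO FactEvent VALUES (?,?,?)", batch)
cur.execute("CREATE INDEX ix_fact_date ON FactEvent (date_key)")

count = cur.execute("SELECT COUNT(*) FROM FactEvent").fetchone()[0]
print(count)  # 100000
```

This is only an illustration of the pattern; the actual win depends on the engine, the number of indexes, and the batch size, which is why profiling the bottleneck first is the right order of operations.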
Any additions to the above are welcome! Thanks!