What are the Reasons for NOT joining TWO fact tables? Need to educate others...

Posted on 2013-01-29
Last Modified: 2016-02-18
Our ETL guy, who is not familiar with dimensional modeling, insists on putting data into a new fact table even though we have a current fact table with the same grain and the same associated dimensions. His reasoning is that the current table is getting too large and he wants to scale it down. The current table has about 1 million rows; the new table will receive about 10-20 million rows a day. It's clickstream data.

I would like to explain why joining two fact tables is not good design. I have written the following, but I'm looking for additional feedback. We do use SSAS, and some tricks could be done there, but I don't believe this is a good design. I'd like to hear some of your answers to help me educate our management and our ETL person...

"The whole point of a data warehouse is to quickly analyze the data and report on it.  It is de-normalized, contains redundant data, and is optimized for data querying.  In OLTP database, transactional Databases, the database is normalized, i.e. Third-Normal form, so we eliminate data redundancy, keep data integrity, and optimize it for transaction loading and updating.

In dimensional modeling we do not want to join fact tables. The reporting needed is to compare different events: downloads to views, opens, firstteasers, etc. If we did it Sam's way, the data would be in two different tables, requiring a join, which is not good. Automation will be difficult, manual queries would have to be written, and querying would be slow. It misses the whole point of having a data warehouse.

As far as scalability goes, the size of the database stays the same. The only difference is that one table will not grow as large. Since we are not querying FactMessageEvent, I'm not sure why the ETL would be slower. If there are issues with the ETL, we can discuss where the bottlenecks are and come up with solutions to address them: looking at the indexes, dropping indexes for loads, optimizing queries, looking at optimizer plans, partitioning the table, archiving some of the data. Is the optimizer doing table scans? Etc. I suggest we do this instead of coming up with a design that is not in accordance with dimensional modeling."
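To make the "dropping indexes for loads" option concrete, here is a minimal T-SQL sketch of disabling and rebuilding a nonclustered index around a bulk load. All names except FactMessageEvent (mentioned above) are hypothetical, and this assumes a SQL Server database.

```sql
-- Disable nonclustered indexes before the bulk load so inserts
-- don't pay index-maintenance costs row by row.
ALTER INDEX IX_FactMessageEvent_DateKey ON dbo.FactMessageEvent DISABLE;

-- ... bulk load here, e.g.:
-- BULK INSERT dbo.FactMessageEvent FROM '\\share\events.dat' WITH (TABLOCK);

-- Rebuild afterwards in one pass over the loaded data.
ALTER INDEX IX_FactMessageEvent_DateKey ON dbo.FactMessageEvent REBUILD;
```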

Any additions to the above are welcome! Thanks!
Question by:Bodhi108

Expert Comment

by:Daniel Wilson
ID: 38833651
ETL is not my strongest area, but I'm not sure I like either proposal.

His idea that requires the join between the 1 million record table and the table that will soon have BILLIONS of records isn't good.  That join will CRAWL.

Now, do you really want to add 10-20 million records a day?  Can you summarize those and add a lot fewer per day and still get the answers you want?

The questions your 1 million record table can answer will slow WAY down if that table grows to a billion records.  If you cannot summarize the clickstream data, can you run a new fact table that has everything it needs and is HUGE ... along with your current one that is not so big?  Ask the new one only the questions it is needed for.  Ask the current set of questions to the current table.  They will partially duplicate each other ... but ... this isn't OLTP.
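If summarizing turns out to be possible, the roll-up Daniel describes could look something like this. The table and column names are illustrative only, not from the thread, and @LoadDateKey is an assumed parameter.

```sql
-- Hypothetical daily roll-up of raw clickstream events into an
-- aggregate fact with far fewer rows per day.
INSERT INTO dbo.FactClickDaily (DateKey, PageKey, EventTypeKey, EventCount)
SELECT DateKey, PageKey, EventTypeKey, COUNT(*)
FROM dbo.FactClickstream
WHERE DateKey = @LoadDateKey          -- summarize one load day at a time
GROUP BY DateKey, PageKey, EventTypeKey;
```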

Author Comment

ID: 38833739
Yes, I need to ask a few more questions to the users.

According to the users, it can't be summarized in order to do the analysis.  But, what we can do is delete rows after a month and move the data to a summarized table.

Once it is in a cube, I believe it will be fast but it will take longer to process the cube.

The other solution is convincing them to keep the current table and create a new table at a summarized level.  I've already spoken to them about this and they seem to need it at this level.  I can ask a few more questions to see if we can get it to a summarized level.

Author Comment

ID: 38834211
So I just found out that we would be adding at the most 1-3 million records a week, not 10-20 million records a day.

So, basically, the answer for not joining fact tables is that the join would be slow. I thought there were additional reasons for not joining them, because once the data is in the cube, querying would be fast; only the processing would be long.


Assisted Solution

ValentinoV earned 450 total points
ID: 38834241
If the most-detailed level is only needed for the most recent period, like one month or perhaps one year, then I'd go for a solution with two tables: one as you currently have (which contains the summarized data if I understood you well?) and one with the details of the most recent period.  Part of your ETL would then take care of summarizing the outdated data into the summary table and removing the outdated data from the details table.  The details table would only be used when they are really needed, all other queries would use the summary table.  Doing that will still keep your reports fast and impact on the cube load would be reduced to a minimum.
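ValentinoV's month-end ETL step (summarize outdated detail, then purge it) could be sketched as below. Everything here is hypothetical: the table names, the DimDate MonthKey column, and the @CutoffDateKey parameter.

```sql
-- Hypothetical month-end roll-off: summarize detail older than the
-- cutoff into the summary fact, then remove it from the detail fact.
BEGIN TRANSACTION;

INSERT INTO dbo.FactEventSummary (MonthKey, EventTypeKey, EventCount)
SELECT d.MonthKey, f.EventTypeKey, COUNT(*)
FROM dbo.FactEventDetail AS f
JOIN dbo.DimDate AS d ON d.DateKey = f.DateKey
WHERE f.DateKey < @CutoffDateKey
GROUP BY d.MonthKey, f.EventTypeKey;

DELETE FROM dbo.FactEventDetail
WHERE DateKey < @CutoffDateKey;

COMMIT;
```

Wrapping both statements in one transaction keeps the summary and detail tables consistent if the purge fails partway.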

However, if you do need to keep all details going back years, I believe you should look into partitioning, possibly using the sliding window technique.  I'm not an expert on that, but there are some good whitepapers available:
[MSDN] Strategies for Partitioning Relational Data Warehouses in Microsoft SQL Server
[MSDN] Partitioned Tables and Indexes in SQL Server 2005
[SQLCAT!] Top 10 Best Practices for Building a Large Scale Relational Data Warehouse

Whatever you do, if you need both summarized and detail data, don't put all this in the same table because you'd be mixing records with different grain.  In that case it is actually preferred to have two fact tables with the possibility to link them together (but only if you have to when the detailed info is required).  Some more info to support what I just typed can be found here: [MSDN] Fact Tables (see last three paragraphs)

Author Comment

ID: 38834283
I really appreciate your answers and they are helpful although my question was not about the best design.  I did come up with most of the design strategies you have stated.  And I do like the articles posted which will be helpful for partitioning tables.

My question was why is it not good dimensional modeling to join 2 Fact tables?

Expert Comment

ID: 38834327
I see, no worries, good that you're already aware of all that in fact! :)

So I guess part of the answer to your question would be that in some occasions it might be interesting to actually be able to join two fact tables (summary<>detail).  Is it good?  Well, no, but it's not the end of the world either if that join is only used when detailed data is required. Obviously, if you can come up with the required data without that join, all the better!

Expert Comment

by:Kent Olsen
ID: 38835301
Hi Bodhi,

Think about the structure of the fact table.  It contains the raw data associated with each transaction, a primary key (usually an IDENTITY column) and specific columns with foreign keys into the dimension tables.  The only indexed columns in the fact table are usually the primary and foreign keys.  That pushes all of the indexing and I/O to the much smaller dimension tables.

It may be that several fact tables share items in the dimension tables.  Consider an insurance company.  Policies, premiums, and claims would likely all be separate fact tables as there is a lot of data specific to each item.  However, they would share several dimension tables.  There would be a dimension table with policy numbers (and probably renewals) common to all 3 fact tables.  The same date dimensions would apply to all 3.  etc.
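The insurance example could be sketched in DDL as two fact tables sharing conformed dimensions. All names here are hypothetical, chosen only to illustrate the structure Kent describes (IDENTITY surrogate key plus foreign keys into dimensions).

```sql
-- Illustrative schema: separate fact tables that share the same
-- conformed dimensions (DimPolicy, DimDate).
CREATE TABLE dbo.FactPremium (
    PremiumID   bigint IDENTITY PRIMARY KEY,
    PolicyKey   int   NOT NULL REFERENCES dbo.DimPolicy (PolicyKey),
    DateKey     int   NOT NULL REFERENCES dbo.DimDate (DateKey),
    PremiumAmt  money NOT NULL
);

CREATE TABLE dbo.FactClaim (
    ClaimID     bigint IDENTITY PRIMARY KEY,
    PolicyKey   int   NOT NULL REFERENCES dbo.DimPolicy (PolicyKey),
    DateKey     int   NOT NULL REFERENCES dbo.DimDate (DateKey),
    ClaimAmt    money NOT NULL
);
```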

Now consider a huge retail operation like Wal Mart or Home Depot.  Their data warehouses are tiered.  At the lowest level is the retail store.  Recording ALL of the daily sales information into a single fact table could be close to impossible so there could easily be a fact table and set of dimension tables for each store.  The dimension tables are common, at least in structure.  All of the stores need a time dimension.  All of the stores will track sales, inventory, loss, employees, etc.  Generating company-wide values suitable for the executive team or an annual statement from thousands of fact tables is silly, so there would be a higher level fact table where the store details are summarized.  Some of the same dimension tables would be applicable to both levels, some would not.

It's not uncommon to join results from different fact tables.  For the insurance company to determine their net profit they'd take the difference of the sums of the values in the premium and claim tables.  
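The net-profit calculation is usually done "drill-across" style: aggregate each fact table separately, then join the small result sets on shared dimension attributes, rather than joining the two large fact tables row against row. A sketch, with hypothetical names:

```sql
-- Drill-across: sum each fact on its own, then join the tiny
-- per-year result sets instead of the big base tables.
WITH prem AS (
    SELECT d.CalendarYear, SUM(f.PremiumAmt) AS Premiums
    FROM dbo.FactPremium AS f
    JOIN dbo.DimDate AS d ON d.DateKey = f.DateKey
    GROUP BY d.CalendarYear
),
clm AS (
    SELECT d.CalendarYear, SUM(f.ClaimAmt) AS Claims
    FROM dbo.FactClaim AS f
    JOIN dbo.DimDate AS d ON d.DateKey = f.DateKey
    GROUP BY d.CalendarYear
)
SELECT p.CalendarYear, p.Premiums - c.Claims AS NetProfit
FROM prem AS p
JOIN clm AS c ON c.CalendarYear = p.CalendarYear;
```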

If the fact tables contain a lot of common data, something may well be wrong in the design.  If the fact tables manage different data sets, it's probably fine.

The whole point of the data warehouse is end-user speed. You trade hours of ETL time to shave time off of the queries. If your ETL guy's approach doesn't achieve this, find out why.

Good Luck,

Accepted Solution

Scott Pletcher earned 50 total points
ID: 38835899
"Star" schemas, in which fact tables are never linked to each other, is almost certainly the most common OLAP design.

But "snowflake" schemas are also quite valid, in which fact tables do indeed join to other fact tables.

[For further info, you can Google "star schema" and/or "snowflake schema".]

So, no, you can't legitimately have an absolute rule that "fact tables never join to other fact tables".  It depends on the specific data and requirements.

Author Comment

ID: 38886166
The comments I received in another discussion I liked the best...
"First of all, you create a star schema in a dimensional model for analytics so you can quickly slice and dice or aggregate your data. Adding a second fact not only degrades the performance by the joins to each fact and dimension, but also breaks the capabilities mentioned in the first place. There are many ways to have high performance queries on large or even huge tables such as partitioning data, bitmap indexing, and memory partitions (for recent data that has high query rates), etc. As a data warehouse expert, I would recommend that he touches up on Kimbal white papers or “The Data Warehouse Lifecycle Toolkit” book. "

" I agree with all the comments made above. I think the main way to frame the discussion is what is the best design for reporting facts with the same grain. If they are separated into separate fact tables, the only choice for bringing them back together is an expensive SQL statement - e.g. join of two large tables, union, etc. As the gentleman above mentioned, there are ways in the DBMS to resolve performance issues (partitioning, etc.).

"The only way I could see even discussing this would be if the facts were coming from two different sources and were somehow completely different yet had the same dimensionality. Meaning, there was no known use case (or even imagined use case in the future) where the facts would come together (even on a dashboard). That seems highly unlikely, and I might still make the choice to keep them together and work on the performance via other means. Good luck with the discussion!"

"First, you should always model your facts on the lowest possible grain.
Second, yeah it's absolutly natural that you combine facts downstream (mostly fact's that have different grain's or capture different business events).
Third, if you work with conformed dimensions you can use them in your "second level / combined" facts.
Fourth, Try table or row compression for large fact and dimension tables (if you have I/O bound issues). I have seen decrease of ETL steps with 50% some times.
Fith, if you use SSAS -> install BIDS helper, and let it calculate the amount rows, you get very good view on your cube partitioning. My experiences with SQL table partitions are bad... But depends on the size of the fact tables.

Example of combining facts at my business:
We have a fact table that contains a snapshot of a mortgage contract part for every financial period, but collections information is stored at the mortgage level (a different, higher-rollup granularity). The business users want to know which mortgages in each financial period are in late collections. So we aggregate / roll up the low-level mortgage contract part fact to a second, mortgage-level fact (with only the conformed dimensions that are true to the grain) and then combine it (left outer join) with the monthly collections snapshot fact.

And we do not "rebuild the fact" every time, but partially load it every month with the new period. But sometimes this is not possible.

If you are having load times of 15 hours, you should check your hardware; a baseline Fast Track model (from HP, for example, +/- 30K dollars) loads 500 million rows in +/- 30 minutes. We use DWH automation for ETL loading of the data warehouse"
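The mortgage example quoted above could be sketched like this. All table and column names are hypothetical, invented only to show the roll-up-then-left-join pattern described.

```sql
-- Roll the contract-part snapshot fact up to mortgage grain, then
-- left-join the mortgage-level collections snapshot on the shared
-- conformed keys (period, mortgage).
SELECT
    m.PeriodKey,
    m.MortgageKey,
    m.OutstandingBalance,
    c.DaysInArrears            -- NULL when not in collections
FROM (
    SELECT PeriodKey, MortgageKey,
           SUM(Balance) AS OutstandingBalance
    FROM dbo.FactMortgagePartSnapshot
    GROUP BY PeriodKey, MortgageKey
) AS m
LEFT JOIN dbo.FactCollectionsSnapshot AS c
       ON c.PeriodKey   = m.PeriodKey
      AND c.MortgageKey = m.MortgageKey;
```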

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
SSRS 2016 KPI 2 32
Script to backup a Database Dayli on SQL Server Express 3 19
Parse this column 6 27
too many installs coming along with SQL 2016? 1 18
It is helpful to note: This is a cosmetic update and is not required, but should help your reports look better for your boss.  This issue has manifested itself in SSRS version 3.0 is where I have seen this behavior in.  And this behavior is only see…
If you have heard of RFC822 date formats, they can be quite a challenge in SQL Server. RFC822 is an Internet standard format for email message headers, including all dates within those headers. The RFC822 protocols are available in detail at:   ht…
Although Jacob Bernoulli (1654-1705) has been credited as the creator of "Binomial Distribution Table", Gottfried Leibniz (1646-1716) did his dissertation on the subject in 1666; Leibniz you may recall is the co-inventor of "Calculus" and beat Isaac…
In a recent question ( here at Experts Exchange, a member asked how to run an AutoHotkey script (.AHK) directly from Notepad++ (aka NPP). This video…

820 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question