Bodhi108 asked:

What are the Reasons for NOT joining TWO fact tables? Need to educate others...

Our ETL guy, who is not familiar with Dimensional Modeling, insists on putting data into a new Fact table even though we have a current Fact table with the same grain and the same associated Dimensions. His reasoning is that the current table is getting too large and he wants to scale it down. The current table has about 1 million rows. The new table will be receiving about 10-20 million rows a day. It's clickstream data.

I would like to educate and explain why joining two FACT tables is not good design. I have written the following, but I'm looking for any additional feedback. We do use SSAS and some tricks could be done there, but I don't believe this is a good design. I would like to hear some of your answers to help me in educating our management and our ETL person...

Here is my current email...
"The whole point of a data warehouse is to quickly analyze the data and report on it. It is de-normalized, contains redundant data, and is optimized for querying. In an OLTP (transactional) database, the data is normalized, i.e. in third normal form, so we eliminate data redundancy, maintain data integrity, and optimize for transaction loading and updating.

In Dimensional Modeling we do not want to join Fact tables. The reporting we need compares different events... downloads to views, opens, firstteasers, etc. If we did it Sam's way, the data would be split across two tables, forcing a join, which is not good. Automation will be difficult, manual queries will have to be written, and the queries will be slow. It misses the whole point of having the data warehouse.

As far as scalability… the size of the database stays the same. The only difference is that one table will not grow as large. Since we are not querying FactMessageEvent, I'm not sure why the ETL would be slower. If there are issues with the ETL, we can discuss where the bottlenecks are and come up with solutions to address them, such as reviewing the indexes, dropping indexes for loads, optimizing queries, looking at the optimizer plans, partitioning the table, or archiving some of the data. Is the optimizer doing tablespace scans?, etc. I suggest we do this instead of coming up with a design that is not in accordance with Dimensional Modeling."
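To make the comparison concrete, here is a rough T-SQL sketch of what the reports look like against one fact table versus the drill-across join a second fact table at the same grain would force. The table and column names here (FactClickEvent, DimEventType, DimDate, the event type keys) are hypothetical stand-ins for our actual schema.

-- Today: one fact table, one pass, slice by any conformed dimension.
SELECT det.EventTypeName,
       dd.CalendarDate,
       COUNT_BIG(*) AS EventCount
FROM   FactClickEvent f
       JOIN DimEventType det ON det.EventTypeKey = f.EventTypeKey
       JOIN DimDate dd       ON dd.DateKey       = f.DateKey
GROUP BY det.EventTypeName, dd.CalendarDate;

-- With the same grain split across two fact tables, every comparison report
-- has to aggregate each table separately and then join the results.
SELECT COALESCE(a.CalendarDate, b.CalendarDate) AS CalendarDate,
       a.DownloadCount,
       b.ViewCount
FROM  (SELECT dd.CalendarDate, COUNT_BIG(*) AS DownloadCount
       FROM   FactClickEvent f
              JOIN DimDate dd ON dd.DateKey = f.DateKey
       WHERE  f.EventTypeKey = 1            -- downloads (hypothetical key)
       GROUP BY dd.CalendarDate) a
      FULL OUTER JOIN
      (SELECT dd.CalendarDate, COUNT_BIG(*) AS ViewCount
       FROM   FactMessageEvent f
              JOIN DimDate dd ON dd.DateKey = f.DateKey
       WHERE  f.EventTypeKey = 2            -- views (hypothetical key)
       GROUP BY dd.CalendarDate) b
      ON b.CalendarDate = a.CalendarDate;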

Any additions to the above are welcome! Thanks!
Daniel Wilson

ETL is not my strongest area, but I'm not sure I like either proposal.

His idea that requires the join between the 1 million record table and the table that will soon have BILLIONS of records isn't good.  That join will CRAWL.

Now, do you really want to add 10-20 million records a day?  Can you summarize those and add a lot fewer per day and still get the answers you want?

The questions your 1 million record table can answer will slow WAY down if that table grows to a billion records.  If you cannot summarize the clickstream data, can you run a new fact table that has everything it needs and is HUGE ... along with your current one that is not so big?  Ask the new one only the questions it is needed for.  Ask the current set of questions to the current table.  They will partially duplicate each other ... but ... this isn't OLTP.
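If summarization does turn out to be acceptable, that idea could look roughly like this sketch: aggregate the raw clickstream on the way in instead of landing every row in the detail fact table. StageClickEvent, FactClickDailySummary, and the key columns are hypothetical names.

-- Roll one day of raw clickstream up to a daily summary grain before loading.
DECLARE @LoadDateKey int = 20240101;   -- the day being loaded (example value)

INSERT INTO FactClickDailySummary (DateKey, EventTypeKey, MessageKey, EventCount)
SELECT s.DateKey,
       s.EventTypeKey,
       s.MessageKey,
       COUNT_BIG(*) AS EventCount
FROM   StageClickEvent s
WHERE  s.DateKey = @LoadDateKey
GROUP BY s.DateKey, s.EventTypeKey, s.MessageKey;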
Bodhi108 (ASKER):

Yes, I need to ask a few more questions to the users.

According to the users, it can't be summarized and still support the analysis. But what we can do is delete rows after a month and move that data into a summarized table.
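A rough sketch of that monthly roll-off, assuming hypothetical table names (FactClickEvent for the detail, FactClickMonthlySummary for the summary) and an integer DateKey in YYYYMMDD form:

DECLARE @CutoffDateKey int = 20240101;   -- keep one month of detail (example value)

BEGIN TRANSACTION;

-- Summarize the detail rows that are about to roll off...
INSERT INTO FactClickMonthlySummary (MonthKey, EventTypeKey, MessageKey, EventCount)
SELECT f.DateKey / 100 AS MonthKey,      -- e.g. 20231215 -> 202312
       f.EventTypeKey,
       f.MessageKey,
       COUNT_BIG(*) AS EventCount
FROM   FactClickEvent f
WHERE  f.DateKey < @CutoffDateKey
GROUP BY f.DateKey / 100, f.EventTypeKey, f.MessageKey;

-- ...then purge them from the detail fact table.
DELETE FROM FactClickEvent
WHERE  DateKey < @CutoffDateKey;

COMMIT TRANSACTION;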

Once it is in a cube, I believe it will be fast but it will take longer to process the cube.

The other solution is convincing them to keep the current table and create a new table at a summarized level.  I've already spoken to them about this and they seem to need it at this level.  I can ask a few more questions to see if we can get it to a summarized level.
So I just found out that we would be adding at most 1-3 million records a week, not 10-20 million records a day.

So, basically, the answer for not joining Fact tables is that the join would be so slow. I thought there were additional reasons for not joining them, because once the data is in the cube it would be fast; only the cube processing would take longer.
SOLUTION by ValentinoV
I really appreciate your answers and they are helpful, although my question was not about the best design. I did come up with most of the design strategies you have stated. And I do like the articles posted, which will be helpful for partitioning tables.

My question was why is it not good dimensional modeling to join 2 Fact tables?
I see, no worries, good that you're already aware of all that in fact! :)

So I guess part of the answer to your question would be that on some occasions it might actually be interesting to be able to join two fact tables (summary <> detail). Is it good? Well, no, but it's not the end of the world either if that join is only used when detailed data is required. Obviously, if you can come up with the required data without that join, all the better!
Hi Bodhi,


Think about the structure of the fact table.  It contains the raw data associated with each transaction, a primary key (usually an IDENTITY column) and specific columns with foreign keys into the dimension tables.  The only indexed columns in the fact table are usually the primary and foreign keys.  That pushes all of the indexing and I/O to the much smaller dimension tables.
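For illustration, a minimal sketch of that structure, assuming the dimension tables (DimDate, DimStore, DimProduct) already exist; all names here are hypothetical:

-- A narrow fact table: a surrogate primary key, foreign keys into the
-- dimensions, and the measures themselves.
CREATE TABLE FactSale (
    SaleKey    bigint IDENTITY(1,1) NOT NULL PRIMARY KEY,
    DateKey    int    NOT NULL REFERENCES DimDate (DateKey),
    StoreKey   int    NOT NULL REFERENCES DimStore (StoreKey),
    ProductKey int    NOT NULL REFERENCES DimProduct (ProductKey),
    Quantity   int    NOT NULL,
    SaleAmount money  NOT NULL
);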

It may be that several fact tables share items in the dimension tables.  Consider an insurance company.  Policies, premiums, and claims would likely all be separate fact tables as there is a lot of data specific to each item.  However, they would share several dimension tables.  There would be a dimension table with policy numbers (and probably renewals) common to all 3 fact tables.  The same date dimensions would apply to all 3.  etc.

Now consider a huge retail operation like Wal Mart or Home Depot.  Their data warehouses are tiered.  At the lowest level is the retail store.  Recording ALL of the daily sales information into a single fact table could be close to impossible so there could easily be a fact table and set of dimension tables for each store.  The dimension tables are common, at least in structure.  All of the stores need a time dimension.  All of the stores will track sales, inventory, loss, employees, etc.  Generating company-wide values suitable for the executive team or an annual statement from thousands of fact tables is silly, so there would be a higher level fact table where the store details are summarized.  Some of the same dimension tables would be applicable to both levels, some would not.

It's not uncommon to join results from different fact tables.  For the insurance company to determine their net profit they'd take the difference of the sums of the values in the premium and claim tables.  
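As a rough sketch of that drill-across, assuming hypothetical FactPremium and FactClaim tables that share a conformed DimDate dimension: each fact is aggregated separately at a common grain, and only the results are combined.

SELECT p.CalendarYear,
       p.TotalPremiums,
       c.TotalClaims,
       p.TotalPremiums - c.TotalClaims AS NetResult
FROM  (SELECT d.CalendarYear, SUM(f.PremiumAmount) AS TotalPremiums
       FROM   FactPremium f
              JOIN DimDate d ON d.DateKey = f.DateKey
       GROUP BY d.CalendarYear) p
      JOIN
      (SELECT d.CalendarYear, SUM(f.ClaimAmount) AS TotalClaims
       FROM   FactClaim f
              JOIN DimDate d ON d.DateKey = f.DateKey
       GROUP BY d.CalendarYear) c
      ON c.CalendarYear = p.CalendarYear;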

If the fact tables contain a lot of common data, something may well be wrong in the design.  If the fact tables manage different data sets, it's probably fine.

The whole point of the data warehouse is end-user speed. You trade hours of ETL time to shave time off of the queries. If your ETL guy's approach doesn't achieve this, find out why.


Good Luck,
Kent
ASKER CERTIFIED SOLUTION
Here are the comments I received in another discussion that I liked the best...
"First of all, you create a star schema in a dimensional model for analytics so you can quickly slice and dice or aggregate your data. Adding a second fact table not only degrades performance through the joins to each fact and dimension, but also breaks the capabilities mentioned in the first place. There are many ways to get high-performance queries on large or even huge tables, such as partitioning data, bitmap indexing, and memory partitions (for recent data that has high query rates), etc. As a data warehouse expert, I would recommend that he brushes up on the Kimball white papers or “The Data Warehouse Lifecycle Toolkit” book."

" I agree with all the comments made above. I think the main way to frame the discussion is: what is the best design for reporting facts with the same grain? If they are separated into separate fact tables, the only choice for bringing them back together is an expensive SQL statement - e.g. a join of two large tables, a union, etc. As the gentleman above mentioned, there are ways in the DBMS to resolve performance issues (partitioning, etc.).

"The only way I could see even discussing this would be if the facts were coming from two different sources and were somehow completely different yet had the same dimensionality. Meaning, there was no known use case (or even imagined use case in the future) where the facts would come together (even on a dashboard). That seems highly unlikely, and I might still make the choice to keep them together and work on the performance via other means. Good luck with the discussion!"
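Since partitioning keeps coming up as the alternative to splitting the fact table, here is a minimal sketch of date-based table partitioning in SQL Server; the boundary values and object names are hypothetical (a new table is shown for brevity; an existing table would be rebuilt onto the partition scheme).

-- Partition the large fact table by month so loads and purges touch only one slice.
CREATE PARTITION FUNCTION pfClickDate (int)
    AS RANGE RIGHT FOR VALUES (20240101, 20240201, 20240301);

CREATE PARTITION SCHEME psClickDate
    AS PARTITION pfClickDate ALL TO ([PRIMARY]);

CREATE TABLE FactClickEvent (
    DateKey      int NOT NULL,
    EventTypeKey int NOT NULL,
    MessageKey   int NOT NULL,
    EventCount   int NOT NULL
) ON psClickDate (DateKey);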

"First, you should always model your facts at the lowest possible grain.
Second, yes, it's absolutely natural that you combine facts downstream (mostly facts that have different grains or capture different business events).
Third, if you work with conformed dimensions you can reuse them in your "second level / combined" facts.
Fourth, try table or row compression for large fact and dimension tables (if you have I/O-bound issues). I have sometimes seen ETL steps speed up by 50%.
Fifth, if you use SSAS, install BIDS Helper and let it calculate the row counts; you get a very good view of your cube partitioning. My experiences with SQL table partitions are bad... but it depends on the size of the fact tables.

Example of combining facts at my business:
We have a fact table that contains a snapshot of a mortgage contract part for every financial period, but collections information is stored at the mortgage level (a different, higher roll-up granularity). The business users want to know which mortgages in each financial period are in late collections. So we aggregate / roll up the low-level mortgage contract part fact to a mortgage-level (second) fact, with only the conformed dimensions that are true to the grain, and then combine it (left outer join) with the monthly collections snapshot fact.

And we do not "rebuild the fact" every time, but partially load it every month with the new period. Sometimes, though, this is not possible.

If you are having load times of 15 hours, you should check your hardware; a baseline Fast Track machine (from HP, for example, +/- 30K dollars) loads 500 million rows in +/- 30 minutes. We use DWH automation for ETL loading of the data warehouse."
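To illustrate the roll-up-then-combine pattern (and the compression tip) described in that last comment, here is a rough T-SQL sketch; FactMortgageContractSnapshot, FactCollectionsSnapshot, and their columns are hypothetical names.

-- Page compression on a large fact table (the "Fourth" tip above).
ALTER TABLE FactMortgageContractSnapshot REBUILD WITH (DATA_COMPRESSION = PAGE);

-- Roll the contract-part fact up to mortgage level, then combine it with the
-- monthly collections snapshot fact via a left outer join.
SELECT r.FinancialPeriodKey,
       r.MortgageKey,
       r.OutstandingBalance,
       c.DaysInArrears
FROM  (SELECT f.FinancialPeriodKey,
              f.MortgageKey,
              SUM(f.OutstandingBalance) AS OutstandingBalance
       FROM   FactMortgageContractSnapshot f
       GROUP BY f.FinancialPeriodKey, f.MortgageKey) r
      LEFT OUTER JOIN FactCollectionsSnapshot c
          ON  c.FinancialPeriodKey = r.FinancialPeriodKey
          AND c.MortgageKey        = r.MortgageKey;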