JohnAeris
asked on
Dimensional Model: How Can I Improve a Healthcare Call Centre Star Schema?
Hello experts!
The image shows part of a Dimensional Model design a colleague and I are working on. We are building a data warehouse and BI application using SQL Server and SSAS Tabular Model respectively.
Intro
My colleague settled on the above design because:
CCG Population and GP Population contain numeric data that will be used to calculate “Number of calls per 1000 population”
CCG & GP Population tables contains numeric data used in a measures and should be considered as fact tables.
This method avoids a monster dimension
I feel that there is perhaps a better approach because:
Fact Call Log is a business process, the other two fact tables are not.
GP Population and CCG Population should become dimensional attributes within DIM GP Practice.
In terms of super-simple user interface, I wish to have a single dimension holding this data.
The Challenge
DIM GP Practice is a dimension with many rows.GP Practice Name, GP Population and CCG population all change independently of each other at different times and frequencies.
Consequently, we need to track 3 changing variables in a single dimension – which may lead to an explosion in the number of rows in DIM GP Practice.
We also need to keep track of history, so the measure uses population figures in 2012 when viewing 2012 data and population figures for 2014 when viewing 2014 data and so forth.
Any thoughts on how best to refine this dimensional model?
Priorities are very easy user interface i.e. end-users have the smallest number of dimensions that they slice and dice by and ease of development to ensure rapid updates and changes going forward with minimal messing around with DAX statements.
Looking forward to your feedback!
Cheers!
Hi John,
Some things to start. :)
A) A data warehouse is an algebraic expression that is designed to answer a specific question. It may answer numerous questions, but it's core is a single answer.
2) Most data warehouses include a time dimension and utilize count() or sum() over time.
D) A star schema is typically a fact table with multiple dimensions around it. There can be multiple fact tables and the fact tables can "share" some/all of the dimension tables. I don't see that shown in your model.
With that as backdrop, it does look like you're trying to design the database around answering some key questions. :)
Your colleague's approach may well serve your stated needs, but it's not a true star schema. GP Population and CCG Population are dimensions (as you suggest) and describe attributes of the calls.
It is the call data that should be the fact table. When a new call comes in, it can be populated into the fact table as long as the dimension tables contain the items that describe the call. (The time dimension will have to be pre-populated, any "location" dimension will have to have the city, state, zip, for the caller location, etc.)
“Number of calls per 1000 population” is then computed by joining one (or both) of the population dimensions and the time dimension to the fact table (calls).
I believe that your approach is superior to your colleague's, and is much closer to being a true star schema.
Good Luck,
Kent
Some things to start. :)
A) A data warehouse is an algebraic expression that is designed to answer a specific question. It may answer numerous questions, but it's core is a single answer.
2) Most data warehouses include a time dimension and utilize count() or sum() over time.
D) A star schema is typically a fact table with multiple dimensions around it. There can be multiple fact tables and the fact tables can "share" some/all of the dimension tables. I don't see that shown in your model.
With that as backdrop, it does look like you're trying to design the database around answering some key questions. :)
Your colleague's approach may well serve your stated needs, but it's not a true star schema. GP Population and CCG Population are dimensions (as you suggest) and describe attributes of the calls.
It is the call data that should be the fact table. When a new call comes in, it can be populated into the fact table as long as the dimension tables contain the items that describe the call. (The time dimension will have to be pre-populated, any "location" dimension will have to have the city, state, zip, for the caller location, etc.)
“Number of calls per 1000 population” is then computed by joining one (or both) of the population dimensions and the time dimension to the fact table (calls).
I believe that your approach is superior to your colleague's, and is much closer to being a true star schema.
Good Luck,
Kent
ASKER
Hi Kent!
Thanks for your feedback.
The current model includes 30 Dimensions (including Time of day and Date) around a central fact table called CALL LOG. I removed this to make the image clearer.
“multiple fact tables and the fact tables can "share" some/all of the dimension tables”.
Agreed.
My colleague appears to be using the Fact CCG Population and Fact GP Population fact tables as “helper/lookup” tables in order to create the measures. As stated, it works, but I feel this is not the best approach.
“It is the call data that should be the fact table” – perhaps I didn’t make this clear, the Call Log Fact Table holds all the call data.
“I believe that your approach is superior to your colleague's, and is much closer to being a true star schema.” Thank you, not quite there yet though. :)
I still require assistance to create an alternative model that we can agree upon.
The key challenge:
Will it make more sense to separate that GP and CCG populations into a separate dimension even though they relate directly to GP and CCGs in the GP Practice dimension?
Any thoughts?
Thanks for your feedback.
The current model includes 30 Dimensions (including Time of day and Date) around a central fact table called CALL LOG. I removed this to make the image clearer.
“multiple fact tables and the fact tables can "share" some/all of the dimension tables”.
Agreed.
My colleague appears to be using the Fact CCG Population and Fact GP Population fact tables as “helper/lookup” tables in order to create the measures. As stated, it works, but I feel this is not the best approach.
“It is the call data that should be the fact table” – perhaps I didn’t make this clear, the Call Log Fact Table holds all the call data.
“I believe that your approach is superior to your colleague's, and is much closer to being a true star schema.” Thank you, not quite there yet though. :)
I still require assistance to create an alternative model that we can agree upon.
The key challenge:
CCG Population and GP population are updated at different times and quite frequently.
A CCG is made up of many GPs and this creates a natural hierarchy and so my initial thought is to incorporate all of GP, CCG and all populations in a single dimension.
However, if I do this, then we will end up with a dimension that changes 3 variables every month or so so it will grow exponentially.
Will it make more sense to separate that GP and CCG populations into a separate dimension even though they relate directly to GP and CCGs in the GP Practice dimension?
Any thoughts?
Can you describe the CCG Population and GP population tables a bit more? What data is contained in them, what is their relationship to other tables? What is the link between them? Etc...
ASKER
GP Population Output
Update Frequency: MonthlyNB > 10,000 rows of data, for about 500 GPs.
CCG Population output
Update Frequency: YearlyDIM GP Practice output
Update Frequency: Weekly
GP_Practice Code joins to GP_Practice Code in FACT GP Population.
GP_Practice_CCG_Code joins to CCG_Code in FACT CCG Population.
More Info
At present, FACT GP Population and FACT CCG Population are not displayed to users and are merely used as “helper / lookup tables” for population calculations, however, it breaks the star schema construct.
Also, Business Users would like to use CCG population as a report header / label. As it is not a dimension, this is not possible.
At present, we are using SSAS Tabular DAX to do the calculations. So it works, but I am looking for a better way to complete the dimensional modelling.
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
Hi Kent,
Good call. GP Population and CCG Population are populated by the Central Data Warehouse Teams for other business units. We also require the same data, so we are using these tables.
FACT CCG Population joins to DIM GP Practice table via GP_Practice_CCG_Code.
It looks like the GP Population and CCG Population tables are summary tables of other data. Does the data derive from the CALL fact table or some place else?
Good call. GP Population and CCG Population are populated by the Central Data Warehouse Teams for other business units. We also require the same data, so we are using these tables.
What has to happen to treat the CCG Population table as a dimension? Does it join to the CALL fact table on anything other than time?
FACT CCG Population joins to DIM GP Practice table via GP_Practice_CCG_Code.
ASKER
Perhaps, I should look at an Outrigger or mini-dimension approach? Hmmm...
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
Decided to stick with the original layout. This looks like a many-to-many issue, so, we're sticking with a TYPE 4 mini-dimension until we can figure out a more effect means.
THANKS!!!
THANKS!!!
ASKER
Definitions
TablesTerms