We help IT Professionals succeed at work.

We've partnered with Certified Experts, Carl Webster and Richard Faulkner, to bring you a podcast all about Citrix Workspace, moving to the cloud, and analytics & intelligence. Episode 2 coming soon!Listen Now

x

Question on datawarehousing with Sybase IQ

on
Medium Priority
692 Views
Hi,

Could you kindly explain what 'Facts' and 'Dimensions' are in relation to data warehousing?

I've come across these terms but i'm trying to understand what they mean, i'm totally lost to this concept...

THANKS
Comment
Watch Question

View Solution Only

Commented:
Facts and Dimensions are part of what is generally called the Dimensional Model for data warehousing.  You sometimes here it referred to as a "Star Schema" with the fact being the sun and the dimensions being planets.  The father and prime formalizer of this school is Ralph Kimball.  If you are involved with designing and or maintaining a data warehouse, you should have a copy of the seminal work, The Data Warehouse Toolkit.

Facts are generally irreducible pieces of data.  The dimensions are the normalizable and sometimes hierarchical details associated with the fact.  The "granularity" of the fact is defined by the dimensions that identify it.

As an example, let's take a typical retail star schema for a grocery store.  Lets further assume that the fact is each sale of each individual item.  So your dimensions might be:
- Period : there is always a hierarchical time dimension so that you can browse and group by certain periods, e.g. portion of the day, day of the week, weekly, monthly, holidays, etc.  This dimension should have every time/calendar specific piece of information like holidays (they sell a lot of food then).  Even though you may have the date and time to the milisecond, your period dimension probably has a granularity of say quarter hours.
- Store : assuming this is a chain and not just one store, you will want a dimension that identifies each.  The dimension should probably have geographic and organizational hierarchies, e.g. city, state, zip code, MMA, as well as zone and region that the company uses to organize stores.  A more sophisticated model might also include information about operating hours for example so that you can adjust reports to take that into account.
- Product : this will have a row for each possible SKU the store handles.  You will track manufacturer, size, may be case size; essentially every static piece of information about that product.
- Customer : if you have ever wondered why the grocery stores are so hot to get you to apply for a loyalty card, now you know.  The loyalty card is the best way for them to build a customer dimension, track demographics and do householding.  That is why they as things like your family income and whether you rent or own a home.  Even if you don't have a loyalty card, when you use a credit or debit card, they can attach a unique identifier that they have a high probability of seeing again the next time you purchase something.

So now your fact table has 4 surrogate keys (Period_ID, Store_ID, SKU, and Customer_ID), some "facts" like price, quantity, margin (if it is a really smart system), information about any coupon that was used, etc, and some details.  The details are generally non-numeric items (non-additive facts) that you still might want to know like the register ID, clerk ID, etc.

When you are loading data into the data warehouse, you have to make sure the dimensions are taken care of first since you will need the surrocate keys to load the fact.  In this example, Period and Store are pretty stable and are pre-loaded with known values.  Most of the Product table will be static and preloaded however, you have to be able to handle new and regional items on the fly.  These including things like microbrews that are purchased and handled by only a handful of stores.  Even if you have to go back later and fill in the details for the SKU, you have to create a dimension record and the surrogate key.  Customer will be a much more dynamic dimension with new identifiers all the time.  You may also have group customer records pre-populated to deal with those that can not be identified.  You can group them by method of payment, e.g. check, cash, food stamps, other, etc. and perhaps total sale buckets, e.g. \$0-\$5, \$5-\$10, \$10-\$20, \$20-\$50, \$50-\$75, \$75-\$100, \$100-\$150, \$150-----etc.

I could go on and write a book, or at least a few chapters, but this gives you the idea.  As I said before, the real way to understand this technique in depth is to get a copy of "The Data Warehouse Toolkit".  It is an easy read, and, unlike many technical books, is organized buy the type of business problem you are trying to solve more than the subsection of the technology.  Learning with practical examples you can actually identify with is much easier.

Regards,
Bill

Commented:
Thanks for that Bill!

Where can i get 'The Data Warehouse Toolkit' from? Is this free or is there a charge?

On a sidenote, going back to the question you answered earlier.....Is it fair to say that Sybase IQ's vertical storage is only good for single column "Selects" because it can pin-point only the necessary column/page as opposed to "Select All" columns??

Thanks again!
Commented:
"The Data Warehouse Toolkit" is available from Amazon.com and maybe stocked at your local book store.

No, that is not fair to say.  You are taking two extremes; select * versus select MyColumn, neither of which is very realistic, particularly in a data warehouse and analytic queries.

You are missing the purpose of a data warehouse.  It is not designed so you can pick out one complete record at a time.  That is just a simple repository and usually pretty worthless for doing analytics.

A data warehouse lets you answer questions about the data, not simply regurgitate rows.  It is pointless to structure a query against our fictitious grocery store sales DW in the form...

select *
from Period, Product, Customer, Store, Fact
where Fact.PID = Period.PID
and Fact.SKU = Product.SKU
and Fact.CustID = Customer.CustID
and Fact.StoreID = Store.StoreID

Even if you add clauses to reduce the amount of data, what you have is still just a pile of raw data and no value beyond having a flat file.  You can always go back to the POS system if you want to see an individual record, e.g. to handle a single customer complaint about an item.

The power of the data warehouse and OLAP (On Line Analytic Processing) is that you can quickly characterize the data and discover trends and patterns in mountains of data you could never manipulate one record at a time.  A famous data mining case study for a convenience store chain discovered there was a correlation between late evening beer and displosable diaper sales.  It turns out that wives send their husbands out to get diapers when they run out and, kind of as a consolation prize, the husbands pick up some beer while they are there.  The chain capitolized on this by putting premium beer on the end panels of the isles with the diapers.

The point is that nobody gets smarter by looking at individual rows of data.  It was the correlation that mattered.  Likewise, if you have a trend of customer complaints, you might look for a pattern to the geography, product(s), or cuatomer demographics.  Again, SELECT * is not going to be appropriate, much less helpful.

Regards,
Bill

Not the solution you were looking for? Getting a personalized solution is easy.

Thanks for using Experts Exchange.

• View three pieces of content (articles, solutions, posts, and videos)
• Ask the experts questions (counted toward content limit)
• Customize your dashboard and profile