We help IT Professionals succeed at work.

Can adding dimensions for numerical fields be efficient?

ugeb
ugeb used Ask the Experts™
on
Hi,

A point of note is that I am not a database expert, just a knowledgeable developer and db user.  We are having an internal debate here regarding the value of using dimensions or not for numerical data.   Right now we have a hybrid of numerical values being stored directly in the fact table, and the domain of varchar values being stored in dimensions with an id stored in the fact table.

My colleague was arguing that it is easier to have one or the other for the purpose of writing queries that rely on a hierarchy (dimension), or not, but not both.  Dimensions for varchar-based data, and storing the numeric data without dimensions in the fact table.

Could it possibly make sense to store numerical values in a separate dimension?  For double precision, you would have possibly millions of entries, so are you really saving anything?  It seems the execution complexity of joining on a bunch of numerical fields far outweighs any simplicity gained from consistency in writing queries.

What do you think?

Thanks!

p.s. we are using a Vertica database -- a columnar db





Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
Top Expert 2015

Commented:
Modern DBs will pad row(s) to block
In old foxpro one would add pad field with arbitrary long name to align lines with varchars to block.

Author

Commented:
I'm sorry, but I don't understand your comment at all.  Can you elaborate?
If I understand your original question correctl;y. you are asking if you should have a separate dimension table that would contain the complete "list" of, for instance, all possible double precision values with upto, say, 5 decimal places that you might need to use in your warehouse.  

IMHO, that would be a heck of a lot of over-kill and wasted space unless you are going to be using virtually all of them.  If you will be using a relatively small number of these values, then I would put the actual values in the Fact Table(s).  By having the actual value in the Fact Table(s), you would (among other things):
Save a lot of storage space that could be used for other purposes;
reduce the number of JOINs required to retrieve the data;
As a result of the previous point, you would improve the response time.
Top Expert 2008

Commented:
Fact tables are very big ... enourmos big.
Dimension tables are not so big.

If you are asking it is worth to save data from the dimension tables in the fact table the answer definitelly is NOT.

The fact table simply stores the pointer to the dimension (in most cases as Foreign Key that points to a Primary key in the dimension table.

Storing the data from the dimension table in the fact table you will do some bad things:

1. You will duplicate many data
2. You will face troubles when you need to update/delete/insert this data (the so called anomalies)


So due the normalization it is good to separate the two kind of tables.

In some special queries in the Data Warehouses you can denormalize the tables.

Oracle provides the Materialized Views that can be used for this ... but you have to measure 'pro' and 'contra' ... it depends on the volume of the DB, on the desired answer time ....
@schwertner,

The fact table simply stores the pointer to the dimension (in most cases as Foreign Key that points to a Primary key in the dimension table.

What would the relative size be of a table containing all possible double precision values to 5 decimal places?  Seems to me that such a table would give a whole new meaning to "huge" . . . and, what if the desire is "all possible double precision values" (i.e. without the limitation of the 5 decimal places)?  Does the difference in datatype storage between a BigInt and a Double/Float save as much space as would be required for the table and indexes associated with the dimension table for all double precision values considering the fact there will probably be a relatively small number of those values actually used in the Fact Tables?

Although, that last question is probably just a restatement of the original post. ;-)

Author

Commented:
Yes, this was my thought.  Let's say storage space is not the biggest issue, but performance is.  Our db consists of time series data so we will definitely have billions of rows at some point.  If we only used integers under, say 10,000, it could possibly be worth have those numbers from several fields as a dimension.  However, we will have double precision, and 5 or 6 decimal places seems reasonable for what we need to store.  While we could never store every double precision number, our ETL would come to a stand still on trying to map every number, integer or double to its index value.  Then the joins (or projections as Vertica's implementation utilizes) would be enormous.

I don't see how it would be reasonable to have such a huge dimension.  Again, our db is Vertica, a relatively new columnar db used much for time series data and is very fast.

Is there anybody who thinks this is a good idea (because I think it would kill our db)?

Thanks!

Top Expert 2008

Commented:
You have bilions of rows at same point.

If it is so you will have also bilions of redundant data entries if you denormalize the fact table.

Also take inaccount that possibly you will never select all of those bilions of rows. Only a part ....

But without detail information for the  database ( ERD, workflow) I can only guess and make common conclussions.
Top Expert 2008

Commented:
@ 8080_Diver

Please never use floating point numbers as indexes and for parent/child relations.

All possible double precission values?

I studied Mathematcs and one the first theorem was a  "Continium" theorem that says that between every two real numbers (no matter how close are they to each other) exist unlimited number of different real numbers. This is in pure mathematics. In the IT world only part of them could be represented in the computer, but nevertheles the number is huge.