Link to home
Start Free TrialLog in
Avatar of ugeb
ugebFlag for United States of America

asked on

Can adding dimensions for numerical fields be efficient?

Hi,

A point of note is that I am not a database expert, just a knowledgeable developer and db user.  We are having an internal debate here regarding the value of using dimensions or not for numerical data.   Right now we have a hybrid of numerical values being stored directly in the fact table, and the domain of varchar values being stored in dimensions with an id stored in the fact table.

My colleague was arguing that it is easier to have one or the other for the purpose of writing queries that rely on a hierarchy (dimension), or not, but not both.  Dimensions for varchar-based data, and storing the numeric data without dimensions in the fact table.

Could it possibly make sense to store numerical values in a separate dimension?  For double precision, you would have possibly millions of entries, so are you really saving anything?  It seems the execution complexity of joining on a bunch of numerical fields far outweighs any simplicity gained from consistency in writing queries.

What do you think?

Thanks!

p.s. we are using a Vertica database -- a columnar db





Avatar of gheist
gheist
Flag of Belgium image

Modern DBs will pad row(s) to block
In old foxpro one would add pad field with arbitrary long name to align lines with varchars to block.
Avatar of ugeb

ASKER

I'm sorry, but I don't understand your comment at all.  Can you elaborate?
ASKER CERTIFIED SOLUTION
Avatar of 8080_Diver
8080_Diver
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of schwertner
Fact tables are very big ... enourmos big.
Dimension tables are not so big.

If you are asking it is worth to save data from the dimension tables in the fact table the answer definitelly is NOT.

The fact table simply stores the pointer to the dimension (in most cases as Foreign Key that points to a Primary key in the dimension table.

Storing the data from the dimension table in the fact table you will do some bad things:

1. You will duplicate many data
2. You will face troubles when you need to update/delete/insert this data (the so called anomalies)


So due the normalization it is good to separate the two kind of tables.

In some special queries in the Data Warehouses you can denormalize the tables.

Oracle provides the Materialized Views that can be used for this ... but you have to measure 'pro' and 'contra' ... it depends on the volume of the DB, on the desired answer time ....
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of ugeb

ASKER

Yes, this was my thought.  Let's say storage space is not the biggest issue, but performance is.  Our db consists of time series data so we will definitely have billions of rows at some point.  If we only used integers under, say 10,000, it could possibly be worth have those numbers from several fields as a dimension.  However, we will have double precision, and 5 or 6 decimal places seems reasonable for what we need to store.  While we could never store every double precision number, our ETL would come to a stand still on trying to map every number, integer or double to its index value.  Then the joins (or projections as Vertica's implementation utilizes) would be enormous.

I don't see how it would be reasonable to have such a huge dimension.  Again, our db is Vertica, a relatively new columnar db used much for time series data and is very fast.

Is there anybody who thinks this is a good idea (because I think it would kill our db)?

Thanks!

You have bilions of rows at same point.

If it is so you will have also bilions of redundant data entries if you denormalize the fact table.

Also take inaccount that possibly you will never select all of those bilions of rows. Only a part ....

But without detail information for the  database ( ERD, workflow) I can only guess and make common conclussions.
@ 8080_Diver

Please never use floating point numbers as indexes and for parent/child relations.

All possible double precission values?

I studied Mathematcs and one the first theorem was a  "Continium" theorem that says that between every two real numbers (no matter how close are they to each other) exist unlimited number of different real numbers. This is in pure mathematics. In the IT world only part of them could be represented in the computer, but nevertheles the number is huge.