
Data warehouse solution for an open-source database

Last Modified: 2020-11-01
Hi,

I am comparing data warehouse solutions for MySQL/MariaDB. We are looking at:

1) Oracle Database 12c Data Warehouse
2) IBM BigInsights
3) Microsoft Azure
4) Pentaho Kettle / other OSS

Any ideas?

CERTIFIED EXPERT
Most Valuable Expert 2012
Distinguished Expert 2019
Commented:
Kent Olsen, Data Warehouse / Database Architect
CERTIFIED EXPERT
Commented:
marrowyung, Senior Technical Architect (Data)

Author

Commented:
Is there no comparison based on the functionality of each of the products I listed above?
marrowyung, Senior Technical Architect (Data)

Author

Commented:
Please also help with my ETL question, thanks.
CERTIFIED EXPERT
Most Valuable Expert 2012
Distinguished Expert 2019
Commented:
marrowyung, Senior Technical Architect (Data)

Author

Commented:
"some company has compared them."

Yeah, this group does not have a data warehouse group.

But vendors will bad-mouth each other; that is the concern.
CERTIFIED EXPERT
Most Valuable Expert 2012
Distinguished Expert 2019
Commented:
marrowyung, Senior Technical Architect (Data)

Author

Commented:
"I tend to take what the others say and ask the competitor to confirm or deny it.  Most major vendors won't lie about it."

Here they usually won't do that, as they worry about getting sued!
marrowyung, Senior Technical Architect (Data)

Author

Commented:
In the MySQL InnoDB Cluster case, what MySQL (Oracle) came and demoed is this: since MySQL does not have an OLAP / data warehouse solution of its own, the whole approach is to set up replication to clone the database to another server, then install an OLAP / data warehouse application and let that application build everything on top of the database clone. I think it can work for MariaDB as well.

Can you all suggest any OLAP / data warehouse applications?
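For what it's worth, the replication half of that approach is straightforward to stand up. A minimal sketch of a MariaDB reporting replica, with hypothetical host addresses, account name and password (the OLAP/DW tool would then build its own structures on top of the replica):

```sql
-- On the primary (binary logging must already be enabled via log_bin in my.cnf):
CREATE USER 'repl'@'10.0.0.%' IDENTIFIED BY 'replica_password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'10.0.0.%';

-- On the reporting replica (MariaDB syntax, GTID-based):
CHANGE MASTER TO
  MASTER_HOST = '10.0.0.10',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = 'replica_password',
  MASTER_USE_GTID = slave_pos;
START SLAVE;
```

An initial data copy (e.g. with mysqldump or mariabackup) is still needed to seed the replica before starting replication.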
johnsone, Senior Oracle DBA
CERTIFIED EXPERT
Commented:
marrowyung, Senior Technical Architect (Data)

Author

Commented:
"Never trust demos."

Not even the one from Oracle itself?

"The demos are designed to showcase the product they are trying to sell and show you where they shine over the competitor."

But they said MySQL does NOT have an OLAP and warehouse solution!

MariaDB has an AX solution, but on Friday a vendor came and demoed what the solution for MariaDB is, and funnily enough, if we need to migrate from Oracle OLAP to the MariaDB solution, even MariaDB themselves suggest third-party tools!

Informatica + QlikView can be the solution. The plan can be: install Informatica and connect it to Oracle OLAP; the tool will load all the information from Oracle OLAP and set up its own repository for OLAP and DW; then we can take out Oracle OLAP, BI Answers and BI Server.

"What you have to actually listen to is what they aren't telling you."

But how do we find that out?
johnsone, Senior Oracle DBA
CERTIFIED EXPERT
Commented:
marrowyung, Senior Technical Architect (Data)

Author

Commented:
"I wouldn't trust a demo from Oracle.  They are trying to sell you something."

No, no. They are saying they can't do it, and the solution is to set up replication to create a clone and use third-party tools for the OLAP and DW.

What tools? They didn't say!

They only mentioned ProxySQL for SQL routing with encryption, but that one is open source. They don't charge us for it at all; it was just a suggestion on the OLAP solution.

"They get a piece of that too.  And, they also get the sale they were originally after."

Yeah, but in the MySQL case, ProxySQL can't charge us! And they said this is an unsupported tool.

"There is usually a two person team, never ever trust the sales person, if the technical person says something about the product you can believe that, but they aren't allowed to tell you anything bad."

Agreed!

"The data and workflow they use in demos is specifically designed to make their product look the best."

Yeah, that's why we will PoC it using our own data and our own way of working.
Kent Olsen, Data Warehouse / Database Architect
CERTIFIED EXPERT
Commented:
marrowyung, Senior Technical Architect (Data)

Author

Commented:
"have you considered bringing in your own hired gun to listen to the sales pitches and help you filter through them?"

Yes, our whole team here listens to them.

We interviewed two vendors, and one of them is really weak on MariaDB! I am far ahead of them. One problem we can see is that they just copy web content from MariaDB.com; they don't make much sense.

But they do have experience with this kind of project, and there are projects here that they did before.

On the DB side, though, we both agree the Ispirer tool is good!
marrowyung, Senior Technical Architect (Data)

Author

Commented:
Kent Olsen,

"Can you tell us a bit more about what your data warehouse will look like?  Overall size, number of rows, etc."

3 TB.

Any suggestions?
Kent Olsen, Data Warehouse / Database Architect
CERTIFIED EXPERT

Commented:
3TB is workaday in today's world.  Any 64-bit engine should handle it with ease.  (I can remember when one terabyte of storage didn't exist and one terabit of storage required an entire room of the old 9-track tapes!)

Make sure that your server is beefy enough.  I'd start with 128GB and 8 or 16 cores, depending on how it will be used.  And it's probably a good idea for the server to be expandable (not maxed out in memory or cores).  Just the indexing will probably be several times the memory size, so the ability to cache more data/index may be critical for performance.  Again, usage is key here.  (If you're typically looking at the last 2 months of data and the database has 10 years of data, most queries may run from cached values.)

Again, it's probably most cost effective to piggyback on the database engine license that you already have.  If that's not a concern, check to see whether any of the databases you're considering has (or doesn't have) some of the critical analytical query functionality (inline aggregation) that can simplify and speed up your queries.
marrowyung, Senior Technical Architect (Data)

Author

Commented:
Thanks.

Actually I am looking for a comparison between the products, and I found this one: https://solutionsreview.com/dl/m/2018_2214371218_BI_Matrix.pdf

Is that comparison good for data warehouses?

"If that's not a concern, check to see if any of the databases you're considering has (or doesn't have) some of the critical analytical query functionality (inline aggregation) that can simplify and speed up your queries."

It seems that in this area MS has stood ahead of everyone for the last two years, especially with Azure.
Kent Olsen, Data Warehouse / Database Architect
CERTIFIED EXPERT

Commented:
Hi Marrowyung,

I don't care for the document.  It looks like propaganda that a marketing team would display as a PowerPoint presentation.  There's no meat to it.  Nothing that discusses their methodologies, how the vendors were chosen, how the surveys were conducted, etc.

It's pretty clear that they like Microsoft and ClearData.  Not sure how much faith I'd have in the document.


Kent
marrowyung, Senior Technical Architect (Data)

Author

Commented:
Hi,

Thanks. But is this doc suitable for data warehouse DBs? If it is NOT for data warehouses, there is no point in my reading it at all.
Kent Olsen, Data Warehouse / Database Architect
CERTIFIED EXPERT

Commented:
The packages and vendors listed suggest that it pertains to data warehouses and BI solutions.
marrowyung, Senior Technical Architect (Data)

Author

Commented:
Yes, thanks. I think so too; I just wanted to confirm it with the world's most respected experts.
Kent Olsen, Data Warehouse / Database Architect
CERTIFIED EXPERT

Commented:
:)
marrowyung, Senior Technical Architect (Data)

Author

Commented:
Very luckily, we do not use a data warehouse at all right now, though I don't think they even know whether they are using one or not.

So the question now is: under what situations MUST we use a data warehouse? We only use the database for reporting here, but BI usually involves a data warehouse, right?
Kent Olsen, Data Warehouse / Database Architect
CERTIFIED EXPERT

Commented:
A data warehouse is "just a database" just as your production database is "just a database".  Where the two differ is in their logical design.  The data warehouse is designed to optimize reporting/querying, while the production database is designed to optimize writes.

From the business' perspective you don't need a data warehouse -- the business runs from the production database.  But if you have enough data, the data is complex enough, or reporting on the data interferes with normal database performance, a data warehouse is a good idea.

BI is often very consuming of database resources.  Depending on your BI needs you may need the data warehouse just because BI processing demands may be so great that it interferes with the performance of other database access.  Reading a single row from either should be trivial.  But BI operations often consume a lot of data.  A BI query based on the fiscal year could have to read through millions (or billions) of rows in the production database but a data warehouse that is designed to answer that query could run almost instantaneously.  (I built a data warehouse a few years ago and moved some production reports to it.  On the production system each report ran in 90-95 seconds.  On the data warehouse they ran in less than 1/10 second, or about 1/1000 of the original time.  There were 5 related reports so the total run time on the production database was over 7 1/2 minutes, under a second on the data warehouse.  A lot of things had to align for that kind of improvement, but it shows how much a data warehouse can improve your operations.)

Whenever I discuss a data warehouse with management I make sure to get two points across.  

1)  "This is just Algebra".  Accessing data from the database follows the rules and design of the database and we can compute the amount of work necessary to satisfy a query.  The production database (an OLTP) does more work to produce a report than does a data warehouse.

2)  What question(s) do you want answered from the data warehouse?  The reactions are usually predictable based on the audience.  Upper management usually wants to know how much money they're making.  "That's two questions" is my response.  "Income and Expenditures have completely different rules so we need to answer those two questions to answer your question".  Accounting and finance inherently know that and have much better questions.

That's a bit wordy, but "do we need a data warehouse" is complicated.  Only your management and users can answer that.
CERTIFIED EXPERT
Most Valuable Expert 2012
Distinguished Expert 2019

Commented:
Kent is 100% spot on.  I just wanted to add a little.

OLTP and DSS (Data warehouse) also have drastically different data models because they need to answer drastically different questions.  In DSS you normally see FACT and DIMENSION table structures.  I'll let you Google those and read up on them.  There is a LOT of information out there.

Normally OLTP performs the day to day business.  DSS answers questions that no one thought to ever ask.

In retail you collect A LOT of data that you might not think relates but it might and typically does.

For example:
On Wednesdays when it rains, do we sell more of productX when productY is on sale?  And do more right-handed males wearing green socks buy them?  How about the trend over the last 10 years?

Do dog owners buy more Coke than Pepsi?  Might want to have a sale on dog food and coke at the same time!

You normally find quirky relationships you never knew you had.

Can you get that from the OLTP system?  Possibly, but imagine the time.  DSS systems are designed for quick access to this type of data.

You also might not keep 10 years of data in the OLTP system.  In the DSS system you typically do.
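As an aside, the FACT/DIMENSION structure described above can be sketched as a minimal star schema. All table and column names here are purely illustrative, not from any real product:

```sql
-- Dimension tables: small, one row per distinct value, indexed for filtering.
CREATE TABLE TimeDim (
  TimeID    INT PRIMARY KEY,
  FullDate  DATE NOT NULL,
  DayOfWeek TINYINT,              -- lets you ask "on Wednesdays..."
  INDEX (FullDate)
);

CREATE TABLE ProductDim (
  ProductID INT PRIMARY KEY,
  Name      VARCHAR(100),
  Category  VARCHAR(50),
  INDEX (Category)
);

-- Fact table: wide and denormalized, one row per sale;
-- typically no indexes on the measures themselves.
CREATE TABLE SalesFact (
  TimeID    INT NOT NULL,         -- references TimeDim
  ProductID INT NOT NULL,         -- references ProductDim
  Quantity  INT,
  Amount    DECIMAL(10,2)
);
```

A question like "how much of productX sold on Wednesdays?" filters the small dimension tables first and only then touches the matching fact rows.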
marrowyung, Senior Technical Architect (Data)

Author

Commented:
Kent Olsen,

"(I built a data warehouse a few years ago and moved some production reports to it.  On the production system each report ran in 90-95 seconds.  On the data warehouse they ran in less than 1/10 second, or about 1/1000 of the original time.  There were 5 related reports so the total run time on the production database was over 7 1/2 minutes, under a second on the data warehouse.)"

So the indexes of these two DBs are different?

I don't think you are only talking about adding indexes to the data warehouse DB, right? Will there be something like going from 3rd normal form down to 2nd and 1st normal form?

Snowflake schema / star schema?

"Only your management and users can answer that."

I think in the end they will say: tell us if we need one, as we are not technical people.
Kent Olsen, Data Warehouse / Database Architect
CERTIFIED EXPERT

Commented:
"so these 2x DB index are different ?"

It's not just indexing.  The OLTP (production) database that I referenced had so many indexes that they took up more space than the actual data.  Adding a few more indexes wouldn't have helped, but a data warehouse design did.

The OLTP is typically very normalized, with a target design somewhere close to 3rd normal form.  The big exception is any log table where activity is recorded.  It's typically completely denormalized and is a snapshot of the relevant data at the time of the transaction.  (If the log table was normalized, changes in any data table would appear in the log table when the child tables were joined, making the time-based snapshot wrong.)

As odd as this sounds, the denormalized transaction log is often the centerpiece of the data warehouse.  A single record (often called a "fact" record) contains every data item related to the transaction.  This table becomes a data warehouse when the supporting tables and indexes are created.  The speed in reading data and reporting comes from the design and indexing of the supporting tables (often called "dimension" tables).  

There are typically no indexes on the user data in the fact table.  The speed comes from several places.  Data isn't read from a fact table until all of the references to the dimension tables are done.  Relatively small amounts of user data are cached in exchange for a large percentage of indexes being cached so seemingly unrelated queries benefit from using just one common dimension, such as time.  

The data warehouse can read user data from a single fact table where the OLTP has to join and process multiple tables.  Imagine a report that has to process 10,000,000 records.  Reading that many rows is not trivial and the data warehouse will take a measurable amount of time to assemble that data, but once the indexed dimension tables are processed, it's a simple case of reading the data.  A normalized OLTP has to do a lot more work due to the data structure and joining of the relevant tables.  Joining two tables results in a temporary table (sometimes called a derived table -- the derived table isn't indexed so every use of it requires a full table scan.)  Joining 4 tables to generate those 10M rows will result in  3 derived tables.  If after the 1st join there are 1M rows, after the second join 5M rows, and 10M rows after the 3rd, the OLTP will have written 16M rows into derived tables and done 2 full table scans over 6M rows, in addition to all of the overhead of processing the data in the user tables.
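To illustrate that difference in join work (all table names here are hypothetical, not from any real schema), the same report might look like this on each side:

```sql
-- Data warehouse: one wide fact table, one small dimension join.
SELECT TD.FullDate, SUM(SF.Amount)
FROM SalesFact SF
INNER JOIN TimeDim TD ON TD.TimeID = SF.TimeID
WHERE TD.FullDate >= '2010-01-01'
GROUP BY TD.FullDate;

-- Normalized OLTP: the same figures require chaining several tables,
-- with each join materializing an unindexed derived result.
SELECT o.OrderDate, SUM(ol.Quantity * p.Price)
FROM Orders o
INNER JOIN OrderLines ol ON ol.OrderID = o.OrderID
INNER JOIN Products p    ON p.ProductID = ol.ProductID
WHERE o.OrderDate >= '2010-01-01'
GROUP BY o.OrderDate;
```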
marrowyung, Senior Technical Architect (Data)

Author

Commented:
"The OLTP (production) database that I referenced had so many indexes that they took up more space than the actual data."

It usually is! A lot of unused indexes.


"A normalized OLTP has to do a lot more work due to the data structure and joining of the relevant tables.  Joining two tables results in a temporary table (sometimes called a derived table -- the derived table isn't indexed so every use of it requires a full table scan.)"

Yeah, that's why we have to tune every inner join, but fortunately tuning a join is not hard!

Denormalization means many fewer joins; you just read from one single table.

"If after the 1st join there are 1M rows, after the second join 5M rows, and 10M rows after the 3rd, the OLTP will have written 16M rows into derived tables and done 2 full table scans over 6M rows, in addition to all of the overhead of processing the data in the user tables."

I agree.

Other than many fewer inner joins, what else can a data warehouse help with?
Kent Olsen, Data Warehouse / Database Architect
CERTIFIED EXPERT

Commented:
"other than much less inner join, anything else data warehouse can help on ? "

Choosing and filtering data in a data warehouse is usually done by joining fact and dimension tables with inner joins.  The tables and data relationships are structured so that joins don't result in Cartesian products and each join reduces the number of selected rows.

Internally, an index consists of the value being indexed and a database key that gets around all user and language constraints.  It references the exact location of the data.  (The key typically contains 3 values, the file number (location of the file identifier in the DBMS internal tables), the block number (the location in the file that needs to be read or paged in), and the slot number (the position in the block).)  Joining two dimension tables to a fact table actually results in the dimension tables being joined, despite them having no common data points.  The derived table contains the database keys that are common to the selections within the two dimension tables.  The query completes by selecting the data indicated by the list of database keys.  

That's where the power (and speed) of a data warehouse originates.  The DBMS joins tables with no common user data on the internal value of the data location (the database key), filtering by their indexes, and produces a derived table of database keys that are essentially 48 to 64 bit integers.

Modern DBMS have incorporated that technique into OLTP queries as much as possible, but the OLTP structure just isn't as efficient for querying and reporting as the data warehouse.  

You may be able to tune individual queries in your OLTP to be better, but the underlying database structure means that they will never be as efficient as queries from a properly designed data warehouse.
marrowyung, Senior Technical Architect (Data)

Author

Commented:
"That's where the power (and speed) of a data warehouse originates"

Are you saying that in a DW the derived table contains the database keys, even for the values common to the selections within the two dimension tables?

So the derived table also has an index, which is not the case in OLTP?
johnsone, Senior Oracle DBA
CERTIFIED EXPERT

Commented:
A data warehouse typically has very few (if any) indexes.

I used to work for a company that built data warehouse appliances.  Their system didn't even have the capability of creating an index.  They weren't needed or useful.
Kent Olsen, Data Warehouse / Database Architect
CERTIFIED EXPERT

Commented:
"you are saying in DW, The derived table contains the database keys even there are common to the selections within the two dimension tables?"

That was kind of a short explanation that needs a white board to do correctly.  :)

Assume your database has critical dimensions of time/date (most do) and customer code.  Part of the indexes for each of the dimension tables could be:

'2010/01/02' (11, 6, 2)
'2010/01/03' (11, 6, 3)
'2010/01/03' (11, 12, 1)
'2010/01/04' (11, 19, 3)

and

'CUST01'  (11, 4, 1)
'CUST02'  (11, 12, 1)
'CUST02'  (11, 14, 3)
'CUST03' (11, 18, 4)

The numbers in parentheses are the database keys (file number, block number, slot).  The query

SELECT * 
FROM SalesFacts SF
INNER JOIN TimeDim TD
  ON SF.TimeID = TD.TimeID
INNER JOIN CustomerDim CD
  ON SF.CustomerID = CD.CustomerID
WHERE TD.Date = '2010/01/03'
  AND CD.Code = 'CUST02'



The DBMS joins the dimension tables (Time and Customer) on the database keys.  Duplicates of date ('2010/01/03') and customer code 'CUST02' resolve themselves because no two data rows will have the same database key.  When the join and filter are complete the database keys of the desired data are in the internal derived table and it's a simple matter of reading the relevant data.  If the DBMS performs filtered joins, (a join based on the query filter), this can be incredibly fast.

Now imagine the execution of a similar query on an OLTP.  The query will have to join several data tables due to the normalized design.  Some of the joins may have to be done before the filtering, others may occur after, but the overhead is much greater than what the data warehouse will have to do.
marrowyung, Senior Technical Architect (Data)

Author

Commented:
"That was kind of a short explanation that needs a white board to do correctly.  :)"

Yeah, please draw a diagram. :) :)

And yes, again you refer to a join statement on both the DW and the OLTP!  Join, join and join!  I knew what you meant! Just like SELECT DISTINCT and GROUP BY in OLTP operations.
marrowyung, Senior Technical Architect (Data)

Author

Commented:
Now I am checking the DB warehouse features from https://solutionsreview.com/dl/m/2018_2214371218_BI_Matrix.pdf

It seems Oracle is better, but I am not sure; something in the report seems inaccurate.

Does MS not have the following?

1) Data management
2) Real-time monitoring
3) Scorecards
4) Search-based BI
5) Data blending


Is that true?
marrowyung, Senior Technical Architect (Data)

Author

Commented:
Thanks all.
