sabarish2u
asked on
Database design of large amounts of heterogeneous data
All,
We are developing an application that collects data from different sources. The data could later be used for visualization [graphs], reporting, and calculations [standard deviation, mean, etc.]. The format of the data can differ from source to source. Some typical formats are
1) Key1:Value1
Key2:Value2
Key3:Value3
-----------------
2) Key1:Value1;Key2:Value2;Key3:Value3
3) Key1:Value1,Key2:Value2,Key3:Value3
4) Key1:Value1
Key2:Value2
Key3:Value3
The key will always be a string and the value can be a double, integer, etc. How do we design a database for storing such heterogeneous data? The database could be SQL Server, Oracle, or SQLite.
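Before the database design question even comes up, the mixed input formats above have to be normalized. A minimal sketch of a parser for those formats, assuming the only separators are newlines, semicolons, and commas as shown (the function name and type-coercion rules are illustrative, not from the original post):

```python
import re

def parse_record(text):
    """Parse one record of key:value pairs separated by newlines,
    semicolons, or commas, as in the sample formats above.
    Values are coerced to int or float where possible."""
    pairs = {}
    for token in re.split(r"[\n;,]", text):
        token = token.strip()
        if not token or ":" not in token:
            continue  # skip blank lines and divider lines like "-----"
        key, _, raw = token.partition(":")
        value = raw.strip()
        for cast in (int, float):  # try the numeric types first
            try:
                value = cast(value)
                break
            except ValueError:
                pass
        pairs[key.strip()] = value
    return pairs

print(parse_record("Key1:1;Key2:2.5;Key3:abc"))
```

Each value comes out typed (int, float, or string), which matters for whichever storage design is chosen below.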
You will probably have to convert it all to strings. I suppose you could store it in binary BLOBs, but then you have to write an application to get the data in and out of the database and into forms you can use.
On the design: you could convert the values to strings, as Dave suggested.
If your sources are segregated, say (for example) 8 input systems, then you can think of 8 interface tables for those sources, and all the interface tables should have the same structure.
As you said, the key is always a string, so think through the various possible kinds of keys so that you can identify the columns for your tables.
On Oracle - If the values are always numeric, then just use NUMBER. If not, then consider using the ANYDATA type, which, as the name implies, will store any data type Oracle supports.
Another option, which would work on any platform, is to simply extend your table with a column for each type:
Key, IntegerValue, DoubleValue, StringValue, DateValue, IntervalValue, GeoLocValue, etc.
This means your queries would be slightly different for each type; but they probably would be anyway, because you'd have to make accommodations for the varying data if it were all jammed into a single data type.
By using the correct type for each value you'll get not only better efficiency in processing, but also an implicit data-quality constraint: you can't put a date into an integer column, but if everything were a string, you could.
A downside to this method is that you'll have to declare all of the value columns nullable, since only one of them is populated for a given row.
And of course the last option, which is probably best...
Simply create separate key/value tables, one table for each data type.
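Both variants described above can be sketched against SQLite (table and column names here are illustrative, not prescribed by the answer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Variant 1: one wide table with a nullable column per data type.
conn.execute("""
    CREATE TABLE measurement (
        key          TEXT NOT NULL,
        int_value    INTEGER,
        double_value REAL,
        string_value TEXT,
        date_value   TEXT   -- SQLite has no native DATE type
    )
""")
conn.execute("INSERT INTO measurement (key, double_value) VALUES (?, ?)",
             ("Key1", 3.14))

# Variant 2: one narrow key/value table per data type,
# so each column can be NOT NULL and correctly typed.
conn.execute("CREATE TABLE measurement_double (key TEXT NOT NULL, value REAL NOT NULL)")
conn.execute("CREATE TABLE measurement_int    (key TEXT NOT NULL, value INTEGER NOT NULL)")
conn.execute("INSERT INTO measurement_double VALUES (?, ?)", ("Key1", 3.14))

print(conn.execute("SELECT key, double_value FROM measurement").fetchall())
```

Note that SQLite uses dynamic typing (type affinity), so the per-column type discipline is advisory there; on SQL Server or Oracle the same DDL shape gives hard type enforcement.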
For the data to be used later for "visualization [graphs], reporting and calculations", wouldn't you sort of have to know what you will be getting as input?
I mean, if one feed were average ice cream temperatures per city and another average height above sea level per state, there's not much you could do with those two sets of data together.
As far as the various input formats go, wouldn't you also sort of have to know which format you will be getting from which source?
I would set up a parser for every feed I received that would parse the input, then store the data appropriately in relational tables so it can easily be used later.
ASKER
We are planning to write a parser that would parse the input and store the data in a relational database... However, we are not sure how the tables should be designed. The suggestion by sdstuber seems good.
It really depends on the amount of data and the performance your reports and calculations need.
But basically your data looks like you need an EAV (entity-attribute-value) model.
You may also consider a data-warehouse-style star schema. But this depends on the semantics of your data.
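A minimal sketch of the type-safe EAV shape suggested here, in SQLite (all table and column names are hypothetical; a real design would add indexes and the remaining value-type tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Type-safe EAV: entities and attributes are shared; each value type
# gets its own table, so the type of a stored value is fixed by which
# table it lives in rather than by a generic "value" column.
conn.executescript("""
    CREATE TABLE entity    (entity_id INTEGER PRIMARY KEY, source TEXT);
    CREATE TABLE attribute (attr_id   INTEGER PRIMARY KEY, name   TEXT UNIQUE);
    CREATE TABLE value_num (entity_id INTEGER REFERENCES entity,
                            attr_id   INTEGER REFERENCES attribute,
                            value     REAL,
                            PRIMARY KEY (entity_id, attr_id));
    CREATE TABLE value_str (entity_id INTEGER REFERENCES entity,
                            attr_id   INTEGER REFERENCES attribute,
                            value     TEXT,
                            PRIMARY KEY (entity_id, attr_id));
""")
conn.execute("INSERT INTO entity VALUES (1, 'feed-A')")
conn.execute("INSERT INTO attribute VALUES (1, 'Key1')")
conn.execute("INSERT INTO value_num VALUES (1, 1, 42.0)")

row = conn.execute("""
    SELECT a.name, v.value
    FROM value_num v JOIN attribute a ON a.attr_id = v.attr_id
    WHERE v.entity_id = 1
""").fetchone()
print(row)
```

The trade-off, as noted above, is query complexity: every report has to join attribute metadata back onto the value tables.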
ASKER
The data can be huge. Our plan is to develop something along the lines of the following software:
http://www.waters.com/waters/en_US/Empower-3-Chromatography-Data-Software/nav.htm?cid=513188&locale=en_US
http://www.thermoscientific.com/en/product/dionex-chromeleon-7-2-chromatography-data-system.html
>>> the data can be huge ,
If the data really is going to have large volume, then don't put anything between you and your data to make it less efficient.
Create your tables and columns with the correct data types.
If that means using multiple columns or multiple tables that's ok.
Better than ok, it's better than trying to stuff different types of data into a single structure.
In general I would prefer a type-safe EAV model: one value table per data type.
Using SQL Server 2012+ I would also consider using Columnstore Indexes with the data warehouse approach.
ASKER
One of the problems is that we need to support at least 3 databases: SQL Server, Oracle, and SQLite...
>> One of the problems is that we need to support atleast 3 databases
I highly recommend NOT trying to make a generic solution. One of the hallmarks of poor applications is sacrificing quality in the name of portability.
Create tables and columns with consistent names but create them using whatever datatypes are appropriate to that platform.
Well, @sdstuber is correct. With your three selected RDBMSs there is no generic solution that performs best on all of them. This mainly depends on SQLite.
But don't mix the semantic model with the implementation. Accessing your data via the same semantic views and procedures allows you to use the appropriate table model beneath them. That underlying model can be optimized per RDBMS and thus differ from RDBMS to RDBMS.
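The "same semantic view, different tables underneath" idea can be sketched in SQLite; the view name `all_measurements` and the underlying tables are illustrative, and on Oracle or SQL Server only the table layer would change while the view keeps its name and columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE measurement_double (key TEXT, value REAL);
    CREATE TABLE measurement_int    (key TEXT, value INTEGER);
    -- The view is the stable "semantic" interface the application
    -- queries; the per-type tables behind it are an implementation
    -- detail that may differ per RDBMS.
    CREATE VIEW all_measurements AS
        SELECT key, CAST(value AS REAL) AS value, 'double' AS type
        FROM measurement_double
        UNION ALL
        SELECT key, CAST(value AS REAL), 'int' FROM measurement_int;
""")
conn.execute("INSERT INTO measurement_double VALUES ('Key1', 2.5)")
conn.execute("INSERT INTO measurement_int    VALUES ('Key2', 7)")
rows = conn.execute(
    "SELECT key, value, type FROM all_measurements ORDER BY key").fetchall()
print(rows)
```

The application code then stays identical across the three target databases, touching only `all_measurements`.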
ASKER
OK... Can I summarize that the most acceptable solution would be to create tables for each data type and store the data there?