Solved

How much to normalize?

Posted on 2013-12-28
Last Modified: 2014-01-05
I'm rewriting an ASP.NET app and redesigning the database.  The app is for collaborative construction of several distinct but related types of process development models, made up of model parts and annotations.  I'm trying to figure out how much normalization I should apply.

The key part of the current database looks "something" like this in a logical view.

(Logical Schema 1)

ModelArea
      ModelType1Group
          ModelType1
               ModelType1_Part   (several different types of parts within ModelType1)
                   ModelType1_Part_AnnotationType1 (several different types of annotations)
                   ModelType1_Part_AnnotationType2 (several different types of annotations)
                   ModelType1_PartVote_Type1
                   ModelType1_PartVote_Type2
      ModelType2
          ModelType2_Part  (several different types of parts within ModelType2)
               ModelType2_Part_AnnotationCategory1 (several different types of annotations)
               ModelType2_Part_AnnotationCategory2 (several different types of annotations)
               ModelType2_Part_AnnotationCategory3 (several different types of annotations)
               ModelType2_Part_VoteType1
               ModelType2_Part_VoteType2
      ModelType3
          ModelType3_Part (several different types of parts within ModelType3)
               ModelType3_PartAnnotation_Category1 (several different types of annotations)
                ModelType3_PartAnnotation_Category2 (several different types of annotations)
               ModelType3_Part_VoteType1
                ModelType3_Part_VoteType2


There are other relationships among these objects

Each ModelType might have a relationship to other ModelTypes
     i.e. a ModelType1 row may have a relationship with a ModelType2 row
Each ModelType_Part might have a relationship to the other ModelType_Parts
     i.e. ModelType1_Part[n] might be a parent of ModelType1_Part[m]
     i.e. ModelType1_Part[g] might be a parent of ModelType2_Part[k]
Annotations can be made on other Annotations

The Parts and Annotations elements are interrelated through Outline objects that create an ordering for them to be viewed in hierarchical lists.

None of these elements have a particularly complex set of columns; they all have a "main text" field, parent ID field, type field, and things like create date, and author.

The entire data set for one instance of the logical schema above will be less than 500KB 95% of the time; the median will be about 250KB.  The data is maintained in memory, with reads served from memory and updates written to disk, about one write per second per data set, with occasional full dataset reloads (once an hour?).

The current physical schema looks something like this:

(Current Physical Schema)

ModelArea
    Model  (for all model types) (median count per dataset = 4)
        Outline  (a set of rows that associate the same Parts and Annotations
                      in different ways for different ModelTypes;  1 Outline per model)
             Vote   (All Vote Types in one table) (median count per dataset = 2500)
                       (Votes have contextual meaning, so they are linked to the
                       Parts and Annotations through Outlines)
   List      (All Part Types and Annotation Types in one table)
               (median count per dataset = 300)
               (Every List element has at least one Outline element)
               (about 12 columns that are not common to all Parts and Annotations)
                (Note that List elements are associated with the Model Area,
                not the individual models.)

The above scheme was done largely to keep the relationship between the Outline and the List as simple as possible.

Considering:
- all the associations among and within the tables,
- the small scope and volume per any query on the dataset
- the fact that the query work will mostly be done in memory,
- the fact that all these elements are abstracted parts of models of processes, so there is no concrete set of "end target" objects to optimize around,  
 
then here's my current analysis about the value of more fully normalizing the current schema toward the logical schema at the top of this question:

-  Benefits of greater normalization
    -  I'll save some space, processing, and error checking by reducing nulls.
-  Costs of greater normalization
    -  Perhaps several types of Model, Vote, Annotation and Part tables, each with essentially the same column structure.
    -  More joins (or the in-memory, class object equivalent)
    -  Rewriting the Outline algorithms (which may just need to happen anyway)

I can see going a little further toward normalization with the following:

(Proposed slightly more normalized schema)

ModelArea
    Model  (for all model types)
        Outline  
            Vote  (all Vote types in one table, with columns including
                      voter, vote parent type, parent id, vote type and vote value)  
    Part    (all Part types for all Models in one table)
            SubPartType1    (table with fields that are unique to some Part types)
            SubPartType2    (table with fields that are unique to some Part types)
    Annotation      (all Annotation types in one table)
            SubAnnotationType1    (fields that are unique to some Annotation Types)
            SubAnnotationType2    (fields that are unique to some Annotation Types)

- Maybe joins between the separate Part and Annotation tables would be faster than "Finds" within a combined table (of course, these are all going to be strongly typed class object Lists in memory...)  But Parts would still have relationships to Parts, and Annotations would still have relationships to Annotations; so I'm not sure of the benefit of separating Parts and Annotations.

MY QUESTION to the EE experts is of a general nature: please offer guidelines, critiques and suggestions to help me decide whether to:
-  move toward the proposed slightly more normalized schema,
-  or normalize further than that
-  or stop with just adding the SubType tables,
-  or not bother changing the current schema at all.

Any inputs on this would be appreciated.

Thanks!
Question by:codequest
33 Comments
 
LVL 45

Assisted Solution

by:Kdo
Kdo earned 100 total points
ID: 39744746
Hi codequest,

Some questions can't be properly answered with generalities.  "How much to normalize" is one of them as the answer lies in the data and that's unknown to us.

The first question is "how much data is there"?  In terms of storage requirements (1G, 100G, etc.) and row counts (about 10,000,000 rows now, growing by 2,000,000 per year, etc.)

The second question is "how much static or semi-static data is there?".  If the application manages the "accounts" of more than a million people or businesses, it might serve you well to normalize out the addresses.  If there's little or no duplication in the address data, you may still want to normalize out the addresses if typical usage or typical reporting doesn't need the address, or if shortening the row by about 30-50 characters (street1, street2, city, state, zip) increases the average number of rows stored per block/page.

The third, and perhaps most critical, is "where is performance the most important?"  If you're running a web page with 10 million subscribers, a single row read of the user record is probably the most critical item while reporting is secondary.  One or two seconds to complete a signon in exchange for a 10 minute offline report is very fair.  If efficient access to the accounting records is more important than the user records, the design could be very different.

There are certainly other considerations, but determining how much to normalize is very much dependent on what you're building and how it will be used.  A good starting point is to be fairly aggressive in normalizing the static/semi-static data and very aggressive in denormalizing the transaction data.  The transaction data records "what happened" and can often be accomplished with a single row INSERT if the transaction data is not normalized.  If you normalize the transaction log, you require additional database overhead to acquire the critical keys that you want to store.  Even worse is that reporting against a normalized transaction log can be hugely inefficient as you have to join the other tables to a query that may result in millions of intermediate rows that often require multiple "full table scans" to analyze.
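To illustrate the transaction-log point with a sketch (every table and column name below is invented for illustration, not taken from your schema): the denormalized version records "what happened" in one self-contained row, while the normalized version has to resolve the keys it references first.

-- Hypothetical denormalized transaction log: the whole event is captured in one INSERT.
INSERT INTO EventLog_Denorm (EventTime, UserName, UserCity, ItemCode, ItemDescription, Amount)
VALUES (GETDATE(), 'jsmith', 'Denver', 'AB-100', 'Widget', 19.95);

-- Normalized equivalent: the surrogate keys have to be resolved before (or during) the write.
INSERT INTO EventLog_Norm (EventTime, UserID, ItemID, Amount)
SELECT GETDATE(), u.UserID, i.ItemID, 19.95
FROM Users u
CROSS JOIN Items i
WHERE u.UserName = 'jsmith'
  AND i.ItemCode = 'AB-100';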


Kent
0
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 39746916
In db-design-speak, typically everyone tries to normalize through 3rd normal form.  That means no repeating groups, and all data in every table related to its whole key and nothing but the key (no transitive dependencies).

Btw, the volume of data is 100% IRRELEVANT to proper modeling of the data.

Many-to-many relationships are modeled through an additional "relationship" table between the two elements, which includes the full keys of both parents and all other data for that specific relationship (and only that data).

I don't understand your data relationships well enough here to get very specific, but something along the lines of these tables at least (more tables are definitely needed as well; a rough sketch of one of them follows the list):

Model

Part
?Part_Subtype1
?Part_Subtype2

Model_Part

Part_Annotation
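For instance, the Model_Part relationship table might look roughly like this (a sketch only; the column names and key types are guesses, not your actual design):

-- Associative table: keyed by the full keys of both parents,
-- carrying only data that belongs to this specific relationship.
CREATE TABLE Model_Part (
    ModelID    int NOT NULL REFERENCES Model (ModelID),
    PartID     int NOT NULL REFERENCES Part (PartID),
    SortOrder  int NULL,                               -- relationship-specific data only
    CreateDate datetime NOT NULL DEFAULT (GETDATE()),
    CONSTRAINT PK_Model_Part PRIMARY KEY (ModelID, PartID)
);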
0
 
LVL 45

Expert Comment

by:Kdo
ID: 39746985
The classroom answer is certainly to normalize to 3rd normal form, but in the real world  determining how much to normalize is very much dependent on what you're building and how it will be used.
0
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 39747049
It shouldn't be.

The snide "classroom answer" is actually funny, since I've been a full-time DBA for 25 years+.

IF you're insufficiently skilled at data modeling, you might feel a need to short-cut the design process because "I only have a little data".  But that just demonstrates a lack of knowledge, rather than being justified as part of a "practical, real-world" approach.

You NEVER know when the data requirements might grow suddenly.

It's ridiculous to suggest that you should have to re-re-re-design a database as the data grows beyond certain volumes.
0
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 39747212
I don't know of ANY reputable logical db design person and/or training site that even suggests that data volumes have anything to do with proper design.  I'd be interested in seeing such recommendations, or even suggestions, if Kdo knows of ANYONE stating that.
0
 
LVL 2

Author Comment

by:codequest
ID: 39747220
Hmm...  

Here's the story I'm telling myself...

There are approximately 40 part+annotation types.  Many annotation types can be children of multiple part types, in addition to being children of multiple annotation types.  In some views the part types are reversed to be shown as children of the annotations (i.e. "what part does this annotation belong to?")   There are about 8 vote types.  Several of the vote types can be applied to many different types of parts and annotations.

That's why the Outline object (of which there can be as many as needed) mediates all these variable parent-child relationships.   And that's why it doesn't seem to be useful to reflect these relationships in a data schema, which at its core would end up having at least 15 tables instead of 7.  Many of those tables would have identical column structures.  And their relationships would be locked, creating obstacles if I wanted to add more types and relationships later (driven by unforeseen user interest).  Also, I would be writing code that would say "Oh, which part type is this?  OK, based on that, go to Annotation table 1, instead of Annotation table 2, 3 or 4, and go to Vote table 2, instead of Vote table 1 or 3."  Plus deciding whether each Vote type group should have its own class object, or share a class object with a different Type field in it anyway.  (Again, all the "joins" are going to be run in memory using "finds".)

Again, I'm not modeling a concrete set of "things in the world", consisting of a fixed set of entities and relationships (accounts, invoices, etc).  I'm modeling a modeling system, which is necessarily flexible in how it represents "things in the world", based on 25++ years of systems analysis and design experience.

So, here's the schema I'm currently planning to use, which is a small variation on my original schema.  Parts and Annotations are both Elements, which have two levels of type (main type and sub-type) to cover the 40-odd Element types.    All Elements have a MainText field, a "what type of model did I start in" field, and a few record control fields.  So there will be no nulls.  Some Elements need a few extra fields; those go in subordinate tables.  All Votes have the same fields, based around parent type, parent ID, vote type and vote value.   Again, no nulls.  Again, the DB activity is going to be 99% writes, with occasional loads to memory in about 300KB chunks, where all the reads to the user will take place.  The Outline objects (a set of rows in the Outline table) keep all these glued together and allow them to be sequenced in several different hierarchies.
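To make that concrete, a rough DDL sketch of what I'm describing (every name and column size here is a placeholder, not the real design):

-- Parts and Annotations combined into one Element table (sketch only).
CREATE TABLE Element (
    ElementID   int IDENTITY PRIMARY KEY,
    MainType    tinyint NOT NULL,          -- Part vs Annotation family
    SubType     tinyint NOT NULL,          -- covers the ~40 concrete types
    MainText    nvarchar(1000) NOT NULL,
    OriginModel tinyint NOT NULL,          -- "what type of model did I start in"
    AuthorID    int NOT NULL,
    CreateDate  datetime NOT NULL DEFAULT (GETDATE())
);

-- Extender table for the few Element types that need extra fields (keeps Element free of nulls).
CREATE TABLE ElementExtraType1 (
    ElementID  int PRIMARY KEY REFERENCES Element (ElementID),
    ExtraField nvarchar(200) NOT NULL
);

-- All Votes in one table, again with no nulls.
CREATE TABLE Vote (
    VoteID     int IDENTITY PRIMARY KEY,
    ParentType tinyint NOT NULL,
    ParentID   int NOT NULL,
    VoteType   tinyint NOT NULL,
    VoteValue  smallint NOT NULL,
    VoterID    int NOT NULL
);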

End of talking out loud to myself...

I'm personally buying that story.  Any specific comments on this (annotations, votes, whatever) would be appreciated.  If this is too vague or abstract for further comment, which I appreciate it may be, maybe it's time to call it a day.
0
 
LVL 51

Assisted Solution

by:Mark Wills
Mark Wills earned 200 total points
ID: 39747273
Well, nowadays we are up to 6th Normal Form (and a couple of others in there as well).

Most recent "innovation" was penned in 2002. And has been an interesting topic of discussion ever sine 1970 when Codd first penned the first normal form.

And time does change approaches and attitudes...

More recently the advent of BI did see some denormalisation invade some designs, and now, "Big Data" is changing attitudes yet again.

The tools and the operating environments almost reinforce the changes of attitudes. As the technology improves, it can accommodate a few more latitudes. Almost as if there is a next-gen PNF developing (Practical Normal Form).


A few observations about your designs...

1) proposed is much better than old
2) always try to work out the true "owner" of a data element
3) I am unsure whether the type1 and type2 representations are two attributes or a one-to-many relationship.
4) given subparttype1 and subparttype2 do they both belong exclusively to parts or are they two distinct attributes and should be named accordingly
5) is there the risk of duplication whereby what could be in subparttype1 could be subparttype2
6) same applies to annotationtypes as well...
7) always apply the same rules across multiple data sources, otherwise it can be a nightmare bringing them together, so if you do have other sources, check to see what they are doing.
8) always join wherever possible. But consider this... A join can result in duplicated rows (of the main table) so if they are two distinct attributes, do you want two columns, or two rows.
9) Future demand on data is a lot less predictable - there are increasingly more requirements and demands on data and no longer just OLTP, so flexibility in your design is paramount (imo).

There really is a truck load of possibilities, and the real "trick" with design these days is not just an understanding of the various normal forms, it is really knowing the data and knowing the business, and fulfilling the business requirement.

And of course, knowing what the design is intended for. For OLTP you can normalise as much as you want; for a data warehouse you tend to denormalise a little. And of course if there are other apps and datasources, then they need to be considered as well.

My gut feel is you can go further, however, it is a big improvement and that might achieve your goals.
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 39747290
Sounds as if you are using your "Outline" as the main relationship manager, whereas purists would prefer to see the individual attributes have a more empirical relationship.

I see the word Hierarchy in there too, and the comment "which model did I come from", and 99% writes, and am now wondering if this requirement is more of a multi-attributed BOM-type design...

That makes it more challenging because of the recursive nature of hierarchies...

Some databases have introduced specific tools (such as SQL Server) to help with those style of designs.

Again it is hard to be specific, because we don't know the data, the business, the goals...
0
 
LVL 2

Author Comment

by:codequest
ID: 39747391
@Mark:  thanks for the input.

I'm being vague because the design is proprietary.

The app is for collaborative construction of several distinct but related types of process development models, each made up of model parts and annotations.   It's basically what happens in a systems development cycle, carrying information forward through different phases. So, imagine that you're in an online group, and with other people you are adding parts to these models, comments about the parts, comments about the comments, and ratings about the parts and comments.   And then, the first set of parts and comments gets transformed into another model, in which a second set of parts and comments is added in a new set of relationships with the original parts and comments.  Again, systems development cycle.

The trick to the underlying method (which has been proven hundreds of times in live meetings and using office automation tools on webex) is to make all this simple enough to be widely applicable while retaining analytic and persuasive power.  The purpose of the app is to remove or at least reduce the need for expert knowledge of the underlying method.

So, the basic representation of a Part is a Part Type, a Text Phrase, and a Parent Part.  The Part can belong to multiple models and have multiple relationships to other Parts, including having different Parents in different model views.  So the Outlines provide those relationships to produce those views.   Imagine a requirement, which is treated differently in a specification than in a plan, but it's still just the same text phrase, and in fact the same little entity.  Comments ("Annotations") are basically the same thing as Parts in terms of required data fields and flexibility of relationships.  There are some Part types and some Annotation types that require a few extra fields, so yes, create little extender tables for those other fields.  But at some point the display relationship between Annotations and Parts will be flipped ("That's an important requirement; what module does it refer to?")

So, this is where my knowledge of DB design falls down.  I'm not able to see a benefit in terms of storage, performance, coding complexity etc, of putting Parts and Annotations in separate tables.  Again, considering that one of these multi-model data sets is going to have probably fewer than 500 combined Parts and Annotations, totaling 300KB, 99% of the time, and be worked on almost entirely in memory using class objects, with essentially zero connection to any other data in the entire database.

If I could understand that, I could apply the same lesson to Votes.

I'm going to construct a more refined ERD, to try to normalize further, to see what I get.   But I'm running into a conceptual barrier that the different varieties of Parts, Annotations and Votes just don't have data structures or behaviors that are significantly different.     So I'm trying to figure out why I would want to create more tables, more class objects, and more rigid relationships to reflect those secondary or tertiary differences.  Yes, I can say that certain annotations only attach to certain Parts, and certain types of Votes relate to only certain types of Parts or Annotations.  But what does it buy me to make those into rigid physical distinctions?

Happy to learn, though.
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 39747433
Don't worry too much about how the tables might look, that can be a trap. We often see an all encompassing reference table (EAV) where multiple unrelated data elements are stored, but that is typically because it is "easy" and not reflective of the real relationships.

And must admit, I have used that approach in some cases to great effect.

While it might be tempting to create a table with "RecordType, PK, Value, Description, WhateverElse" where record type is one of "Vote","Part","Annotation" then a change of requirement to "Vote" means you need to affect all other Types, and that might impact on a wide variety of other dependencies (like your code).

If there is no relationship between those three, other than being connected to Models, then keep them separate. The joins will be cleaner, the data structures will be pertinent (albeit very subtly different), and for each model you can retrieve the separate attributes, so it is easier to "flip" between showing Annotations or Parts (or Votes).

Yes, it may require another query, but the pointers should all be there. And with a specific query using the right unique keys and indexes, performance should not be a major problem.

Have a play with your ERD, and look at the data elements not in terms of similar structures as much as unique attributes.
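As a trivial illustration of that "flip" (the table and column names here are invented): the same relationship drives both presentations, and only the driving side of the query changes.

-- Annotations listed under their Parts:
SELECT p.MainText AS Part, a.MainText AS Annotation
FROM Part p
JOIN Annotation a ON a.PartID = p.PartID
ORDER BY p.PartID;

-- Flipped: Parts listed under their Annotations ("what part does this annotation belong to?"):
SELECT a.MainText AS Annotation, p.MainText AS Part
FROM Annotation a
JOIN Part p ON p.PartID = a.PartID
ORDER BY a.AnnotationID;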
0
 
LVL 2

Author Comment

by:codequest
ID: 39747479
@Mark:  Thanks for the input.  I got the core schema easily up from 7 to 13 tables.  And now it's starting to make sense.   I appreciate the comment about the structure of, say, Parts changing and then affecting Annotations (Votes would be different in any case).

It also makes me think about the speed (and resource usage) of the "Finds", if the Models' Type Parts can fully be separated in terms of appearing only in separate views, even if they only do that once in a while. Since most views will be controlled by an Outline (which translates into a sequential read index against a set of otherwise un-ordered Model Parts, i.e. added somewhat randomly), if the set of Model Parts that the Outline points to is smaller, the view will be constructed faster and more economically (faster Finds, fewer CPU cycles).  

So, getting back to basics (which I thought I understood, because it's a "rewrite", but doh!)
>  Review/Catalog all the possible views, and rate them for frequency of use
>  Separate the data by how it's going to be accessed in terms of speed/CPU usage

I'm going to make a couple/few different ERDs and ponder those with those ideas and your other suggestions in mind (and those of Kdo and Scott :0)

Very helpful input.  I'm not sure we can do much more with this;  you've been very generous.  I may sit on this a day or two and then close it out.

Grazie!
0
 
LVL 51

Accepted Solution

by:
Mark Wills earned 200 total points
ID: 39747498
*laughing* Well, would love to look under the "proprietary" bonnet - it could make a big difference (or simply add to the confusion - not sure).

Frequency of use and Speed/ CPU usage are important design considerations, but then you will also need an understanding of index structures and database optimisation techniques as well (and IO caching etc). So, depending on MySQL or SQL Server, there might be subtle differences, and while it is important, a good design should apply regardless.

There are choices for your app's interface to the data as well. Creating views over the tables, use of stored procedures and so on can mean the hard work is done on the server and might make things more straightforward (with reduced data traffic) for your app. But that is another layer over the top of what should be a good data design in the first place.

Best to consider the better design after the ERD's, and depending on your ERD tool of choice, compare the relative efficiencies of each different design. And always, always, look at how you are going to retrieve the data.

Good luck with your ERD designs, and hope to hear back how that works out for you.

Happy 2014 :)
0
 
LVL 26

Assisted Solution

by:Zberteoc
Zberteoc earned 100 total points
ID: 39748246
My goodness, you could write a book about database theory with these answers. :o) However, that doesn't help the asker too much.

I am not saying that I have a solution, but I think a more practical approach would be to determine the possible issues that will pose problems in the future with regard to two parameters: storage space and performance.

If storage presents a problem then normalizing as deep as you can will help, because normalizing is about eliminating data redundancy in the first place. The drawback here is that you will have to use more joins in order to retrieve data, which could translate into slower access.

If performance is an issue then denormalizing here and there would help, because denormalizing is about faster access. The drawback in this case is that it needs more storage, and maintenance is more difficult as modifications to the same piece of information have to be made in more places.

There is also the aspect of the best of both worlds, so the complete question should be: "how much to normalize and how much to denormalize". You could have a normalized database to address the data storage and a denormalized database to address the performance of retrieving that data; more like reporting or even warehouse databases with pre-summarized data that would have to be updated periodically in a more or less automated manner. In the end the way the data is used should give you the hint which way to lean. Fast storage with less space - normalize; fast access - denormalize.

Anyway, these kinds of questions are too complex for a forum like EE and the answers will never be too specific. These are more like general directions, but when it comes to addressing a real application the specifics have to be taken into consideration.
0
 
LVL 2

Author Comment

by:codequest
ID: 39749306
FYI, I got an interesting related result here (which I felt was a separate question):

http://www.experts-exchange.com/Programming/Languages/.NET/Q_28328524.html
0
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 39750178
The primary purpose of normalization is to prevent data anomalies, i.e. inconsistency or errors, not to save/use disk space.

And that's what we see in practice all the time -- a superficial, super-quick "design" biased to what is in place today, and on-going data inconsistencies and errors.

The "best practice" approach is to do a proper logical design.  Then convert that logical design to a physical design.  It's not that long a process if you have someone that knows how to do it and oversee it.

I'm certain that whoever did the original design felt it was "good enough" at that time as well.  So, you can continue that same error and re-re-re-design cycle, or you take a bit more time, do one proper design, and simply refine the code as coding advances are made.
0
 
LVL 26

Expert Comment

by:Zberteoc
ID: 39750277
Scott is right, with the exception of disregarding the space problem, which at the beginning of database usage was an important aspect. There were no gigabytes floating around. Definition from Wikipedia:

Database Normalization: Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency.

Redundancy: Data redundancy occurs in database systems which have a field that is repeated in two or more tables. For instance, when customer data is duplicated and attached with each product bought then redundancy of data is a known source of inconsistency, since customer might appear with different values for given attribute.[1] Data redundancy leads to data anomalies and corruption and generally should be avoided by design.[2] Database normalization prevents redundancy and makes the best possible usage of storage.
0
 
LVL 2

Author Comment

by:codequest
ID: 39750453
Thanks for the input.  Perhaps an example would help.  What are some use cases in which:

---One Vote Table---
ID
ParentType  (ParentType 1, 2, 3, or 4)
ParentID    (within ParentType 1, 2, 3, or 4...but Finds and Joins would also use ParentType)
VoteType   (Type 1, 2, 3, or 4)
VoteValue  (one of several possible values among all Types)

could produce more errors or inconsistencies than

-- multiple vote tables----
(VoteType1Table)
ID
ParentID  (within ParentType1)
VoteValue   (one of several possible values among Type1)

(VoteType2Table)
ID
ParentID  (within ParentType2)
VoteValue   (one of several possible values among Type2)

(VoteType3Table)
ID
ParentID  (within ParentType3)
VoteValue   (one of several possible values among Type3)

(VoteType4Table)
ID
ParentID  (within ParentType4)
VoteValue   (one of several possible values among Type4)
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 39750542
Database design and application design are two different things.

There are crossover points of course, but a good database design is paramount and needs to be considered almost in isolation. Good design should automatically take care of duplicated data (or as wiki might say "redundant" - but IMO no data is redundant unless it has no use, rather, it is just not in the correct place and therefore repeated unnecessarily).

Now, the VoteTypeTable example...

Why is there 4 tables ?

They are all votetype, so why (how) is votetype1 different to votetype4 is there something very specific about that number that warrants 4 different tables ?

If there is a limit of 4 votes, then a constraint can be added. If there must be one of 4 different votes, when what is the basis of the 4 differences ?

Seems that some aspects of the application design are compromising the database design. And there is no real difference (apart from <ahem> interesting </ahem> design choices) between the two examples above other than "how to retrieve".

Each one could become just as compromised without the relational constraints (or other constraints like a check constraint) on what the vote type 1,2,3,4 actually means.
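For example (a sketch only, with invented names), the legal VoteType/ParentType combinations can be declared in the database rather than left to application code:

-- Allowed VoteType/ParentType combinations, declared once.
CREATE TABLE VoteTypeParentType (
    VoteType   tinyint NOT NULL,
    ParentType tinyint NOT NULL,
    CONSTRAINT PK_VoteTypeParentType PRIMARY KEY (VoteType, ParentType)
);

-- The single Vote table can then only hold combinations the rule table allows.
CREATE TABLE Vote (
    ParentType tinyint  NOT NULL,
    ParentID   int      NOT NULL,
    VoteType   tinyint  NOT NULL,
    VoteValue  smallint NOT NULL,
    VoterID    int      NOT NULL,
    CONSTRAINT FK_Vote_AllowedCombination
        FOREIGN KEY (VoteType, ParentType)
        REFERENCES VoteTypeParentType (VoteType, ParentType)
);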

As for potential differences... From a join perspective, having 4 different tables means 4 joins, and they translate into four columns. Having the one table (being joined once) means 4 rows. Having an ID means it is still quite possible to have more than one ID for a parent for a type1 vote. Meaning you will then need to deal with multiple columns and multiple rows, and that depends on the type/nature of the data relationships, e.g. dependent on the type of join.

Let's assume an inner join as an example (and maybe oversimplifying). Having all votetypes in one table will mean you get a result if any 1 of the 4 votetypes exists. An inner join (for example) on the 4 different tables will mean no rows will be returned unless all 4 tables have a row that satisfies the join criteria.

If there MUST be 4 votetypes before a row can be retrieved then there is a different relationship that has yet to be defined. Or maybe they are optional (so change the join type).
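To illustrate that join difference (table and column names here are invented):

-- One Vote table: a part is returned if any vote of any type exists for it.
SELECT p.PartID, v.VoteType, v.VoteValue
FROM Part p
JOIN Vote v ON v.ParentID = p.PartID;

-- Four vote tables with inner joins: a part is returned only if all four vote types exist for it.
SELECT p.PartID,
       v1.VoteValue AS Vote1, v2.VoteValue AS Vote2,
       v3.VoteValue AS Vote3, v4.VoteValue AS Vote4
FROM Part p
JOIN VoteType1Table v1 ON v1.ParentID = p.PartID
JOIN VoteType2Table v2 ON v2.ParentID = p.PartID
JOIN VoteType3Table v3 ON v3.ParentID = p.PartID
JOIN VoteType4Table v4 ON v4.ParentID = p.PartID;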
0
 
LVL 2

Author Comment

by:codequest
ID: 39750652
Thanks for the input.

In the example, the vote parents are all different model type parts; e.g. (not real examples) lists of requirements and lists of users.

So the Vote Types are first distinguished by their parent type.

Then, the Vote Types have distinct categories of functional meaning:  forward/hold;  value rank;  like/dislike;  bookmark; etc.   Likes/dislikes on requirements have nothing to do with value rankings on users (the example sucks, but hopefully it gets the point across).   There might be several values within each Vote Type category (e.g. like, dislike, really like, really dislike) but within a Vote Type they all apply to the same "reason I cast this vote".

The vote types will, in a small fraction of form displays, need to be combined in a single row under one parent line, but only for presentation purposes (e.g. on one model part display line, show like/dislike votes (to others) and bookmark votes (for myself)).  But usually not.  And votes are never combined on a single display row across different model types (again, votes on requirements have nothing to do with votes on users).

So we have a cross reference

                           ModelPartType1     ModelPartType2    ModelPartType3

VoteType1                votetable                      N/A                     votetable    
                 
VoteType2                  N/A                          votetable                votetable                      

VoteType3               votetable                      N/A                        votetable                      

VoteType4                  N/A                           votetable                votetable                      


This gives eight vote tables, each with a different parent and each with a different functional meaning.

But again, why is this riskier than

---One Vote Table (covering all three ModelPartTypes)---
ID  (identity)
ParentType  (ModelPartType)
ParentID
VoteType
VoteValue
VoterID

with a join, when I want to know how many rows of a particular VoteType there are for any particular ModelPartType row, using something like this:

Where ParentType = @ModelPartType  (known from the context of the making the query)
        and   ParentID = @ParentID   (the ID of the known particular parent record)
        and   VoteType =  @VoteType  (the particular vote type I'm looking for, from context)

I'm never going to not know what the values of the @ input variables are, either for write or read, and there will never be duplicate ID's in the Parent Tables (they will all be identity fields).

Right now, I would choose the option with several vote tables, because the retrieval time is faster on typed Lists (or Dictionaries) serving as in-memory proxies for the tables, because each vote table is smaller.  Even though it means writing eight Vote interfaces to the DB instead of one, eight retrieval and write functions instead of one, etc.

But I still don't see how errors might be created in the "one vote table" schema.  And maybe in this case they would not be.  Which does not refute the general principle of using normalization to keep the data clean.  It just may not apply in this case.
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 39750735
Not really... Many little tables do not always mean faster, because each in turn needs to be cached, rather than taking advantage of the one table that is cached and having more space available for that one - especially if the key of parent-id is going to mean that a "grab" of data (from the db engine) is going to help solve types 1,2,3,4.

If you are going to provide @modelparttype, @parentid, @votetype then you have no real advantage in having 4 tables, because then your query has to be structurally altered to make sure you are getting the right source table.

From an app perspective, sure, you don't want large volumes travelling over the wire from server to client side. You want the strength of the db server to do as much heavy lifting as it can, delivering as finely tuned a result set as possible. You aren't going to load each table into memory and then source your query answers from that; you want to resolve your query and have the complete query as the "answer" result set.

In your first example matrix, it is just as easy to get the one table to look the same, but can do so more efficiently.

The only justifiable reason to have 4 tables is if there really is a different constraint or relationship between the "parent" and each of the 4 types depending on the individual type. if that makes sense...

Even if you want to access via the 4 types, then the table structures should still follow the nth normal form principles. You can always construct appropriate views / stored procedures etc to present the data the way you want. The structure is one thing, the presentation is another.

e.g.

VoteType Table structure :
 ID  (identity)
 ParentType (ModelPartType)  
 ParentID                                                              
 VoteType  
 VoteValue
 VoterID  

But can then create "views"
create view vw_votetype_1 as
Select
 ID  (identity)
 ParentType (ModelPartType)  
 ParentID                                                              
 VoteType  
 VoteValue
 VoterID  
from tbl_votetype
where votetype = 1

And the "engine" will resolve that to the best optimisation it can.

But still, if there is a disjoint between votetype 1 and votetype 4 then are they actually the same thing ? if distinct and independent attributes then name them differently and each should have its own relationship.
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 39750818
Now, you might get a more understanding or sympathetic response if (say) votetype determines how the "value" needs to be interpreted (for example votetype 1 and value = 1 actually means "Yes")

But we also have parenttype and parentID, and shouldn't parenttype be an attribute of the parent on the parent table, which is linked to (joined) via parentID? Or are we trying to use parenttype to identify what table it should link to?

When you start to build in conditions within the table to decide how the data within that table needs to be used then that raises questions and a good design should not.

When we start getting these types of questions, then the design is not clear as to what the relationships should be. And if not clear on paper then the DB possibly wont be able to perform to its optimal level because there will need to be (always) the additional predicates to clarify, and the more you have to add in to your "where" or "on" then the easier it is for the optimizer to say "screw this we will simply scan the lot" if it is not immediately obvious, or too complex.

It can seem quite daunting, and for every good design, there is a myriad of ordinary designs. The ordinary designs survive because they work and they are understood well enough to achieve the business requirement. That doesn't make them good, just makes them workable. A bit like the very original design you had in the question header.

But hang in there, you will get a design that makes perfect sense to you, and you'll be able to code against it...
0
 
LVL 69

Assisted Solution

by:ScottPletcher
ScottPletcher earned 100 total points
ID: 39751388
Normalizing doesn't necessarily mean single tables for everything.  It depends on the data requirements.
>>
---One Vote Table---
ID
ParentType  (ParentType 1, 2, 3, or 4)
ParentID    (within ParentType 1, 2, 3, or 4...but Finds and Joins would also use ParentType)
VoteType   (Type 1, 2, 3, or 4)
VoteValue  (one of several possible values among all Types)
<<

That could be perfectly normalized: no repeating values, everything dependent on the key ... which brings up the key point in design.

First, GET RID OF AN ID/IDENTITY COLUMN AS BEING IN (ALMOST) EVERY TABLE BY DEFAULT!!  That is THE single biggest myth about table design (and clustering).  Today, that myth causes more performance issues than the next top 5 design problems combined.

The logical key for the entity, and the clustering index for the table, should likely be ( ParentType, ParentID, VoteType ).  That natural key will perform vastly better on joins.  Yes, it's multiple columns and more bytes than a single integer, but so what: those things are NOT automatically wrong (besides, it's not that much wider anyway, since ParentType and VoteType can both be tinyints, just 1 byte each).  There could even be multiple rows with those keys, and even that's OK too.
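A sketch of that layout (the column types are assumptions):

CREATE TABLE Vote (
    ParentType tinyint  NOT NULL,
    ParentID   int      NOT NULL,
    VoteType   tinyint  NOT NULL,
    VoteValue  smallint NOT NULL,
    VoterID    int      NOT NULL
);

-- Clustered on the natural key; it does not have to be unique to serve as the clustering key.
CREATE CLUSTERED INDEX CIX_Vote ON Vote (ParentType, ParentID, VoteType);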

You need someone experienced to guide this process, though.  This is certainly not a natural thing for most people, even many DBAs -- some DBAs are really physical-only DBAs, not physical and logical DBAs.
0
 
LVL 45

Expert Comment

by:Kdo
ID: 39751465
Hi codequest,

I'm going to agree with Mark on the design.  If the natural order of the items in that table is the event timing (as they occur), you'll get a lot better performance with an identity column that functions as the primary key than you will from joining a composite key.  A single column (scalar) integer join is the fastest join operation that the DBMS can execute.
0
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 39751489
LOL ... and the identity myth continues to be propagated.

In abstract, a single integer join is fastest.  In practice, it greatly depends on the table relationships and how the tables are clustered.

70+% of my tuning consists of just changing the dopey identity cluster to something better for joins.  I've gotten 99% reductions in I/O just from this change.

[Btw, I am genuinely surprised that so many people with no professional experience doing db design assume themselves to be experts at it: I don't tell my heart doctor to step aside and let me do my own EKG.]
0
 
LVL 45

Expert Comment

by:Kdo
ID: 39751545
In practice, it greatly depends on the table relationships and how the tables are clustered.

I thought I made that qualification.

[I'm surprised at the number of DBAs that fancy themselves architects.  I've spent a large part of my career coming in behind the amateurs with "this is how it should have been done" solutions.]
0
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 39751611
I've done all areas of data design, from beginning to end.

If you can't even allow for a logical design separate from the physical design, you're absolutely not a data architect.  You're a coder/developer, from all the statements you've made.
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 39751745
Fair dinkum Scott...

We all know your Surrogate v Natural stance but there is no need for you to carry on the way you do with your disparaging remarks about other experts whenever identity is mentioned.

FWIW the clustering index versus unique index versus primary key can well be three different constructs, but you always argue that the Natural key must be the PK and/or as the unique index and/or (above) the clustering index.

Do a search, get some big name authors, and check what they say about surrogate v natural keys.  You will find plenty of argument either which way. Even the founding fathers of relational design argue the point (E. Codd and C. Date)  and lay the foundation to differentiate the surrogate from tuple ID's even with other candidate keys.

Now, in terms of experience, I was playing with SQL Server when Sybase was writing it on both OS/2 and VMS and then involved in comparing / debugging Microsoft's first "solo" effort in 95 and 96. And in terms of design, have used both natural and surrogate keys. More recently (since 2005) I do have a stronger tendency toward the Identity because of the way SQL structures its data and how the optimizer works (especially with FK). But that doesn't stop other indexes, rather, it actually helps with their performance.

So, while you might be fine giving yourself an ECG, that is not the real problem... it is how to interpret the results that you want your heart doctor for...

Can we please now all agree to disagree and stop the calling out ?
0
 
LVL 69

Expert Comment

by:ScottPletcher
ID: 39751866
I don't say a natural key must always be it.  But, in the logical design especially, a genuine natural key should be used in preference to an identity.  If you later need to change this to an identity/sequence in the physical design for performance reasons, you can do so (although you really don't need to do it nearly as often as people assume).

I say an identity should never be assumed until the requirement for it is known.  That is, it's a serious mistake to automatically add an identity column to every table (as above).  That's not a design, and it's not a good substitute for a design, it's just a crutch.

What it ends up doing is hiding the real relationships in the data and thus causing a poor logical design.  After all, every entity automatically has a unique "key", so any design looks "good".  The actual, natural keys would help point out anomalies and design issues much earlier in the design process, which is always better.
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 39751934
Thanks Scott, I stand corrected about your stance, and yes, the real relationships of data must be exposed and accommodated as part of the design.
0
 
LVL 26

Expert Comment

by:Zberteoc
ID: 39751942
"What it ends up doing is hiding the real relationships in the data and thus causing a poor logical design.  After all, every entity automatically has a unique "key", so any design looks "good".  The actual, natural keys would help point out anomalies and design issues much earlier in the design process, which is always better."

I don't see how that happens. In the example with the vote table the real relationships will still be defined by the 3 columns ParentType, ParentId and VoteType; the ID column doesn't have any role in the relationship. The 3 columns can still be declared as a unique index. It is true that this unique index will use the ID column to actually point to the data, unless you use INCLUDE, as there is only one other column that is needed, the value.

An ID identity clustered PK column helps with inserts because data doesn't need to be rearranged every time, as the combination of the 3 columns is not guaranteed to arrive in ascending order. With the ID, inserts practically append to the table.

In this particular case the performance impact cannot be significant but it might be in others. Maybe Scott has some example from his experience.
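The variant being described would look something like this (a sketch; whether VoterID belongs in the unique key depends on the actual voting rules):

CREATE TABLE Vote (
    VoteID     int IDENTITY PRIMARY KEY CLUSTERED,  -- new rows simply append at the end
    ParentType tinyint  NOT NULL,
    ParentID   int      NOT NULL,
    VoteType   tinyint  NOT NULL,
    VoterID    int      NOT NULL,
    VoteValue  smallint NOT NULL
);

-- The real relationship is still declared; the surrogate ID plays no part in it.
CREATE UNIQUE NONCLUSTERED INDEX UX_Vote_NaturalKey
    ON Vote (ParentType, ParentID, VoteType, VoterID)
    INCLUDE (VoteValue);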
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 39752013
If you think of the ID as a "device" the real (or logical) data relationships are the candidate keys.

If you then ascertain that the data model for a particular table is best served by the "device" then at least you have designed the logical relationships in the first instance.

If you design the table so that the ID is the unique identifier then you might not "see" the true relationships. If you are inexperienced with the "device" then it becomes the focal point and potentially clouds the issue.

It is a fine distinction, but understand what Scott was saying.

Not entirely sure it will be clarifying or confusing to the Asker though. Not entirely sure it is a design fault in this case as much as a reaction to a comment about "all tables will have an identifier" and worthwhile raising. Preferably without the digs.

And based on the argument of real (or logical) data relationships, that is what I have been digging into with regard votetype 1,2,3,4 and Parenttype 1,2,3,4
0
 
LVL 2

Author Closing Comment

by:codequest
ID: 39757735
Completely fair point allocation was beyond my immediate algorithmic resource capabilities, so I did it based on quick pass intuition.

Thanks to all for the inputs.
0
