How much to normalize?
Posted on 2013-12-28
I'm rewriting an an ASP net app and redesigning the database. The app is for collaborative construction of several distinct but related types of process development models, made up of model parts and annotations. I'm trying to figure out how much normalization I should apply.
The key part of the current database looks "something" like this in a logical view.
(Logical Schema 1)
ModelType1_Part (several different types of parts within ModelType1)
ModelType1_Part_AnnotationType1 (several different types of annotations)
ModelType1_Part_AnnotationType2 (several different types of annotations)
ModelType2_Part (several different types of parts within ModelType2)
ModelType2_Part_AnnotationCategory1 (several different types of annotations)
ModelType2_Part_AnnotationCategory2 (several different types of annotations)
ModelType2_Part_AnnotationCategory3 (several different types of annotations)
ModelType3_Part (several different types of parts within ModelType3)
ModelType3_PartAnnotation_Category1 (several different types of annotations)
ModelType2_PartAnnotation_Category2 (several different types of annotations)
There are other relationships among these objects
Each ModelType might have a relationship to other ModelTypes
i.e. a ModelType1 row may have a relationship with a ModelType2 row
Each ModelType_Part might have a relationship to the other ModelType_Parts
i.e. ModelType1_Part[n] might be a parent of ModelType1_Part[m]
i.e. ModelType1_Part[g] might be a parent of ModelType2_Part[k]
Annotations can be made on other Annotations
The Parts and Annotations elements are interrelated through Outline objects that create an ordering for them to be viewed in hierarchical lists.
None of these elements have a particularly complex set of columns; they all have a "main text" field, parent ID field, type field, and things like create date, and author.
The entire data set for one instance of the logical schema above will be less than 500KB 95% of the time, median will be about 250KB. The data is maintained in memory with reads from memory, write updates to disk, about one write per second per data set, with occasional full dataset reloads (once an hour?).
The current physical schema looks something like this:
(Current Physical Schema)
Model (for all model types) (median count per dataset = 4)
Outline (a set of rows that associate the same Parts and Annotations
in different ways for different ModelTypes; 1 Outline per model)
Vote (All Vote Types in one table) (median count per dataset = 2500)
(Votes have contextual meaning, so they are linked to the
Parts and Annotations through Outlines)
List (All Part Types and Annotation Types in one table)
(median count per dataset = 300)
(Every List element has at least one Outline element)
(about 12 columns that are not common to all Parts and Annotations)
(Note that List elements are associated with the Model Area,
(not the individual models).
The above scheme was done largely to keep the relationship between the Outline and the List as simple as possible.
- all the associations among and within the tables,
- the small scope and volume per any query on the dataset
- the fact that the query work will be mostly be done in memory,
- the fact that all these elements are abstracted parts of models of processes, so there is no concrete set of "end target" objects to optimize around,
then here's my current analysis about the value of more fully normalizing the current schema toward the logical schema at the top of this question:
- Benefits of greater normalization
- I'll save some space and processing and error checking by reduction of nulls.
- Costs of greater normalization
- Perhaps Several types of Model, Vote, Annotation and Part tables, each with essentially the same column structure.
- More joins (or the in-memory, class object equivalent)
- Rewriting the Outline algorithms (which may just need to happen anyway)
I can see going a little further toward normalization with the following:
(Proposed slightly more normalized schema)
Model (for all model types)
Vote (all Vote types in one table, with columns including
voter, vote parent type, parent id, vote type and vote value)
Part (all Part types for all Models in one table)
SubPartType1 (table with fields that are unique to some Part types)
SubPartType2 (table with fields that are unique to some Part types)
Annotation (all Annotation types in one table)
SubAnnotationType1 (fields that are unique to some Annotation Types)
SubAnnotationType2 (fields that are unique to some Annotation Types)
- Maybe joins between the separate Part and Annotation table would be faster than "Finds" within a combined table (of course, these are all going to be strongly typed class object Lists in memory...) But Parts would still have relationships to Parts, and Annotations would still have relationships to Annotations; so I'm not sure of the benefit of separating Parts and Annotations.
MY QUESTION to EExperts is of a general nature, to please offer guidelines, critiques and suggestions to help me decide about whether to:
- move toward the proposed slightly more normalized schema,
- or normalize further than that
- or stop with just adding the SubType tables,
- or not bother changing the current schema at all.
Any inputs on this would be appreciated.