Solved

DB design question: multi-level attribute boolean search implementation?

Posted on 2004-09-10
13
447 Views
Last Modified: 2012-06-27
Okay, just for the information, the question is sort of DB independant, but the implementation will most probably be on MySQL.

So, there will be a database of documents, which each will have a number of properties and categories.
Example.

Document A
properties: TITLE, DESCRIPTION, EXPIRY_DATE, LANGUAGE, ...
categories: this document has let us say 3 levels of categories, like:
5 - Food - Textiles - Clothing - Watches - Household
5.18 Clothing
5.18.33 Shoes, shoe parts

so it has :
category level 1 = 5,
category level 2 = 18,
category level 3 = 33

Document B is of different type. It has some of the properties the same as document A, but some are different:
properties: TITLE, DESCRIPTION, RECIPIENTS, ...
categories: this document uses a different category system, which has let's say 5 category levels:

3 - general document
3.1 - memo
A - important
A.1.B.2.F - secrecy level: high

so it has:
categories of  level 1 = 3; A
category of level 2 = 1
category of level 5 = F

The second example is totally random, just to explain that category system can be different.
Also, of course, any document can be in more than one category per level (ie. doc B is in top level categories 3 _and_ A).

Typical queries might be:

give me all documents which are in category "C" in this classification system, or in any of subcategories of it OR ( documents that have EXPIRY_DATE more than year in future AND NOT have subcategory "5.18")

So, the questions are:

1. What would be a good general table design be for such document storage, that would allow cross-category search on arbitrary number of attributes and categories, considering that additional classification systems will be added later.

2. Any ideas about performance of such a design?

The query builder and parser in programming language (PHP) will be a separate pain, but that's something I've already done in past, so I could reimplement it and tweak a bit.

(Wanted to give 1000 points, but said max is 500, however, for a good and concise answer, I will give 1000 total somehow)
0
Comment
Question by:gnudiff
  • 7
  • 5
13 Comments
 
LVL 9

Assisted Solution

by:solution46
solution46 earned 200 total points
Comment Utility
gnudiff,

OK, had a bit of a think about it and I would suggest the following... The table structure is based on SQL Server datatypes and so on, but these can be converted easily enough.

[Category]
CategoryID (int, PK)
Level1 (tinyint)
Level2 (tinyint)
Level3 (tinyint)
Level4 (tinyint)
Level5 (tinyint)

[Document]
DocumentID (int, PK)
DocumentName nvarchar(50)
Description nvarchar(500) -- or whatever

[Recipient]
RecipientID (int, PK)
RecipientName (nvarchar(50)) -- or whatever
-- any other details you want to store about your recipients

[DocumentCategory]
DocumentID (int, PK, FK lookup to [Document]![DocumentID])
CategoryID (int, PK, FK lookup to [Category]![CategoryID])

[DocumentRecipient]
DocumentID (int, PK, FK lookup to [Document]![DocumentID])
RecipientID (int, PK, FL lookup to [Recipient]![RecipientID])


I would normally recommend a hierarchical design of the Category table, but in this case I think this design will give you much better performance. You will need to enter each of the categories in the table but I don't think it will get too big (I would be surprised if it reaches a thousand rows). Searching for a Category is easy. For instance, category 1.2.3 can be found using...
WHERE Level1 = 1
AND Level2 = 2
AND Level3 = 3
AND Level4 IS NULL
AND Level5 IS NULL

... and all children of 1.2.3 can be found by leaving off the Level4 and Level5 lines.


The [DocumentRecipient] table will allow you to send each table to multiple recipients and easily search for them...
FROM Document d
    INNER JOIN DocumentRecipient dr
        on d.DocumentID = dr.DocumentID
WHERE dr.RecipientID = ...

Performance of this design should be pretty good (assuming your indexes are good an up to date) as there is no particularly complex processing to handle. A hierarchical Category design would be much slower (recursive searching of 'all children category' type searches) and much more complicated to programme.



Hope this helps,

s46.
0
 
LVL 9

Expert Comment

by:solution46
Comment Utility
Sorry, just realised the categories can be alphanumeric. Change the datatypes of Level1 .. Level5 in [Category] to be text.

s46.
0
 
LVL 18

Assisted Solution

by:SjoerdVerweij
SjoerdVerweij earned 300 total points
Comment Utility
0
 
LVL 9

Expert Comment

by:solution46
Comment Utility
SjoerdVerweij,

I was thinking about a hierarchical (nested) data set for this but he's going to be slowed down quite a bit by the recursive seraching. My eventual thoughts were that this structure is simpler, is easier to programme and will run a bit faster. I know is isn't strictly ideal (non-normalised, etc., etc.)and the performance decrease wont be massive but if you can avoid using cursors for this, surely it has to be better?

If you think otherwise, I'd be really interested to hear your views as I have had to work on this sort of thing quite a few times myself.

Rgds,

s46.
0
 
LVL 18

Accepted Solution

by:
SjoerdVerweij earned 300 total points
Comment Utility
s46: I think your solution would have very ugly queries for questions such as

"give me all documents which are in category "C" in this classification system, or in any of subcategories of it OR ( documents that have EXPIRY_DATE more than year in future AND NOT have subcategory "5.18")"

If that is the most common query, personally, I would go for the simple way:

Document
  DocumentID Not Null PK
  ...

Category
  CategoryID Not Null PK
  Name Not Null
  ParentCategoryID Null

DocumentCategory
  DocumentID Not Null
  CategoryID Not Null

This makes everything easy, except for determining the full name of the category (5.18.33). There's two options to handle this, at least on SQL Server:

- Create a table-valued user-defined function to look them up when needed (table-valued since there can be more than one category)

- Add a field to DocumentCategory that holds the full name, and populate it through triggers.


0
 
LVL 9

Expert Comment

by:solution46
Comment Utility
hmmm...

SELECT *
FROM Document d
    INNER JOIN DocumentCategory dc
        ON d.DocumentID = dc.DocumentID
    INNER JOIN Category c
        ON dc.CategoryID = c.CategroyID
WHERE c.Level1 = 'C'                                 -- gets everything in cat C and all it's subcategoried
OR (DateDiff('d', c.ExpiryDate, getdate()) > 365              -- not sure about my use of DateDiff() but you get the idea...:)
    AND (Level1 <> '5' OR (Level1 = '5' AND Level2 <> '18')))

how would you go about doing this in a nested structure? This would probably be one of the few times I would use a cursor (hence my dislike of the approach) but if you have an efficient way of getting all the subcategories I would love to know what it is (will probably give you some points myself!)

Just thought, I'm assuming SQL Server here; if the better way is in another RDBMS then you don't get any points...!
   
s46.
0
How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

 
LVL 18

Expert Comment

by:SjoerdVerweij
Comment Utility
Actually, you could do it in 1 SQL statement in SQL Server.... 2005  :-)

Otherwise, yep, you're looping (I'd use a while loop instead of a cursor). Depending on what you do most (selecting or updating), triggers or a UDF would take care of it.
0
 
LVL 9

Expert Comment

by:solution46
Comment Utility
*grin* still stuck with 2k here.

anyway, I don't think my query is too ugly... kinda cute, in a really sad sort of way :)))

s46
0
 
LVL 18

Expert Comment

by:SjoerdVerweij
Comment Utility
Cute  :-)

I was really thinking more along the lines of "give me all documents in this category, regardless of level" -- however, that's really an issue only if the categories are a graph, not a tree.
0
 
LVL 9

Expert Comment

by:solution46
Comment Utility
was working on the idea that for, say, a level 3 category to exist, it must share the same level 3 category as its parent and siblings: I think in this case my approach is more efficient. In any circumstance where a sibling would noy inherit properties from its parent like this, I agree with your nested / hierarchical approach.

Anyway, this poor chap is probably confused as hell by now :)

s46
0
 
LVL 18

Expert Comment

by:SjoerdVerweij
Comment Utility
Your solution is almost always more efficient when you're talking about server resources. Mine would be more efficient on your own time the second your manager walks in and says "how about a 6th level"?  :-)
0
 
LVL 3

Author Comment

by:gnudiff
Comment Utility
Hello, and thanks for the answers.

The answer by solution46 is good, however, as I have no idea what might the upcoming category levels be, I think I will feel safer with the recursive table option given by SjoerdVervweij - the point about the 6th level might be well on spot here. Also the DBMS magazine links were very helpful.

I still have to think about the model, but thanks to both of you, I can now evaluate my options better.
0
 
LVL 9

Expert Comment

by:solution46
Comment Utility
Fair enough - if the structure is not fixed, then SV's answer is definately better.

Glad you're sorted,

s46.
0

Featured Post

Maximize Your Threat Intelligence Reporting

Reporting is one of the most important and least talked about aspects of a world-class threat intelligence program. Here’s how to do it right.

Join & Write a Comment

Introduction: I have seen many questions on EE and elsewhere, asking about how to find either gaps in lists of numbers (id field, usually) ranges of values or dates overlapping date ranges combined date ranges I thought it would be a good …
In today’s complex data management environments, it is not unusual for UNIX servers to be dedicated to a particular department, purpose, or database.  As a result, a SAS® data analyst often works with multiple servers, each with its own data storage…
Video by: Steve
Using examples as well as descriptions, step through each of the common simple join types, explaining differences in syntax, differences in expected outputs and showing how the queries run along with the actual outputs based upon a simple set of dem…
Polish reports in Access so they look terrific. Take yourself to another level. Equations, Back Color, Alternate Back Color. Write easy VBA Code. Tighten space to use less pages. Launch report from a menu, considering criteria only when it is filled…

744 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

15 Experts available now in Live!

Get 1:1 Help Now