DB design question: multi-level attribute boolean search implementation?

Okay, just for the information, the question is sort of DB independant, but the implementation will most probably be on MySQL.

So, there will be a database of documents, which each will have a number of properties and categories.
Example.

Document A
properties: TITLE, DESCRIPTION, EXPIRY_DATE, LANGUAGE, ...
categories: this document has let us say 3 levels of categories, like:
5 - Food - Textiles - Clothing - Watches - Household
5.18 Clothing
5.18.33 Shoes, shoe parts

so it has :
category level 1 = 5,
category level 2 = 18,
category level 3 = 33

Document B is of different type. It has some of the properties the same as document A, but some are different:
properties: TITLE, DESCRIPTION, RECIPIENTS, ...
categories: this document uses a different category system, which has let's say 5 category levels:

3 - general document
3.1 - memo
A - important
A.1.B.2.F - secrecy level: high

so it has:
categories of  level 1 = 3; A
category of level 2 = 1
category of level 5 = F

The second example is totally random, just to explain that category system can be different.
Also, of course, any document can be in more than one category per level (ie. doc B is in top level categories 3 _and_ A).

Typical queries might be:

give me all documents which are in category "C" in this classification system, or in any of subcategories of it OR ( documents that have EXPIRY_DATE more than year in future AND NOT have subcategory "5.18")

So, the questions are:

1. What would be a good general table design be for such document storage, that would allow cross-category search on arbitrary number of attributes and categories, considering that additional classification systems will be added later.

2. Any ideas about performance of such a design?

The query builder and parser in programming language (PHP) will be a separate pain, but that's something I've already done in past, so I could reimplement it and tweak a bit.

(Wanted to give 1000 points, but said max is 500, however, for a good and concise answer, I will give 1000 total somehow)
LVL 3
gnudiffAsked:
Who is Participating?

Improve company productivity with a Business Account.Sign Up

x
 
SjoerdVerweijConnect With a Mentor Commented:
s46: I think your solution would have very ugly queries for questions such as

"give me all documents which are in category "C" in this classification system, or in any of subcategories of it OR ( documents that have EXPIRY_DATE more than year in future AND NOT have subcategory "5.18")"

If that is the most common query, personally, I would go for the simple way:

Document
  DocumentID Not Null PK
  ...

Category
  CategoryID Not Null PK
  Name Not Null
  ParentCategoryID Null

DocumentCategory
  DocumentID Not Null
  CategoryID Not Null

This makes everything easy, except for determining the full name of the category (5.18.33). There's two options to handle this, at least on SQL Server:

- Create a table-valued user-defined function to look them up when needed (table-valued since there can be more than one category)

- Add a field to DocumentCategory that holds the full name, and populate it through triggers.


0
 
solution46Connect With a Mentor Commented:
gnudiff,

OK, had a bit of a think about it and I would suggest the following... The table structure is based on SQL Server datatypes and so on, but these can be converted easily enough.

[Category]
CategoryID (int, PK)
Level1 (tinyint)
Level2 (tinyint)
Level3 (tinyint)
Level4 (tinyint)
Level5 (tinyint)

[Document]
DocumentID (int, PK)
DocumentName nvarchar(50)
Description nvarchar(500) -- or whatever

[Recipient]
RecipientID (int, PK)
RecipientName (nvarchar(50)) -- or whatever
-- any other details you want to store about your recipients

[DocumentCategory]
DocumentID (int, PK, FK lookup to [Document]![DocumentID])
CategoryID (int, PK, FK lookup to [Category]![CategoryID])

[DocumentRecipient]
DocumentID (int, PK, FK lookup to [Document]![DocumentID])
RecipientID (int, PK, FL lookup to [Recipient]![RecipientID])


I would normally recommend a hierarchical design of the Category table, but in this case I think this design will give you much better performance. You will need to enter each of the categories in the table but I don't think it will get too big (I would be surprised if it reaches a thousand rows). Searching for a Category is easy. For instance, category 1.2.3 can be found using...
WHERE Level1 = 1
AND Level2 = 2
AND Level3 = 3
AND Level4 IS NULL
AND Level5 IS NULL

... and all children of 1.2.3 can be found by leaving off the Level4 and Level5 lines.


The [DocumentRecipient] table will allow you to send each table to multiple recipients and easily search for them...
FROM Document d
    INNER JOIN DocumentRecipient dr
        on d.DocumentID = dr.DocumentID
WHERE dr.RecipientID = ...

Performance of this design should be pretty good (assuming your indexes are good an up to date) as there is no particularly complex processing to handle. A hierarchical Category design would be much slower (recursive searching of 'all children category' type searches) and much more complicated to programme.



Hope this helps,

s46.
0
 
solution46Commented:
Sorry, just realised the categories can be alphanumeric. Change the datatypes of Level1 .. Level5 in [Category] to be text.

s46.
0
Get 10% Off Your First Squarespace Website

Ready to showcase your work, publish content or promote your business online? With Squarespace’s award-winning templates and 24/7 customer service, getting started is simple. Head to Squarespace.com and use offer code ‘EXPERTS’ to get 10% off your first purchase.

 
solution46Commented:
SjoerdVerweij,

I was thinking about a hierarchical (nested) data set for this but he's going to be slowed down quite a bit by the recursive seraching. My eventual thoughts were that this structure is simpler, is easier to programme and will run a bit faster. I know is isn't strictly ideal (non-normalised, etc., etc.)and the performance decrease wont be massive but if you can avoid using cursors for this, surely it has to be better?

If you think otherwise, I'd be really interested to hear your views as I have had to work on this sort of thing quite a few times myself.

Rgds,

s46.
0
 
solution46Commented:
hmmm...

SELECT *
FROM Document d
    INNER JOIN DocumentCategory dc
        ON d.DocumentID = dc.DocumentID
    INNER JOIN Category c
        ON dc.CategoryID = c.CategroyID
WHERE c.Level1 = 'C'                                 -- gets everything in cat C and all it's subcategoried
OR (DateDiff('d', c.ExpiryDate, getdate()) > 365              -- not sure about my use of DateDiff() but you get the idea...:)
    AND (Level1 <> '5' OR (Level1 = '5' AND Level2 <> '18')))

how would you go about doing this in a nested structure? This would probably be one of the few times I would use a cursor (hence my dislike of the approach) but if you have an efficient way of getting all the subcategories I would love to know what it is (will probably give you some points myself!)

Just thought, I'm assuming SQL Server here; if the better way is in another RDBMS then you don't get any points...!
   
s46.
0
 
SjoerdVerweijCommented:
Actually, you could do it in 1 SQL statement in SQL Server.... 2005  :-)

Otherwise, yep, you're looping (I'd use a while loop instead of a cursor). Depending on what you do most (selecting or updating), triggers or a UDF would take care of it.
0
 
solution46Commented:
*grin* still stuck with 2k here.

anyway, I don't think my query is too ugly... kinda cute, in a really sad sort of way :)))

s46
0
 
SjoerdVerweijCommented:
Cute  :-)

I was really thinking more along the lines of "give me all documents in this category, regardless of level" -- however, that's really an issue only if the categories are a graph, not a tree.
0
 
solution46Commented:
was working on the idea that for, say, a level 3 category to exist, it must share the same level 3 category as its parent and siblings: I think in this case my approach is more efficient. In any circumstance where a sibling would noy inherit properties from its parent like this, I agree with your nested / hierarchical approach.

Anyway, this poor chap is probably confused as hell by now :)

s46
0
 
SjoerdVerweijCommented:
Your solution is almost always more efficient when you're talking about server resources. Mine would be more efficient on your own time the second your manager walks in and says "how about a 6th level"?  :-)
0
 
gnudiffAuthor Commented:
Hello, and thanks for the answers.

The answer by solution46 is good, however, as I have no idea what might the upcoming category levels be, I think I will feel safer with the recursive table option given by SjoerdVervweij - the point about the 6th level might be well on spot here. Also the DBMS magazine links were very helpful.

I still have to think about the model, but thanks to both of you, I can now evaluate my options better.
0
 
solution46Commented:
Fair enough - if the structure is not fixed, then SV's answer is definately better.

Glad you're sorted,

s46.
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.