Still celebrating National IT Professionals Day with 3 months of free Premium Membership. Use Code ITDAY17

x
?
Solved

DB design question: multi-level attribute boolean search implementation?

Posted on 2004-09-10
13
Medium Priority
?
496 Views
Last Modified: 2012-06-27
Okay, just for the information, the question is sort of DB independant, but the implementation will most probably be on MySQL.

So, there will be a database of documents, which each will have a number of properties and categories.
Example.

Document A
properties: TITLE, DESCRIPTION, EXPIRY_DATE, LANGUAGE, ...
categories: this document has let us say 3 levels of categories, like:
5 - Food - Textiles - Clothing - Watches - Household
5.18 Clothing
5.18.33 Shoes, shoe parts

so it has :
category level 1 = 5,
category level 2 = 18,
category level 3 = 33

Document B is of different type. It has some of the properties the same as document A, but some are different:
properties: TITLE, DESCRIPTION, RECIPIENTS, ...
categories: this document uses a different category system, which has let's say 5 category levels:

3 - general document
3.1 - memo
A - important
A.1.B.2.F - secrecy level: high

so it has:
categories of  level 1 = 3; A
category of level 2 = 1
category of level 5 = F

The second example is totally random, just to explain that category system can be different.
Also, of course, any document can be in more than one category per level (ie. doc B is in top level categories 3 _and_ A).

Typical queries might be:

give me all documents which are in category "C" in this classification system, or in any of subcategories of it OR ( documents that have EXPIRY_DATE more than year in future AND NOT have subcategory "5.18")

So, the questions are:

1. What would be a good general table design be for such document storage, that would allow cross-category search on arbitrary number of attributes and categories, considering that additional classification systems will be added later.

2. Any ideas about performance of such a design?

The query builder and parser in programming language (PHP) will be a separate pain, but that's something I've already done in past, so I could reimplement it and tweak a bit.

(Wanted to give 1000 points, but said max is 500, however, for a good and concise answer, I will give 1000 total somehow)
0
Comment
Question by:gnudiff
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 7
  • 5
13 Comments
 
LVL 9

Assisted Solution

by:solution46
solution46 earned 800 total points
ID: 12026711
gnudiff,

OK, had a bit of a think about it and I would suggest the following... The table structure is based on SQL Server datatypes and so on, but these can be converted easily enough.

[Category]
CategoryID (int, PK)
Level1 (tinyint)
Level2 (tinyint)
Level3 (tinyint)
Level4 (tinyint)
Level5 (tinyint)

[Document]
DocumentID (int, PK)
DocumentName nvarchar(50)
Description nvarchar(500) -- or whatever

[Recipient]
RecipientID (int, PK)
RecipientName (nvarchar(50)) -- or whatever
-- any other details you want to store about your recipients

[DocumentCategory]
DocumentID (int, PK, FK lookup to [Document]![DocumentID])
CategoryID (int, PK, FK lookup to [Category]![CategoryID])

[DocumentRecipient]
DocumentID (int, PK, FK lookup to [Document]![DocumentID])
RecipientID (int, PK, FL lookup to [Recipient]![RecipientID])


I would normally recommend a hierarchical design of the Category table, but in this case I think this design will give you much better performance. You will need to enter each of the categories in the table but I don't think it will get too big (I would be surprised if it reaches a thousand rows). Searching for a Category is easy. For instance, category 1.2.3 can be found using...
WHERE Level1 = 1
AND Level2 = 2
AND Level3 = 3
AND Level4 IS NULL
AND Level5 IS NULL

... and all children of 1.2.3 can be found by leaving off the Level4 and Level5 lines.


The [DocumentRecipient] table will allow you to send each table to multiple recipients and easily search for them...
FROM Document d
    INNER JOIN DocumentRecipient dr
        on d.DocumentID = dr.DocumentID
WHERE dr.RecipientID = ...

Performance of this design should be pretty good (assuming your indexes are good an up to date) as there is no particularly complex processing to handle. A hierarchical Category design would be much slower (recursive searching of 'all children category' type searches) and much more complicated to programme.



Hope this helps,

s46.
0
 
LVL 9

Expert Comment

by:solution46
ID: 12026734
Sorry, just realised the categories can be alphanumeric. Change the datatypes of Level1 .. Level5 in [Category] to be text.

s46.
0
 
LVL 18

Assisted Solution

by:SjoerdVerweij
SjoerdVerweij earned 1200 total points
ID: 12027993
0
Survive A High-Traffic Event with Percona

Your application or website rely on your database to deliver information about products and services to your customers. You can’t afford to have your database lose performance, lose availability or become unresponsive – even for just a few minutes.

 
LVL 9

Expert Comment

by:solution46
ID: 12028997
SjoerdVerweij,

I was thinking about a hierarchical (nested) data set for this but he's going to be slowed down quite a bit by the recursive seraching. My eventual thoughts were that this structure is simpler, is easier to programme and will run a bit faster. I know is isn't strictly ideal (non-normalised, etc., etc.)and the performance decrease wont be massive but if you can avoid using cursors for this, surely it has to be better?

If you think otherwise, I'd be really interested to hear your views as I have had to work on this sort of thing quite a few times myself.

Rgds,

s46.
0
 
LVL 18

Accepted Solution

by:
SjoerdVerweij earned 1200 total points
ID: 12029111
s46: I think your solution would have very ugly queries for questions such as

"give me all documents which are in category "C" in this classification system, or in any of subcategories of it OR ( documents that have EXPIRY_DATE more than year in future AND NOT have subcategory "5.18")"

If that is the most common query, personally, I would go for the simple way:

Document
  DocumentID Not Null PK
  ...

Category
  CategoryID Not Null PK
  Name Not Null
  ParentCategoryID Null

DocumentCategory
  DocumentID Not Null
  CategoryID Not Null

This makes everything easy, except for determining the full name of the category (5.18.33). There's two options to handle this, at least on SQL Server:

- Create a table-valued user-defined function to look them up when needed (table-valued since there can be more than one category)

- Add a field to DocumentCategory that holds the full name, and populate it through triggers.


0
 
LVL 9

Expert Comment

by:solution46
ID: 12029285
hmmm...

SELECT *
FROM Document d
    INNER JOIN DocumentCategory dc
        ON d.DocumentID = dc.DocumentID
    INNER JOIN Category c
        ON dc.CategoryID = c.CategroyID
WHERE c.Level1 = 'C'                                 -- gets everything in cat C and all it's subcategoried
OR (DateDiff('d', c.ExpiryDate, getdate()) > 365              -- not sure about my use of DateDiff() but you get the idea...:)
    AND (Level1 <> '5' OR (Level1 = '5' AND Level2 <> '18')))

how would you go about doing this in a nested structure? This would probably be one of the few times I would use a cursor (hence my dislike of the approach) but if you have an efficient way of getting all the subcategories I would love to know what it is (will probably give you some points myself!)

Just thought, I'm assuming SQL Server here; if the better way is in another RDBMS then you don't get any points...!
   
s46.
0
 
LVL 18

Expert Comment

by:SjoerdVerweij
ID: 12029473
Actually, you could do it in 1 SQL statement in SQL Server.... 2005  :-)

Otherwise, yep, you're looping (I'd use a while loop instead of a cursor). Depending on what you do most (selecting or updating), triggers or a UDF would take care of it.
0
 
LVL 9

Expert Comment

by:solution46
ID: 12029902
*grin* still stuck with 2k here.

anyway, I don't think my query is too ugly... kinda cute, in a really sad sort of way :)))

s46
0
 
LVL 18

Expert Comment

by:SjoerdVerweij
ID: 12029937
Cute  :-)

I was really thinking more along the lines of "give me all documents in this category, regardless of level" -- however, that's really an issue only if the categories are a graph, not a tree.
0
 
LVL 9

Expert Comment

by:solution46
ID: 12030602
was working on the idea that for, say, a level 3 category to exist, it must share the same level 3 category as its parent and siblings: I think in this case my approach is more efficient. In any circumstance where a sibling would noy inherit properties from its parent like this, I agree with your nested / hierarchical approach.

Anyway, this poor chap is probably confused as hell by now :)

s46
0
 
LVL 18

Expert Comment

by:SjoerdVerweij
ID: 12031517
Your solution is almost always more efficient when you're talking about server resources. Mine would be more efficient on your own time the second your manager walks in and says "how about a 6th level"?  :-)
0
 
LVL 3

Author Comment

by:gnudiff
ID: 12043531
Hello, and thanks for the answers.

The answer by solution46 is good, however, as I have no idea what might the upcoming category levels be, I think I will feel safer with the recursive table option given by SjoerdVervweij - the point about the 6th level might be well on spot here. Also the DBMS magazine links were very helpful.

I still have to think about the model, but thanks to both of you, I can now evaluate my options better.
0
 
LVL 9

Expert Comment

by:solution46
ID: 12045818
Fair enough - if the structure is not fixed, then SV's answer is definately better.

Glad you're sorted,

s46.
0

Featured Post

Efficient way to get backups off site to Azure

This user guide provides instructions on how to deploy and configure both a StoneFly Scale Out NAS Enterprise Cloud Drive virtual machine and Veeam Cloud Connect in the Microsoft Azure Cloud.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Microsoft Access is a place to store data within tables and represent this stored data using multiple database objects such as in form of macros, forms, reports, etc. After a MS Access database is created there is need to improve the performance and…
This article shows how to get a list of available printers for display in a drop-down list, and then to use the selected printer to print an Access report or a Word document filled with Access data, using different syntax as needed for working with …
This is a high-level webinar that covers the history of enterprise open source database use. It addresses both the advantages companies see in using open source database technologies, as well as the fears and reservations they might have. In this…
In this video, Percona Director of Solution Engineering Jon Tobin discusses the function and features of Percona Server for MongoDB. How Percona can help Percona can help you determine if Percona Server for MongoDB is the right solution for …

721 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question