Want to protect your cyber security and still get fast solutions? Ask a secure question today.Go Premium

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 295
  • Last Modified:

Data Modeling (Entities) [track emails]

This is not an academic assignment.

This question involves entities only, no attributes. Do you agree with the proposed entities? Is another entity needed for the text of the e-mail message body?

I am designing a database that will be used to track the emails that are contained in PDFs. Numerous emails are printed to a single PDF file. Each PDF contains numerous emails. The PDFs are not going to be stored in a database. Only the file path and file name will be tracked. This will be a single user database.

Sent e-mails. The following will be tracked:
1) The date and time the e-mails were sent.
2) The sender’s first and last name.
3) The text of the message body will be copied into a field.

E-mail Images. The following will be tracked:
1) An image of every e-mail. A multipage PDF will contain an image of every e-mail.
2) The e-mail image file path.

Here is a sample report:
Sender: Mary Smith
Date: 1-1-14
Time: 4:00 p.m.
File c:\email\file1.pdf
Email Contents: .. text from e-mail...


Proposed entities:
Emails
Drives
Files
File Types
Paths
Senders
0
Mark01
Asked:
Mark01
  • 10
  • 6
  • 2
3 Solutions
 
Jim Dettman (Microsoft MVP/ EE MVE)PresidentCommented:
I'd question having to have "paths" as a separate entity, unless your doing something along the concept of libraries.

And despite Scott's comments in that first thread, I think your trying to way over think this.   This is not a design that is going to be used across an enterprise and support thousands of users.

I'm not saying don't to it right, but the design is straight forward enough.  It's not going to be obvious until you come up with attributes and start plugging in data and asking all the questions that need to be answered (inquiry / reporting) whether or not your entities are correct.  Since only you have the complete picture and understand the need at this point, your the only one that will be able to tell if it's right or not.   Since your just starting out in DB design, your not going to know that until you work through it.

 And by "plugging in data", I don't mean actually developing the database, but either just roughing out the tables in Access or thinking through it on paper.

 start with what you have now, do the attributes, and then start applying the first three normalization rules.  That will tell you quickly enough what needs to change (if anything).

FWIW,
Jim.
0
 
Scott PletcherSenior DBACommented:
Separating paths is basic normalization: you don't want to have to repeat the same lengthy path value over and over.  You could have, say, 15,000 files in the same folder -- you wouldn't want to waste space storing the full path name 15K times.  Technically, we could (should?) normalize it further, using a hierarchy of our own for the path, to store each folder or subfolder separately, with a table containing parent path links.  I think that's overkill here, and that a de-normalized "one entry per path" is good enough.

I still believe it is 100% wrong to go straight to "tables" instead of doing a logical design.  That said, of course, the total time and effort spent on the logical design should be limited to what is appropriate for that project.  However, the first 2-3 logical designs will take longer, as it's a new process.  And personally I'd err on a mild over-logical-design rather than under-, given how many projects become absolute messes because NO logical design was done.

It's an extraordinarily valuable phase of the project if you record the definitions and rules and, esp., if you can make some of that available as "Help" from within the app.  By this design process, you'll have a vastly better understanding of data requirements, and thus a better app, and built in solid basic documentation.  That helps bring new developers and users up to speed on the app much more quickly than without it.
0
 
Jim Dettman (Microsoft MVP/ EE MVE)PresidentCommented:
<<Separating paths is basic normalization: you don't want to have to repeat the same lengthy path value over and over.  You could have, say, 15,000 files in the same folder -- you wouldn't want to waste space storing the full path name 15K times.>>

  That would depend on an the app and goes to your point about how things will be done.   If everything ends up in the same folder, then I would not even do a table, but simply have a single value as an "App Config" item; all PDF's are stored here <value>

  But the second you say "I want to organize the files in some way", now your storing multiple individual paths, the extreme of that being every file in a different place.

  Now the question is, is the path an entity unto itself or an attribute of the file?  Right now, since a path has no attributes outside of the value itself (Who created it, on what date/time, what it pertains to, etc), then I would argue no (keep in mind the extreme example of every file in a different place) and why bother?  Your just overly complicating the design.   If you keep it separate, think about the process that now needs to take place when I move a file to a new location.

 <<It's an extraordinarily valuable phase of the project if you record the definitions and rules and, esp...>>

  Don't disagree, but the problem is that since he's new to data modeling, he doesn't even know at this point how to ask the right questions.   That's why I think a "hands-on" approach for someone just starting out is helpful and for the moment, you forget about a lot of the technical details.

  What he's looking to do is not overly complicated, nor will it serve as the foundation for some enterprise piece of software.    If it doesn't come out right and takes a few iterations in design to straighten it out, it's not going to be a big deal and a lot will be learned in the process.

 The need for more or less entities will be brought home by trying to plug in data and finding out if records end up in the same shape or not.

 I just think data modeling is one of those things that you learn more by "doing" rather than sitting there trying to think of everything possible under the sun.

Jim.
0
Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
Mark01Author Commented:
Is the Emails entity appropriate for the text of the e-mail message body? The text of the message body will be copied into a field. The e-mail message text will be heavily used in reports.
0
 
Scott PletcherSenior DBACommented:
>> I just think data modeling is one of those things that you learn more by "doing" rather than sitting there trying to think of everything possible under the sun. <<
I've done many logical data models for $25B+ companies.  Since you skipped the logical data modeling part completely, preferring to go straight to discussing tables, I assumed you considered it not worthwhile, which your follow-up comments seem to indicate as well.  Again, we just fundamentally disagree on that point.


>>  If everything ends up in the same folder, then I would not even do a table, but simply have a single value as an "App Config" item; all PDF's are stored here <value> <<
Again, I'm not going to skip the logical design phase and assume I know in advance the best physical development approach to take.  Logical data design is NOT related to programming/development, and to do proper design, you MUST completely ignore the latter.


>> Now the question is, is the path an entity unto itself or an attribute of the file? <<
That's a reasonable point.  I'm willing to go either way on that, so I accepted the poster's preference to make it an entity.  The design doesn't have to be textbook perfect, after all, just thought out and reasonable.  I believe either approach is valid enough to be used.


>>  Right now, since a path has no attributes outside of the value itself (Who created it, on what date/time, what it pertains to, etc), then I would argue no (keep in mind the extreme example of every file in a different place) and why bother?  Your just overly complicating the design. <<
I deeply disagree.  How is a single-level logical key lookup "too complicated"?  Each individual id points to a single, specific full path.  
The other issue is "right now" ... ah, yes, but what happens when:
1) files move into a "cloud"
2) I need to simulate a recovery with file(s) coming from a recovery site
etc.  If I have a straightforward single-id = location, all I have to do is add an id for each new location, and change a file(s) to that id to use a different location.  I don't have to overwrite existing path names, which is much more difficult to undo accurately.  Now that gets overly complicated!
0
 
Scott PletcherSenior DBACommented:
>> Is the Emails entity appropriate for the text of the e-mail message body? The text of the message body will be copied into a field. The e-mail message text will be heavily used in reports. <<

First just get all the different entities -- and possible entities -- and attributes listed.  Then begin to organize them.  Normalization rules will mostly guide you from there.  You should go to at least 3NF.  Don't worry if it seems too difficult, and esp. lengthy, at first.  Once you gain experience with this, you'll be able to do in 2 hours what now takes 4 days.
0
 
Scott PletcherSenior DBACommented:
One other reason path cannot really be just an attribute is that it leads to the lost data problem.

Say folder "c:\data\" has 3 files.  I decide to move them to "d:\data\".  If I overwrite them, I've now lost the fact that "c:\data\" even exists as a valid data location!  When I add a new file to that location tomorrow, I've got to recreate that data -- and, in the meantime, I've lost all history, security and other info about that location.

Again, I could probably make a case for ignoring that type of thing in specific situations and allowing path to be an attribute, but in this specific case I personally certainly would not, as in a file-saving system, the path seems highly significant overall to me, for both documentation and work purposes.
0
 
Jim Dettman (Microsoft MVP/ EE MVE)PresidentCommented:
<<If I overwrite them, I've now lost the fact that "c:\data\" even exists as a valid data location!  >>

  But is it?  What says C:\data is still valid?

  That's why I made the statement about introducing the concept of libraries to organize files.   The question is, is a drive and path only valid because a file is there or no?  If yes, then making paths an entity is just complicating the design.   If no, then it takes on some form of meaning.  What is that meaning other than a specific path?  There must be some other attributes that go with it or why bother with defining "a location".

 In regards to a file move, you say creating a new record  in the "Path" table, then updating an ID in the file table is some how less complicated then simply updating a single field in the file table?    That sure seems like more operations to me.  And now  anytime I work with a file, I will have to do a join.

Jim.
0
 
Scott PletcherSenior DBACommented:
Ok, we just disagree on this.  I agree with the OP that path should be a separate entity period in this case, and you don't.  OP should have enough by now to decide for himself which way he wants to go on this.
0
 
Scott PletcherSenior DBACommented:
>>  And now anytime I work with a file, I will have to do a join. <<

So you're implying that the design should be denormalized, i.e. corrupted, in advance just so that developers can do one less join?  

If you want to remove the complexity of the joins, create a view that eliminates the need for developers to have to write their own joins.  

If that join turns out to be too much overhead in a given situation, the physical design can be adjusted to deal with it, but the logical design should never be corrupted in advance.
0
 
Jim Dettman (Microsoft MVP/ EE MVE)PresidentCommented:
<<So you're implying that the design should be denormalized, i.e. corrupted, in advance just so that developers can do one less join?  >>

  How is it denormalized?   Do you think because multiple instances of something happen to share the same value, something is denormalized?

 If I take a temperature reading each day for one week and record it, is the design denormalized because all seven days in a row happen to be 80 degrees?  

 I think your prior comment to this last one was the best.

Jim.
0
 
Scott PletcherSenior DBACommented:
>> How is it denormalized?   Do you think because multiple instances of something happen to share the same value, something is denormalized?

  If I take a temperature reading each day for one week and record it, is the design denormalized because all seven days in a row happen to be 80 degrees?  
<<

Seriously?  A path shared by multiple files is not data about the entity.  It could be considered metadata.  Repeating it just wastes disk space and causes consistency errors, as denormalization almost always does.

Temperatures are clearly data and repeating values wouldn't necessarily represent denormalization.

At the core, it takes experience and judgment to know when to normalize larger data out of a table.

If the temperature readings were a large number of bytes, for whatever reason, and they tended to repeat in patterns, I might suggest normalizing those as well to reduce errors and improve consistency.

Getting back to the paths issue, I guarantee you that forcing people to repeatedly enter a full path and file name, and then not verifying it, will result in bad data.  Guaranteed!  That is, instead of:
c:\some long\path\name\to\a file\rebeccas_data\filename.txt
you'll get entry(ies) like:
c:\some long\path\name\to\a file\rebeccasdata\filename.txt --no underscore

With a separate list of valid paths, I can give the users a list to choose from and avoid that.

Also, when I normalize the data, when a new path is created, I can verify that the path actually exists (and, if necessary, allow an override to specify that it should be created even if it doesn't currently exist).  

I can later also easily re-verify each unique path to check for those that have become invalid or no longer exist.

If all the path names are stored directly in a column, just think of all the steps you'd have to go thru to validate those entries and find paths that have been removed.

Admittedly these issues and concerns may not immediately be thought of.  But experience allows one to jump ahead and see these issues before they develop.
0
 
Jim Dettman (Microsoft MVP/ EE MVE)PresidentCommented:
<<Seriously?  A path shared by multiple files is not data about the entity.>>

 Hum....well then give me an example of a file without a path.

Jim.
0
 
Scott PletcherSenior DBACommented:
>>
<<Seriously?  A path shared by multiple files is not data about the entity.>>
  Hum....well then give me an example of a file without a path.
<<

As I noted, it's metadata, not data.

What about the other issues?  How do you validate and facilitate entries with no normalized data paths??
0
 
Jim Dettman (Microsoft MVP/ EE MVE)PresidentCommented:
<<Temperatures are clearly data and repeating values wouldn't necessarily represent denormalization.>>

   You have yet to explain how that's any different then repeating a path in a file table.   Let's flip it around; I place ten files in ten separate directories.  Each path value is now unique.

<<As I noted, it's metadata, not data.>>

 Come on, you made the statement that the path was not about the entity of "a file".  That being the case, then a file should be able to exist without a path, so I'm asking you to give me an example of a file without a path.  Very simple and just as simply, you can't.   You can't have a file without specifying a location, so I would argue that a path is an attribute of a file.   You can't talk about a file without talking about its location.

  All the rest of the stuff your bringing up about validation, having a user choose a valid path, etc, I can do just as easily whether the value for "path" is stored in a separate table or as part of the file table.  

  As I said, your best comment was that we are in disagreement on this and I'm quite sure Mark is tired of the conversation by now and just wants an answer to his question.

  So go ahead, carry on, and see where it ends up.

Jim.
0
 
Scott PletcherSenior DBACommented:
>> You can't have a file without specifying a location, so I would argue that a path is an attribute of a file. <<

A path can exist without a file.  That alone qualifies path as its own entity.  [Indeed, a path must first exist without a file in it.]

Similarly, a course in a college still exists even if no student is currently enrolled it.  The courses are stored separately so course info is not lost when a course is not offered for a given semester.
0
 
Scott PletcherSenior DBACommented:
I'm sure Mark has moved on.  Although new to this activity, he's very clearly interested in doing an actual logical design instead of trying to demonstrate that somehow one isn't needed.
0
 
Mark01Author Commented:
Thank you, Jim Dettman and Scott Pletcher.
0

Featured Post

What does it mean to be "Always On"?

Is your cloud always on? With an Always On cloud you won't have to worry about downtime for maintenance or software application code updates, ensuring that your bottom line isn't affected.

  • 10
  • 6
  • 2
Tackle projects and never again get stuck behind a technical roadblock.
Join Now