The consideration of chosing one table vs two mysql?

HI, I was wondering the general consideration when creating a new table in the db between choosing one table or two.

In my case I need to create table/tables that will store keywords from messages(there are two types of messages type a and b) and the number of
occurrences   of the word in each of the different messages type. also I will need indication for hoe many total words are in in each of the different messages type.

so my question is I can eithe create one table :
word,typeAocc,typeBocc

and than every time I will need to get the total words for each words It will go over the whole table

or two table
1 wordTypeA wordType B

and than I can just count the number of rows for each table which is probably stored somewhere

but I have 2 table

anyway Its might be a stupid question but Ive been wonder about this type of issues for a while

Thanks you

Nura111Asked:
Who is Participating?
 
AngryBinaryCommented:
The way that I would approach it would be to generalize the solution a little bit, and actually go for a 2 or 3 tables...

One table for emails, one for keywords (this is the optional table... if you don't store any metadata about the keyword, then it isn't necessary), and a join table between the two with count data:

Emails
------------------
email_id
email_type ('a' or 'b')
email_text (maybe you don't need this)

Keywords
------------------
word
(you can include an additional column to include variants on the word, if you have similarity and soundex algorithms in place - for instance, "Viagra" and "Vaigra")

KeywordsInEmails
------------------------
email_id
word
occurance

So, you can get the following data from this model:

1. How many times a keyword appears in one email
2. How many times a keyword appears in a type of email
3. How many emails of either type (or total) that contain this word
4. The percentage likelihood that an email containing the keyword will be spam
5. Draw relationships between multiple words in a single email

This means, you can make decisions about which words bias towards spam, which words combined may indicate spam, the frequency of certain words in spam, and whether or not you've collected enough data to draw a statistically significant conclusion.

And, the model is extensible if you need to create different classifications of emails (i.e., there's outright spam, there's outright non-spam, and then there's opt-in sort of stuff which can go either way). Or, incorporate additional data by extending the model, such as what links are present in an email and if they match any blacklists.
0
 
merwetta1Commented:
I believe you want the one table approach. You would have the table indexed by the word, then whenever you needed to update the count, you could do that with something like this:

update table_name set TypeAocc = TypeAocc + 5 where word = 'theword'

The 2 table approach would force you to do separate queries on the tables when it came time to retrieve values...a JOIN wouldn't work, because there's no guarantee both tables would have all the same words. So the one table approach seems better for your needs.
0
 
AngryBinaryCommented:
I think this particular data modeling decision can be better informed by how you intend to use that data. Since it looks like your implementing a intelligence tool (something that aggregates information that you use to make decisions about other things) the typical rules about normalization go out the window (since by counting something from another record and storing it elsewhere, you have already de-normalized the data), and you want to focus primarily on what information you are interested in, and what is the most efficient way to retrieve it.

So, what is the high-level question are you trying to answer with this information? The low-level question is, of course, what you have already indicated - how many times did word XYZ appear in message type A and message type B. A high-level question is the the one that lead you to start collecting this data, something like "Do people mention 'toilet paper' more on AIM or on Skype?"
0
Improve Your Query Performance Tuning

In this FREE six-day email course, you'll learn from Janis Griffin, Database Performance Evangelist. She'll teach 12 steps that you can use to optimize your queries as much as possible and see measurable results in your work. Get started today!

 
Nura111Author Commented:
merwetta1:the problem was when i want to get the total number of words for each type.
So I will have to do something like select count(*) where typeAocc > 0

And I have to do that a lot and I thought that every time it will have to go throw all the table to check that and that its not efficient (but I dont understand a lot in that subject so im just assuming)

AngryBinary: basically what im doing is storing keyword from spam and non spam emails. and than every time I get an email I'm performing a Bayesian spam filtering so I will check the probability for each keyword from spam and non spam email so I will need  the TypeAocc and TypeBocc for each keyword.
And now I relized that I need spmehow to store the number of emails I checked (I thought It will be enoufh the number of keywords but its not)

so to add to my question how can I store that as well because it doesnt make sense to add this field to every row right? than everytime I change that I will need ot change it for every row
0
 
Nura111Author Commented:
ok now im confused I already have the emails in an exsiting table with other inof but I dont want a realition to an email beside the number emails counted for every group spam/non spam)
0
 
AngryBinaryCommented:
The model does give you the number of emails of a particular type that a word appears in:

select count(1)
from keywordsinemails k
inner join emails e
on e.email_id = k.email_id
where email_type = 'spam'
and k.word = 'vacation'

To get the frequency of the word in spam emails (the average number of times it appears in each email where it does appear), it would be:

select avg(occurance)
from keywordsinemails k
inner join emails e
on e.email_id = k.email_id
where email_type = 'spam'
and k.word = 'vacation'

Basically, you get more information from your data at minimal additional cost, the cost being whatever is required to capture the number of occurrences of each word in each email when you build your data set. If you don't want it, then leave that out.
0
 
Nura111Author Commented:
okay Thank you but if I don't want the relation to the emails is it make sense to do one table

keywords:
word,spampcc,nonspamocc,totalspamemail,totalnonspamemail

but the totals will be the same number each row.

The thing is I have the word count I want to insert the data inot the db and than update it for every email

It doesnt make sense?




0
 
AngryBinaryCommented:
No, it doesn't make sense to put all the keywords in the same column as the email.

In order to search a large number of records, your database will have to do a full table scan and a string search through every single record.

If the words are kept in a separate table, one word per row in an indexed column, searching will be much more efficient - queries will run many times faster, in fact.

I don't understand the second part of your question - I think you're typing too fast.
0
 
Nura111Author Commented:
<<No, it doesn't make sense to put all the keywords in the same column as the email.  
which email? Im not saving the email just a number of the total emails that were used

what I mean I don't have the information for every keyword from each email its retrieved.

So I cant do it the way you suggested considering that what is the best way to your opinion?

Thank you.
0
 
Nura111Author Commented:
and their is still one word per row  in the one table I suggested
0
 
AngryBinaryCommented:
When I talk about "the email" in the database, I'm not necessarily talking about the message. I just mean a record in the database that identifies that an email exists somewhere with an arbitrary ID, so that we can use that ID to relate it to other things. Identifying individual emails provides a higher degree of granularity to the data so that the data can be sliced different ways to provide different sorts of metrics.

You seem to like the idea of storing the data in a single table with the columns:

     word | count A | count B

... so, why not do that? The model I presented is how I would have approached the problem; I work a great deal in business intelligence and reporting, and that model provides the most possible contexts for analyzing the data.

But, if it's just the single context that you are interested in, and you are reasonably certain that it's the only one you'll ever need from the data, then the single table approach works, especially if it's the one that most closely aligns with your understanding of the information that you're storing.
0
 
Nura111Author Commented:
yes but the problem is its need to be  word | count A | count B |totalA |totalb

and its just seem stupid to me now (that totalA totalb are the same number in each row and I will need to update them in the all table everytime they r changing
0
 
AngryBinaryCommented:
TotalA and totalB are what exactly? If they change based on the data, they are calculated values and shouldn't be stored, they should be calculated when you need them. If its a sum of all counts, then you can simply ask the database for the sum of all the numbers in the "count" column.
0
 
Nura111Author Commented:
Are the total number of the emails that was checked for the keyword. I cant calculate after I need to save it when the keyword is found.
0
 
Nura111Author Commented:
Ok so what do you think about one table as you suggested
Emails
------------------
email_id
email_type ('a' or 'b')
email_text (maybe you don't need this)
keywords (1/0) indicate the email is checked for keywords

KeywordsInEmails
------------------------
word
occuranceA
occuranceB

------------------------
and than If I want to see the total number of email checked for keyword  im checking  select count(*) from Emails where type = a AND keywords =1.

but I techenically dont have any relation between emails table and keywords

Is that make sense?


0
 
AngryBinaryCommented:
If that's the only information you are interested in, then I guess it makes sense. But, it seems like you are arbitrarily limiting the information that you can derive from your data by failing to maintain a relationship between the email and the keyword.
0
 
Nura111Author Commented:
ok I dont see any reason why I want to know from which email id the keyword came from (beside if its spam or not) Do you?
and in the way that im initaily counting the words im counting it as it was one text (the whole emails together)
so I dont have that inofrmation
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.